My wishlist for 2026: Anthropic / OpenAI expose “how compaction is executed” to plugin authors for their CLI tools.
This technique should be something you could swap in for whatever Claude Code bakes in — but I don’t think the correct hooks or functionality is exposed.
It should be noted that OpenAI now has a specific compaction API which returns opaque encrypted items. This is, AFAICT, different from deciding when to compact, and many open source tools should indeed be inspectable in that regard.
There are no concepts in this blog post. It is the author's opinions in the form of a pseudo-Erlang program with probabilities. If one reads it like it is a program, you realize that the underlying core has been obfuscated by implementation details.
I'm looking for "the Emacs" of whatever this is, and I haven't read a blog post which isolates the design yet.
The article seems to be about fun, which I'm all for, and I highly appreciate the usage of MAKER as an evaluation task (finally, people are actually evaluating their theories on something quantitative), but the messaging here seems inherently contradictory:
> Gas Town helps with all that yak shaving, and lets you focus on what your Claude Codes are working on.
Then:
> Working effectively in Gas Town involves committing to vibe coding. Work becomes fluid, an uncountable that you sling around freely, like slopping shiny fish into wooden barrels at the docks. Most work gets done; some work gets lost. Fish fall out of the barrel. Some escape back to sea, or get stepped on. More fish will come. The focus is throughput: creation and correction at the speed of thought.
I see -- so where exactly is my focus supposed to sit?
As someone who sits comfortably in the "Stage 8" category that this article defines, my concern has never been throughput; it has always been about retaining a high degree of quality while organizing work so that, when context switching occurs, it transitions me to near-orthogonal tasks which are easy to remember, so I can give high-quality feedback before switching again.
For instance, I know Project A -- these are the concerns of Project A. I know Project B -- these are the concerns of Project B. I have the insight to design these projects so they compose, so I don't have to keep track of a hundred parallel issues in a mono Project C.
On each of those projects, run a single agent -- with review gates for 2-3 independent agents (fresh context, different models! Codex and Gemini). Use a loop, let the agents go back and forth.
This works and actually gets shit done. I'm not convinced that 20 Claudes or massively parallel worktrees or whatever improves on quality, because, indeed, I always have to intervene at some point. The blocker for me is not throughput, it's me -- a human being -- my focus, and the random points of intervention which ... by definition ... occur stochastically (because agents).
Finally:
> Opus 4.5 can handle any reasonably sized task, so your job is to make tasks for it. That’s it.
This is laughably not true, for anyone who has used Opus 4.5 for non-trivial tasks. Claude Code constantly gives up early, corrupts itself with self-bias, the list goes on and on. It's getting better, but it's not that good.
A response like this is confusing to me. What you are saying makes sense, but seems irrelevant. Something like Gas Town is clearly not attempting to be a production-grade tool. It's an opinionated glimpse into the future. I think the aesthetic was fitting and intentional.
This is the equivalent of some crazy inventor in the 19th century strapping a steam engine onto a unicycle and telling you that someday you'll be able to go 100 mph on a bike. He was right in the end, but no one is actually going to build something usable with current technology.
Opus 4.5 isn't there. But will there be a model in 3-5 years that's smart enough, fast enough, and cheap enough for a refined vision of this to be possible? I'm going to bet on yes to that question.
> something like gas town is clearly not attempting to be a production grade tool.
Compare to the first two sentences:
> Gas Town is a new take on the IDE for 2026. Gas Town helps you with the tedium of running lots of Claude Code instances. Stuff gets lost, it’s hard to track who’s doing what, etc. Gas Town helps with all that yak shaving, and lets you focus on what your Claude Codes are working on.
Compared to your read, my read is confused: is it or is it not intending to be a useful tool (we can debate "production" quality, here I'm just thinking something I'd actually use meaningfully -- like Claude Code)?
I think the author wants us to take this post seriously, so I'm taking it seriously, and my critique in the original post was a serious reaction.
... no one ever used crypto to buy things. Most engineers are currently already using AI. Such a dumb comparison that really just doesn't pass the sniff test.
People use crypto all the time to buy dollars. That's its main purpose: spend sanctioned rubles to buy crypto to buy dollars; use ransomware to coercively obtain crypto to buy dollars, etc.
Inside scoop: the pub group who owned that pub (still going, owns four in Cambridge and environs) was cofounded by Steve Early, a Cambridge computer scientist who wrote his own POS software, so it was very much a case of "yeah, that sounds like fun, I'll add it". (Until tax and primary rate risk made it not fun, so it was removed.)
For anyone who takes doing their taxes seriously, this is a nightmare. Every pint ordered involves a capital gain (or loss) for the buyer. At a certain point you're doing enough accounting that you might as well be running the bar yourself (or just paying in cash)!
Meanwhile here I am at stage 0. I work on several projects where we are contractually obliged to not use any AI tools, even self-hosted ones. And AFAIK there's now a growing niche of mostly government projects with strict no-AI policy.
I’m luckily in a situation where I can afford to explore this stuff without the concerns that come from using it within an organization (and those concerns are 100% valid and haven’t been solved yet, especially not by this blog post).
> For instance, I know Project A -- these are the concerns of Project A. I know Project B -- these are the concerns of Project B. I have the insight to design these projects so they compose, so I don't have to keep track of a hundred parallel issues in a mono Project C. On each of those projects, run a single agent -- with review gates for 2-3 independent agents (fresh context, different models! Codex and Gemini). Use a loop, let the agents go back and forth.
Can you talk more about the structure of your workflow and how you evolved it to be that?
I've tried most of the agentic "let it rip" tools. I quickly realized that GPT 5~ was significantly better at reasoning and more exhaustive than Claude Code (Opus, RL-finetuned for Claude Code).
"What if Opus wrote the code, and GPT 5~ reviewed it?" I started evaluating this question, and started to get higher quality results and better control of complexity.
I could also trust this process to a greater degree than my previous process of trying to drive Opus, look at the code myself, try to drive Opus again, etc. Codex was catching bugs I would not catch in the same amount of time, including bugs in hard math -- so I started having a great degree of trust in its reasoning capabilities.
It's a Claude Code plugin -- it combines the "don't let Claude stop until condition" (Stop hook) with a few CLI tools to induce (what the article calls) review gates: Claude will work indefinitely until the reviewer is satisfied.
In this case, the reviewer is a fresh Opus subagent which can invoke and discuss with Codex and Gemini.
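For anyone curious what the gate looks like mechanically, here is a minimal sketch of the Stop-hook side, assuming the documented Claude Code hook contract (JSON event on stdin; printing a {"decision": "block"} response keeps the session going). The run_reviewers.sh script and the APPROVE convention are hypothetical stand-ins for the Codex/Gemini review step, not the author's actual plugin:

    #!/usr/bin/env python3
    # Minimal sketch of a Stop hook acting as a review gate.
    # Assumes the documented hook contract: a JSON event arrives on stdin, and
    # printing {"decision": "block", "reason": ...} tells Claude Code to keep working.
    # run_reviewers.sh and the APPROVE convention are hypothetical stand-ins.
    import json
    import subprocess
    import sys

    event = json.load(sys.stdin)

    # If this stop was already forced by the hook, let it through to avoid infinite loops.
    if event.get("stop_hook_active"):
        sys.exit(0)

    # Hypothetical: ask a fresh reviewer (e.g. a Codex/Gemini wrapper script) for a verdict.
    verdict = subprocess.run(["./run_reviewers.sh"], capture_output=True, text=True).stdout.strip()

    if verdict.startswith("APPROVE"):
        sys.exit(0)  # reviewer satisfied; allow Claude to stop

    # Otherwise block the stop and hand the review comments back to Claude.
    print(json.dumps({"decision": "block", "reason": "Reviewer is not satisfied yet:\n" + verdict}))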
One perspective I have which relates to this article is that the thing one wants to optimize for is minimizing the error per unit of work. If you have a dynamic programming style orchestration pattern for agents, you want the thing that solves the small unit of work (a task) to have as low error as possible, or else I suspect the error compounds quickly with these stochastic systems.
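A toy back-of-the-envelope version of that point (probabilities entirely made up): if each unit of work independently comes out right with probability p, a chain of n units comes out right with roughly p^n, which decays fast.

    # Toy illustration of error compounding across chained agent tasks.
    # The probabilities are invented; the point is the shape of p**n.
    for p in (0.99, 0.95, 0.90):
        for n in (5, 20, 50):
            print(f"per-task success {p:.2f}, {n:2d} chained tasks -> {p ** n:.3f}")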
I'm trying this stuff for fairly advanced work (in a PhD), so I'm dogfooding ideas (like the ones presented in this article) in complex settings. I think there is still a lot of room to learn here.
I'm sure we're just working with the same tools, thinking through the same ideas. Just curious if you've seen my newsletter/channel @enterprisevibecode https://www.enterprisevibecode.com/p/let-it-rip
No, you weren't clear, nor are you correct: you shared FUD about something it seems you have not tried, because testing your claims with a recent agentic system would dispel them.
I've had great success teaching Claude Code to use DSLs I've created in my research. Trivially, it has never seen exactly these DSLs before -- yet it has correctly created complex programs using those DSLs, and indeed -- they work!
Have you had frontier agents work on programs in "esoteric" (unpopular) languages (pick: Zig, Haskell, Lisp, Elixir, etc)?
I don't see clarity, and I'm not sure if you've tried any of your claims for real.
If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities!
Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).
Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.
Piggybacking on this post. Codex is not only finding much higher quality issues, it’s also writing code that usually doesn’t leave quality issues behind. Claude is much faster but it definitely leaves serious quality issues behind.
So much so that now I rely completely on Codex for code reviews and actual coding. I will pick higher quality over speed every day. Please don’t change it, OpenAI team!
Every plan Opus creates in Planning mode gets run through ChatGPT 5.2. It catches at least 3 or 4 serious issues that Claude didn’t think of. It typically takes 2 or 3 back-and-forths for Claude to ultimately get it right.
I’m in Claude Code so often (x20 Max) and I’m so comfortable with my environment setup with hooks (for guardrails and context) that I haven’t given Codex a serious shot yet.
The same thing can be said about Opus running through Opus.
It's often not that a different model is better (well, it still has to be a good model). It's that the different chat has a different objective - and will identify different things.
My (admittedly one person's anecdotal) experience has been that when I ask Codex and Claude to make a plan/fix and then ask them both to review it, they both agree that Codex's version is better quality. This is on a 140K LOC codebase with an unreasonable amount of time spent on rules (lint, format, commit, etc), on specifying coding patterns, on documenting per workspace README.md, etc.
That's a fair point and yet I deeply believe Codex is better here. After finishing a big task, I used two fresh instances of Claude and Codex to review it. Codex finds more issues in ~9 out of 10 cases.
While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.
Every time Claude Code finishes a task, I have it run a full review of its own work against a very detailed plan, and it catches many things it didn’t see before. It works well and it’s part of the process of refinement. We all know it’s almost never a 100% hit on the first try with big chunks of generated code.
It depends on the task, but I have different Claude commands that have this role; usually I launch them from the same session. The command's goal is to do an analysis and generate an md file that I can then execute with a specific command, with the md as parameter. It works quite well. The generated file is a thorough analysis, hundreds of lines with specific coded content. It's more precise than my few-line prompt and helps Claude stay on rails.
Thanks for the tip. I was dubious, but I tried GPT 5.2 first on a large plan and it was way better than reviewing it with Claude itself or Gemini. I then used it to help me with a feature I was reviewing, and it caught real discrepancies between the plan and the actual integration!
I'm happy to pay the same right now for less (on the max plan, or whatever) -- because I'm never running into limits, and I'm running these models near all day every day (as a single user working on my own personal projects).
I consistently run into limits with CC (Opus 4.5) -- but even though Codex seems to be spending significantly more tokens, it just seems like the quota limit is much higher?
I am on the $20 plan for both CC and Codex. I feel like a session of usage on CC == ~20% Codex usage / 5 hours in terms of time spent inferencing. It has always seemed way more generous than I would expect.
Agreed. The $20 plans can go very far when you're using the coding agent as an additional tool in your development flow, not just trying to hammer it with prompts until you get output that works.
Managing context goes a long way, too. I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly.
It is ironic that in the GPT-4 era, when we couldn't see much value in these tools, all we could hear was "skill issues" and "prompt engineering skills".
Now they are actually quite capable for SOME tasks, especially for things we don't really care about learning, and they can, to a certain extent, generalize.
They perform much better than in the GPT-4 era, objectively, across all domains. They perform much better with the absolute minimum input, objectively, across all domains.
If someone skipped the whole "prompt engineering" phase and learned nothing during that time, that person is better equipped to perform well.
Now I wonder how much I am leaving behind by ignoring this whole "skills, tools, MCP this and that, yada yada".
My answer is that the code they generate is still crap, so the new skill is in being able to spot the ways and places it wrote crap code, and how to quickly tell it to refactor to fix specific issues, and still come out ahead on productivity. Nothing like an ultra wide screen monitor (LG 40+) and having parallel codex or claude sessions going, working on a bunch of things at once in parallel. Get good at git worktree. Use them to make tools that make your own life easier that you previously wouldn't even have bothered to make. (chrome extensions and MCPs!)
The other skill is in knowing exactly when to roll up your sleeves and do it the old fashioned way. Which things they're good/useful for, and which things they aren't.
Prompt engineering (communicating with models?) is a foundational skill. Skills, tools, MCPs, etc. are all built on prompts.
My take is that the overlap is strongest with engineering management. If you can learn how to manage a team of human engineers well, that translates to managing a team of agents well.
If I want to continue the same task, I run /compact
If I want to start a new task, I /clear and then tell it to re-read the CLAUDE.md document where I put all of the quick context: Description of the project, key goals, where to find key code, reminders for tools to use, and so on. I aggressively update this file as I notice things that it’s always forgetting or looking up. I know some people have the LLM update their context file but I just do it myself with seemingly better results.
Using /compact burns through a lot of your usage quota and retains a lot of things you may not need. Giving it directions like “starting a new task doing ____, only keep necessary context for that” can help, but hitting /clear and having it re-read a short context primer is faster and uses less quota.
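For what it's worth, the primer doesn't have to be long; a skeleton like the one below is the idea (sections and contents are purely illustrative, not a prescribed format):

    # CLAUDE.md (illustrative skeleton -- adapt to your project)

    ## What this project is
    One paragraph on the project and its current key goals.

    ## Where key code lives
    - core logic: src/...
    - tests: tests/... (and how to run them)

    ## Tools and reminders
    - formatter / linter / test commands the agent should use
    - anything the agent keeps forgetting or re-looking-up (update this aggressively)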
I'm not who you asked, but I do the same thing: I keep important state in doc files and recreate sessions from that state. This allows me to clear context and reconstruct my status on that item. I have a skill that manages this.
Using documents for state helps so much with adding guardrails.
I do wish that ChatGPT had a toggle next to each project file instead of having to delete and reupload to toggle or create separate projects for various combinations of files.
This is why Claude Code / Codex CLI is the way to go for me: often they can recompute the state from the minimal description automatically. If I really do need to stop the session and come back, I can point it at the task file. The project also has docs and scaladocs/javadocs in key places. Good naming and project structure help it very easily find the data it needs without me needing to feed it specific files. I did the "feed it files and copy-paste the code snippet" thing in ChatGPT for months. Wish I went to Claude Code sooner.
I noticed I am not hitting limits either. My guess is OpenAI sees CC as a real competitor/serious threat. Had OAI not given me virtually unlimited use I probably would have jumped ship to CC by now. Burning tons of cash at this stage is likely Very Worth It to maintain "market leader" status if only in the eyes of the media/investors. It's going to be real hard to claw back current usage limits though.
If you look at benchmarks, the Claude models score significantly higher intelligence per token. I'm not sure how that works exactly, but they are offset from the entire rest of the chart on that metric. It seems they need fewer tokens to get the same result. (I can't speak for how that affects performance on very difficult tasks, though, since most of mine are pretty straightforward.)
So if you look at the total cost of running the benchmark, it's surprisingly similar to other models -- the higher price per token is offset by the significantly fewer tokens required to complete a task.
See "Cost to Run Artificial Analysis Index" and "Intelligence vs Output Tokens" here
...With the obligatory caveat that benchmarks are largely irrelevant for actual real world tasks and you need to test the thing on your actual task to see how well it does!
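To make the price-per-token vs. tokens-per-task arithmetic concrete (numbers invented purely for illustration): a model that charges 3x more per token but finishes the task in a third of the tokens costs about the same per task.

    # Invented numbers, purely to illustrate price-per-token vs tokens-per-task.
    def task_cost(price_per_mtok_usd, output_tokens):
        return price_per_mtok_usd * output_tokens / 1_000_000

    print(task_cost(price_per_mtok_usd=15.0, output_tokens=20_000))  # 0.30
    print(task_cost(price_per_mtok_usd=5.0, output_tokens=60_000))   # 0.30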
I don't understand why not. People pay for quality all the time, and often they're begging to pay for quality, it's just not an option. Of course, it depends on how much more quality is being offered, but it sounds like a significant amount here.
I wonder how much their revenue really ends up contributing towards covering their costs.
In my mind, they're hardly making any money compared to how much they're spending, and are relying on future modeling and efficiency gains to be able to reduce their costs but are pursuing user growth and engagement almost fully -- the more queries they get, the more data they get, the bigger a data moat they can build.
It almost certainly is not. Until we know what the useful life of NVIDIA GPUs is, it's impossible to determine whether this is profitable or not.
The depreciation schedule isn't as big a factor as you'd think.
The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high. You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
And datacenter GPUs have been running inference workloads for years now, so companies have a good idea of rates of failure and obsolescence. They're not throwing away two-year-old chips.
> The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high.
How do you know this?
> You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
You can't even speculate this spread without knowing even a rough idea of cost-per-token. Currently, it's total paper math on what the cost-per-token is.
> And datacenter GPUs have been running inference workloads for years now,
And inference resource intensity is a moving target. A new model could come out that requires 2x the amount of resources.
> They're not throwing away two-year-old chips.
Maybe, but they'll be replaced when either (a) a higher-performance GPU can deliver the same results with less energy, less physical density, and less cooling, or (b) the extended support costs become financially untenable.
>> "In my mind, they're hardly making any money compared to how much they're spending"
> everyone seems to assume this, but it's not like it's a company run by dummies, or has dummy investors.
It has nothing to do with their management or investors being "dummies" but the numbers are the numbers.
OpenAI has data center rental costs approaching $620 billion, which are expected to rise to $1.4 trillion by 2033.
Annualized revenue is expected to be "only" $20 billion this year.
$1.4 trillion is 70x current revenue.
So unless they execute their strategy perfectly, hit all of their projections, and neither the stock market nor the economy collapses, making a profit in the foreseeable future is highly unlikely.
They are drowning in debt and go into more and more ridiculous schemes to raise/get more money.
--- start quote ---
OpenAI has made $1.4 trillion in commitments to procure the energy and computing power it needs to fuel its operations in the future. But it has previously disclosed that it expects to make only $20 billion in revenues this year. And a recent analysis by HSBC concluded that even if the company is making more than $200 billion by 2030, it will still need to find a further $207 billion in funding to stay in business.
--- end quote ---
To me it seems that they're banking on it becoming indispensable. Right now I could go back to pre-AI and be a little disappointed but otherwise fine. I figure all of these AI companies are in a race to make themselves part of everyone's core workflow in life, like clothing or a smart phone, such that we don't have much of a choice as to whether we use it or not - it just IS.
That's what the investors are chasing, in my opinion.
It'll never be literally indispensable, because open models exist - either served by third-party providers, or even run locally in a homelab setup. A nice thing that's arguably unique about the latter is that you can trade scale for latency - you get to run much larger models on the same hardware if they can chug on the answer overnight (with offload to a fast SSD for bulk storage of parameters and activations) instead of answering on the spot. Large providers don't want to do this, because keeping your query's activations around is just too expensive when scaled to many users.
Second this but for the chat subscription. Whatever they did with 5.2 compared to 5.0 in ChatGPT increased the test-time compute and the quality shows. If only they would allow more tokens to be submitted in one prompt (it's currently capped at 46k for Plus). I don't touch Gemini 3.0 Pro now (am also subbed there) unless I need the context length.
(unrelated, but piggybacking on requests to reach the teams)
If anyone from OpenAI or Google is reading this, please continue to make your image editing models work with the "previz-to-render" workflow.
Image edits should strongly infer pose and blocking as an internal ControlNet, but should be able to upscale low-fidelity mannequins, cutouts, and plates/billboards.
OpenAI kicks ass at this (but could do better with style controls - if I give a Midjourney style ref, use it).
Absolutely second this. I'm mainly a Claude Code user, but I have Codex running in another tab for code reviews, and it's absolutely killer at analyzing flows and finding subtle bugs.
Do you think that for someone who only needs careful, methodical identification of “problems” occasionally, like a couple of times per day, the $20/month plan gets you anywhere, or do you need the $200 plan just to get access to this?
I've had the $20/month plan for a few months alongside a Max subscription to Claude; the cheap Codex plan goes a really long way. I use it a few times a day for debugging, finding bugs, and reviewing my work. I've run out of usage a couple of times, but only when I lean on it way more than I should.
I only ever use it on the high reasoning mode, for what it's worth. I'm sure it's even less of a problem if you turn it down.
Listening to Dario at the NYT DealBook summit, and reading between the lines a bit, it seems like he is basically saying Anthropic is trying to be a responsible, sustainable business and charging customers accordingly, and insinuating that OpenAI is being much more reckless, financially.
I think it's difficult to estimate how profitable both are - depends too much on usage and that varies so much.
I think it is widely accepted that Anthropic is doing very well in enterprise adoption of Claude Code.
In most of those cases that is paid via API key not by subscription so the business model works differently - it doesn't rely on low usage users subsidizing high usage users.
OTOH OpenAI is way ahead on consumer usage - which also includes Codex even if most consumers don't use it.
I don't think it matters - just make use of the best model at the best price. At the moment Codex 5.2 seems best at the mid-price range, while Opus seems slightly stronger than Codex Max (but too expensive to use for many things).
It's annoying though because it keeps (accurately) pointing out critical memory bugs that I clearly need to fix rather than pretending they aren't there. It's slowing me down.
Love it when it circles around a minor issue that I clearly described as temporary hack instead of recognizing the tremendously large gaping hole in my implementation right next to it.
Completely agreed. Used claude and codex both on highest tier next to each other for a month. On complex tasks where Claude would get stuck and not be able to fix it at all, codex would fix the issue in one go. Codex is amazing.
I did find some slip-ups in 5.2, though: in a refactor of a client header I removed two header properties, but 5.2 forgot to remove those from the toArray method of the class.
Was using 5.2 on medium (default).
Agree. Codex just read my source code for a toy Lisp I wrote in ARM64 assembly, learned how to code in that Lisp, and wrote a few demo programs for me. That was impressive enough. Then it spent some time and effort to really hunt down some problems -- there was a single bit-mask error in my garbage collector that wasn't showing up until then. I was blown away. It's the kind of thing I would have spent forever trying to figure out before.
I've been writing a little port of the seL4 OS kernel to Rust, mostly as a learning exercise. I ran into a weird bug yesterday where some of my code wasn't running - qemu was just exiting. And I couldn't figure out why.
I asked codex to take a look. It took a couple minutes, but it managed to track the issue down using a bunch of tricks I've never seen before. I was blown away. In particular, it reran qemu with different flags to get more information about a CPU fault I couldn't see. Then got a hex code of the instruction pointer at the time of the fault, and used some tools I didn't know about to map that pointer to the lines of code which were causing the problem. Then took a read of that part of the code and guessed (correctly) what the issue was. I guess I haven't worked with operating systems much, so I haven't seen any of those tricks before. But, holy cow!
It's tempting to just accept the help and move on, but today I want to go through what it did in detail, including all the tools it used, so I can learn to do the same thing myself next time.
Interestingly it found a GC bug in my toy Lisp that I wrote in Z80 assembly almost 30 years ago. This kind of work appears to be more common than you'd think!
Agreed, I'm surprised how much care the "extra high" reasoning allows. It easily catches bugs in code other LLMs won't, and using it to review Opus 4.5 is highly effective.
Exactly. This is why the workflow of consulting Gemini/Codex for architecture and overall plan, and then have Claude implement the changes is so powerful.
If by "just breaks" means "refuses to write code / gives up or reverts what it does" -- yes, I've experienced that.
Experiencing that repeatedly motivated me to use it as a reviewer (which another commenter noted), a role which it is (from my experience) very good at.
I basically use it to drive Claude Code, which will nuke the codebase with abandon.
Claude Code was a big jump for me. Another large-ish jump was multi-agents and following the tips from Anthropic’s long running harnesses post.
I don’t go into Claude without everything already set up. Codex helps me curate the plan and the issue tracker (one instance). Claude gets a command to fire up into context, grab an issue, and implement it, and then Codex and Gemini review independently.
I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.
This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.
I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge already between “vanilla agents” and something fancy with just raw agents, and I’m not sure if just the addition of more rigorous static tooling (beyond the compiler) closes it.
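If it helps anyone picture it, a rough sketch of the shape of that loop is below. The claude -p (print mode) and codex exec (non-interactive) invocations are assumptions about the two CLIs, and pick_open_issue() plus the APPROVE convention are hypothetical glue, not features of either tool; the real setup runs the gate inside a Claude session via hooks rather than an outer script.

    # Rough sketch of the implement -> independent-review loop described above.
    # CLI invocations and the APPROVE convention are assumptions, not real features.
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    def pick_open_issue():
        # Hypothetical: pull the next item from your issue tracker / task file.
        return "Implement retry logic for the fetcher (see issues/0042.md)"

    issue = pick_open_issue()
    feedback = ""
    for _ in range(5):  # cap the back-and-forth
        run(["claude", "-p",
             f"Implement this issue:\n{issue}\n\nReviewer feedback so far:\n{feedback}"])
        review = run(["codex", "exec",
                      f"Review the latest changes for this issue:\n{issue}\n"
                      "Reply APPROVE if correct and complete, otherwise list the problems."])
        if review.strip().startswith("APPROVE"):
            break
        feedback = review  # feed objections into the next implementation round

A second reviewer (Gemini) would slot in as another run([...]) call with its own verdict, which is roughly where the diversity-of-viewpoints benefit seems to come from.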
If you're maxing out the plans across the platforms, that's 600 bucks -- but if you think about your usage and optimize, I'm guessing somewhere between 200-600 dollars per month.
It's pretty easy to hit a couple hundred dollars a day filling up Opus's context window with files. This is via Anthropic API and Zed.
Going full speed ahead building a Rails app from scratch it seemed like I was spending $50/hour, but it was worth it because the App was finished in a weekend instead of weeks.
I can't bear to go in circles with Sonnet when Opus can just one shot it.
Anthropic via Azure has sent me an invoice of around $8000 for 3-5 days of Opus 4.1 usage, and there is no way to track how many tokens were used during those days, how many were cached, etc. (And I thought it was part of the Azure sponsorship, but that's another story.)
That's only part of the reason this type of content is used in academic papers. The other part is that you never know which PhD student / postdoc / researcher will be reviewing your paper, which means you are incentivized to be liberal with citations (however tangential), just in case someone reading your paper has the reaction "why didn't they cite this work, in which I had some role?"
Papers with a fake air of authority are easily dispatched. What is not so easily dispatched is the politics of the submission process.
This type of content is fundamentally about emotions (in the reviewer of your paper), and emotion is undeniably a large factor in acceptance / rejection.
Indeed. One can even game review systems by leaving errors in for the reviewers to find so that they feel good about themselves and that they've done their job. The meta-science game is toxic and full of politics and ego-pleasing.