My wishlist for 2026: Anthropic / OpenAI expose “how compaction is executed” to plugin authors for their CLI tools.
This technique should be something you could swap in for whatever Claude Code bakes in — but I don’t think the correct hooks or functionality is exposed.
It should be noted that OpenAI now has a specific compaction API which returns opaque encrypted items. This is, AFAICT, different from deciding when to compact, and many open source tools should indeed be inspectable in that regard.
There are no concepts in this blog post. It is the author's opinions in the form of a pseudo-Erlang program with probabilities. If one reads it like it is a program, you realize that the underlying core has been obfuscated by implementation details.
I'm looking for "the Emacs" of whatever this is, and I haven't read a blog post which isolates the design yet.
The article seems to be about fun, which I'm all for, and I highly appreciate the usage of MAKER as an evaluation task (finally, people are actually evaluating their theories on something quantitative), but the messaging here seems inherently contradictory:
> Gas Town helps with all that yak shaving, and lets you focus on what your Claude Codes are working on.
Then:
> Working effectively in Gas Town involves committing to vibe coding. Work becomes fluid, an uncountable that you sling around freely, like slopping shiny fish into wooden barrels at the docks. Most work gets done; some work gets lost. Fish fall out of the barrel. Some escape back to sea, or get stepped on. More fish will come. The focus is throughput: creation and correction at the speed of thought.
I see -- so where exactly is my focus supposed to sit?
As someone who sits comfortably in the "Stage 8" category that this article defines, my concern has never been throughput; it has always been about retaining a high degree of quality while organizing work so that, when context switching occurs, it transitions me to near-orthogonal tasks which are easy to remember, so I can give high-quality feedback before switching again.
For instance, I know Project A -- these are the concerns of Project A. I know Project B -- these are the concerns of Project B. I have the insight to design these projects so they compose, so I don't have to keep track of a hundred parallel issues in a mono Project C.
On each of those projects, run a single agent -- with review gates for 2-3 independent agents (fresh context, different models! Codex and Gemini). Use a loop, let the agents go back and forth.
This works and actually gets shit done. I'm not convinced that 20 Claudes or massively parallel worktrees or whatever improves on quality, because, indeed, I always have to intervene at some point. The blocker for me is not throughput, it's me -- a human being -- my focus, and the random points of intervention which ... by definition ... occur stochastically (because agents).
Finally:
> Opus 4.5 can handle any reasonably sized task, so your job is to make tasks for it. That’s it.
This is laughably not true, for anyone who has used Opus 4.5 for non-trivial tasks. Claude Code constantly gives up early, corrupts itself with self-bias, the list goes on and on. It's getting better, but it's not that good.
A response like this is confusing to me. What you are saying makes sense, but seems irrelevant. Something like Gas Town is clearly not attempting to be a production-grade tool. It's an opinionated glimpse into the future. I think the aesthetic was fitting and intentional.
This is the equivalent of some crazy inventor in the 19th century strapping a steam engine onto a unicycle and telling you that someday you'll be able to go 100 mph on a bike. He was right in the end, but no one is actually going to build something usable with current technology.
Opus 4.5 isn't there. But will there be a model in 3-5 years that's smart enough, fast enough, and cheap enough for a refined vision of this to be possible? I'm going to bet on yes to that question.
> something like gas town is clearly not attempting to be a production grade tool.
Compare to the first two sentences:
> Gas Town is a new take on the IDE for 2026. Gas Town helps you with the tedium of running lots of Claude Code instances. Stuff gets lost, it’s hard to track who’s doing what, etc. Gas Town helps with all that yak shaving, and lets you focus on what your Claude Codes are working on.
Compared to your read, my read is confused: is it or is it not intending to be a useful tool (we can debate "production" quality, here I'm just thinking something I'd actually use meaningfully -- like Claude Code)?
I think the author wants us to take this post seriously, so I'm taking it seriously, and my critique in the original post was a serious reaction.
... no one ever used crypto to buy things. Most engineers are currently already using AI. Such a dumb comparison that really just doesn't pass the sniff test.
People use crypto all the time to buy dollars. That's its main purpose: spend sanctioned rubles to buy crypto to buy dollars; use ransomware to coercively obtain crypto to buy dollars, etc.
Inside scoop: the pub group who owned that pub (still going, owns four in Cambridge and environs) was cofounded by Steve Early, a Cambridge computer scientist who wrote his own POS software, so it was very much a case of "yeah, that sounds like fun, I'll add it". (Until tax and primary rate risk made it not fun, so it was removed.)
For anyone who takes doing their taxes seriously, this is a nightmare. Every pint ordered involves a capital gain (or loss) for the buyer. At a certain point you're doing enough accounting that you might as well be running the bar yourself (or just paying in cash)!
Meanwhile here I am at stage 0. I work on several projects where we are contractually obliged to not use any AI tools, even self-hosted ones. And AFAIK there's now a growing niche of mostly government projects with strict no-AI policy.
I’m luckily in a situation where I can afford to explore this stuff without the concerns that come from using it within an organization (and those concerns are 100% valid and haven’t been solved yet, especially not by this blog post).
> For instance, I know Project A -- these are the concerns of Project A. I know Project B -- these are the concerns of Project B. I have the insight to design these projects so they compose, so I don't have to keep track of a hundred parallel issues in a mono Project C. On each of those projects, run a single agent -- with review gates for 2-3 independent agents (fresh context, different models! Codex and Gemini). Use a loop, let the agents go back and forth.
Can you talk more about the structure of your workflow and how you evolved it to be that?
I've tried most of the agentic "let it rip" tools. I quickly realized that GPT 5~ was significantly better at reasoning and more exhaustive than Claude Code (Opus, RL-finetuned for Claude Code).
"What if Opus wrote the code, and GPT 5~ reviewed it?" I started evaluating this question, and started to get higher quality results and better control of complexity.
I could also trust this process to a greater degree than my previous process of trying to drive Opus, look at the code myself, try to drive Opus again, etc. Codex was catching bugs I would not catch in the same amount of time, including bugs in hard math -- so I started having a great degree of trust in its reasoning capabilities.
It's a Claude Code plugin -- it combines the "don't let Claude stop until condition" (Stop hook) with a few CLI tools to induce (what the article calls) review gates: Claude will work indefinitely until the reviewer is satisfied.
In this case, the reviewer is a fresh Opus subagent which can invoke and discuss with Codex and Gemini.
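For anyone curious what the gate looks like mechanically, here is a minimal sketch of the Stop-hook side, assuming the documented Claude Code hook contract (JSON event on stdin; printing a {"decision": "block"} response keeps the session going). The run_reviewers.sh script and the APPROVE convention are hypothetical stand-ins for the Codex/Gemini review step, not the author's actual plugin:

    #!/usr/bin/env python3
    # Minimal sketch of a Stop hook acting as a review gate.
    # Assumes the documented hook contract: a JSON event arrives on stdin, and
    # printing {"decision": "block", "reason": ...} tells Claude Code to keep working.
    # run_reviewers.sh and the APPROVE convention are hypothetical stand-ins.
    import json
    import subprocess
    import sys

    event = json.load(sys.stdin)

    # If this stop was already forced by the hook, let it through to avoid infinite loops.
    if event.get("stop_hook_active"):
        sys.exit(0)

    # Hypothetical: ask a fresh reviewer (e.g. a Codex/Gemini wrapper script) for a verdict.
    verdict = subprocess.run(["./run_reviewers.sh"], capture_output=True, text=True).stdout.strip()

    if verdict.startswith("APPROVE"):
        sys.exit(0)  # reviewer satisfied; allow Claude to stop

    # Otherwise block the stop and hand the review comments back to Claude.
    print(json.dumps({"decision": "block", "reason": "Reviewer is not satisfied yet:\n" + verdict}))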
One perspective I have which relates to this article is that the thing one wants to optimize for is minimizing the error per unit of work. If you have a dynamic programming style orchestration pattern for agents, you want the thing that solves the small unit of work (a task) to have as low error as possible, or else I suspect the error compounds quickly with these stochastic systems.
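A toy back-of-the-envelope version of that point (probabilities entirely made up): if each unit of work independently comes out right with probability p, a chain of n units comes out right with roughly p^n, which decays fast.

    # Toy illustration of error compounding across chained agent tasks.
    # The probabilities are invented; the point is the shape of p**n.
    for p in (0.99, 0.95, 0.90):
        for n in (5, 20, 50):
            print(f"per-task success {p:.2f}, {n:2d} chained tasks -> {p ** n:.3f}")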
I'm trying this stuff for fairly advanced work (in a PhD), so I'm dogfooding ideas (like the ones presented in this article) in complex settings. I think there is still a lot of room to learn here.
I'm sure we're just working with the same tools, thinking through the same ideas. Just curious if you've seen my newsletter/channel @enterprisevibecode https://www.enterprisevibecode.com/p/let-it-rip
No, you weren't clear, nor are you correct: you shared FUD about something it seems you have not tried, because testing your claims with a recent agentic system would dispel them.
I've had great success teaching Claude Code to use DSLs I've created in my research. Trivially, it has never seen exactly these DSLs before -- yet it has correctly created complex programs using those DSLs, and indeed -- they work!
Have you had frontier agents work on programs in "esoteric" (unpopular) languages (pick: Zig, Haskell, Lisp, Elixir, etc)?
I don't see clarity, and I'm not sure if you've tried any of your claims for real.
If anyone from OpenAI is reading this -- a plea to not screw with the reasoning capabilities!
Codex is so so good at finding bugs and little inconsistencies, it's astounding to me. Where Claude Code is good at "raw coding", Codex/GPT5.x are unbeatable in terms of careful, methodical finding of "problems" (be it in code, or in math).
Yes, it takes longer (quality, not speed please!) -- but the things that it finds consistently astound me.
Piggybacking on this post. Codex is not only finding much higher quality issues, it’s also writing code that usually doesn’t leave quality issues behind. Claude is much faster but it definitely leaves serious quality issues behind.
So much so that now I rely completely on Codex for code reviews and actual coding. I will pick higher quality over speed every day. Please don’t change it, OpenAI team!
Every plan Opus creates in Planning mode gets run through ChatGPT 5.2. It catches at least 3 or 4 serious issues that Claude didn’t think of. It typically takes 2 or 3 back-and-forths for Claude to ultimately get it right.
I’m in Claude Code so often (x20 Max) and I’m so comfortable with my environment setup with hooks (for guardrails and context) that I haven’t given Codex a serious shot yet.
The same thing can be said about Opus running through Opus.
It's often not that a different model is better (well, it still has to be a good model). It's that the different chat has a different objective - and will identify different things.
My (admittedly one person's anecdotal) experience has been that when I ask Codex and Claude to make a plan/fix and then ask them both to review it, they both agree that Codex's version is better quality. This is on a 140K LOC codebase with an unreasonable amount of time spent on rules (lint, format, commit, etc), on specifying coding patterns, on documenting per workspace README.md, etc.
That's a fair point and yet I deeply believe Codex is better here. After finishing a big task, I used two fresh instances of Claude and Codex to review it. Codex finds more issues in ~9 out of 10 cases.
While I prefer the way Claude speaks and writes code, there is no doubt that whatever Codex does is more thorough.
Every time Claude Code finishes a task, I have it run a full review of its own work against a very detailed plan, and it catches many things it didn’t see before. It works well and it’s part of the process of refinement. We all know it’s almost never a 100% hit on the first try with big chunks of generated code.
It depends on the task, but I have different Claude commands that have this role; usually I launch them from the same session. The command's goal is to do an analysis and generate an md file that I can then execute with a specific command, with the md as parameter. It works quite well. The generated file is a thorough analysis, hundreds of lines with specific coded content. It's more precise than my few-line prompt and helps Claude stay on rails.
Thanks for the tip. I was dubious, but I tried GPT 5.2 first on a large plan and it was way better than reviewing it with Claude itself or Gemini. I then used it to help me with a feature I was reviewing, and it caught real discrepancies between the plan and the actual integration!
I'm happy to pay the same right now for less (on the max plan, or whatever) -- because I'm never running into limits, and I'm running these models near all day every day (as a single user working on my own personal projects).
I consistently run into limits with CC (Opus 4.5) -- but even though Codex seems to be spending significantly more tokens, it just seems like the quota limit is much higher?
I am on the $20 plan for both CC and Codex. I feel like a session of usage on CC == ~20% Codex usage / 5 hours in terms of time spent inferencing. It has always seemed way more generous than I would expect.
Agreed. The $20 plans can go very far when you're using the coding agent as an additional tool in your development flow, not just trying to hammer it with prompts until you get output that works.
Managing context goes a long way, too. I clear context for every new task and keep the local context files up to date with key info to get the LLM on target quickly.
It is ironic that in the GPT-4 era, when we couldn't see much value in these tools, all we could hear was "skill issues" and "prompt engineering skills".
Now they are actually quite capable for SOME tasks, especially for things we don't really care about learning, and they can, to a certain extent, generalize.
They perform much better than in the GPT-4 era, objectively, across all domains. They perform much better with the absolute minimum input, objectively, across all domains.
If someone skipped the whole "prompt engineering" phase and learned nothing during that time, that person is better equipped to perform well.
Now I wonder how much I am leaving behind by ignoring this whole "skills, tools, MCP this and that, yada yada".
My answer is that the code they generate is still crap, so the new skill is in being able to spot the ways and places it wrote crap code, and how to quickly tell it to refactor to fix specific issues, and still come out ahead on productivity. Nothing like an ultra wide screen monitor (LG 40+) and having parallel codex or claude sessions going, working on a bunch of things at once in parallel. Get good at git worktree. Use them to make tools that make your own life easier that you previously wouldn't even have bothered to make. (chrome extensions and MCPs!)
The other skill is in knowing exactly when to roll up your sleeves and do it the old fashioned way. Which things they're good/useful for, and which things they aren't.
Prompt engineering (communicating with models?) is a foundational skill. Skills, tools, MCPs, etc. are all built on prompts.
My take is that the overlap is strongest with engineering management. If you can learn how to manage a team of human engineers well, that translates to managing a team of agents well.
If I want to continue the same task, I run /compact
If I want to start a new task, I /clear and then tell it to re-read the CLAUDE.md document where I put all of the quick context: Description of the project, key goals, where to find key code, reminders for tools to use, and so on. I aggressively update this file as I notice things that it’s always forgetting or looking up. I know some people have the LLM update their context file but I just do it myself with seemingly better results.
Using /compact burns through a lot of your usage quota and retains a lot of things you may not need. Giving it directions like “starting a new task doing ____, only keep necessary context for that” can help, but hitting /clear and having it re-read a short context primer is faster and uses less quota.
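For what it's worth, the primer doesn't have to be long; a skeleton like the one below is the idea (sections and contents are purely illustrative, not a prescribed format):

    # CLAUDE.md (illustrative skeleton -- adapt to your project)

    ## What this project is
    One paragraph on the project and its current key goals.

    ## Where key code lives
    - core logic: src/...
    - tests: tests/... (and how to run them)

    ## Tools and reminders
    - formatter / linter / test commands the agent should use
    - anything the agent keeps forgetting or re-looking-up (update this aggressively)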
I'm not who you asked, but I do the same thing: I keep important state in doc files and recreate sessions from that state. This allows me to clear context and reconstruct my status on that item. I have a skill that manages this.
Using documents for state helps so much with adding guardrails.
I do wish that ChatGPT had a toggle next to each project file instead of having to delete and reupload to toggle or create separate projects for various combinations of files.
This is why Claude Code / Codex CLI is the way to go for me: often they can recompute the state from the minimal description automatically. If I really do need to stop the session and come back, I can point it at the task file. The project also has docs and scaladocs/javadocs in key places. Good naming and project structure help it very easily find the data it needs without me needing to feed it specific files. I did the "feed it files and copy-paste the code snippet" thing in ChatGPT for months. Wish I went to Claude Code sooner.
I noticed I am not hitting limits either. My guess is OpenAI sees CC as a real competitor/serious threat. Had OAI not given me virtually unlimited use I probably would have jumped ship to CC by now. Burning tons of cash at this stage is likely Very Worth It to maintain "market leader" status if only in the eyes of the media/investors. It's going to be real hard to claw back current usage limits though.
If you look at benchmarks, the Claude models score significantly higher intelligence per token. I'm not sure how that works exactly, but they are offset from the entire rest of the chart on that metric. It seems they need fewer tokens to get the same result. (I can't speak for how that affects performance on very difficult tasks, though, since most of mine are pretty straightforward.)
So if you look at the total cost of running the benchmark, it's surprisingly similar to other models -- the higher price per token is offset by the significantly fewer tokens required to complete a task.
See "Cost to Run Artificial Analysis Index" and "Intelligence vs Output Tokens" here
...With the obligatory caveat that benchmarks are largely irrelevant for actual real world tasks and you need to test the thing on your actual task to see how well it does!
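To make the price-per-token vs. tokens-per-task arithmetic concrete (numbers invented purely for illustration): a model that charges 3x more per token but finishes the task in a third of the tokens costs about the same per task.

    # Invented numbers, purely to illustrate price-per-token vs tokens-per-task.
    def task_cost(price_per_mtok_usd, output_tokens):
        return price_per_mtok_usd * output_tokens / 1_000_000

    print(task_cost(price_per_mtok_usd=15.0, output_tokens=20_000))  # 0.30
    print(task_cost(price_per_mtok_usd=5.0, output_tokens=60_000))   # 0.30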
I don't understand why not. People pay for quality all the time, and often they're begging to pay for quality, it's just not an option. Of course, it depends on how much more quality is being offered, but it sounds like a significant amount here.
I wonder how much their revenue really ends up contributing towards covering their costs.
In my mind, they're hardly making any money compared to how much they're spending, and are relying on future modeling and efficiency gains to be able to reduce their costs but are pursuing user growth and engagement almost fully -- the more queries they get, the more data they get, the bigger a data moat they can build.
It almost certainly is not. Until we know what the useful life of NVIDIA GPUs is, it's impossible to determine whether this is profitable or not.
The depreciation schedule isn't as big a factor as you'd think.
The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high. You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
And datacenter GPUs have been running inference workloads for years now, so companies have a good idea of rates of failure and obsolescence. They're not throwing away two-year-old chips.
> The marginal cost of an API call is small relative to what users pay, and utilization rates at scale are pretty high.
How do you know this?
> You don't need perfect certainty about GPU lifespan to see that the spread between cost-per-token and revenue-per-token leaves a lot of room.
You can't even speculate this spread without knowing even a rough idea of cost-per-token. Currently, it's total paper math on what the cost-per-token is.
> And datacenter GPUs have been running inference workloads for years now,
And inference resource intensity is a moving target. A new model could come out that requires 2x the amount of resources.
> They're not throwing away two-year-old chips.
Maybe, but they'll be replaced when either (a) a higher-performance GPU can deliver the same results with less energy, less physical density, and less cooling, or (b) the extended support costs become financially untenable.
>> "In my mind, they're hardly making any money compared to how much they're spending"
> everyone seems to assume this, but it's not like it's a company run by dummies, or has dummy investors.
It has nothing to do with their management or investors being "dummies" but the numbers are the numbers.
OpenAI has data center rental costs approaching $620 billion, which are expected to rise to $1.4 trillion by 2033.
Annualized revenue is expected to be "only" $20 billion this year.
$1.4 trillion is 70x current revenue.
So unless they execute their strategy perfectly, hit all of their projections, and neither the stock market nor the economy collapses, making a profit in the foreseeable future is highly unlikely.
They are drowning in debt and go into more and more ridiculous schemes to raise/get more money.
--- start quote ---
OpenAI has made $1.4 trillion in commitments to procure the energy and computing power it needs to fuel its operations in the future. But it has previously disclosed that it expects to make only $20 billion in revenues this year. And a recent analysis by HSBC concluded that even if the company is making more than $200 billion by 2030, it will still need to find a further $207 billion in funding to stay in business.
--- end quote ---
To me it seems that they're banking on it becoming indispensable. Right now I could go back to pre-AI and be a little disappointed but otherwise fine. I figure all of these AI companies are in a race to make themselves part of everyone's core workflow in life, like clothing or a smart phone, such that we don't have much of a choice as to whether we use it or not - it just IS.
That's what the investors are chasing, in my opinion.
It'll never be literally indispensable, because open models exist - either served by third-party providers, or even run locally in a homelab setup. A nice thing that's arguably unique about the latter is that you can trade scale for latency - you get to run much larger models on the same hardware if they can chug on the answer overnight (with offload to a fast SSD for bulk storage of parameters and activations) instead of answering on the spot. Large providers don't want to do this, because keeping your query's activations around is just too expensive when scaled to many users.
Second this but for the chat subscription. Whatever they did with 5.2 compared to 5.0 in ChatGPT increased the test-time compute and the quality shows. If only they would allow more tokens to be submitted in one prompt (it's currently capped at 46k for Plus). I don't touch Gemini 3.0 Pro now (am also subbed there) unless I need the context length.
(unrelated, but piggybacking on requests to reach the teams)
If anyone from OpenAI or Google is reading this, please continue to make your image editing models work with the "previz-to-render" workflow.
Image edits should strongly infer pose and blocking as an internal ControlNet, but should be able to upscale low-fidelity mannequins, cutouts, and plates/billboards.
OpenAI kicks ass at this (but could do better with style controls - if I give a Midjourney style ref, use it).
Absolutely second this. I'm mainly a Claude Code user, but I have Codex running in another tab for code reviews, and it's absolutely killer at analyzing flows and finding subtle bugs.
Do you think that for someone who only needs careful, methodical identification of “problems” occasionally, like a couple of times per day, the $20/month plan gets you anywhere, or do you need the $200 plan just to get access to this?
I've had the $20/month plan for a few months alongside a Max subscription to Claude; the cheap Codex plan goes a really long way. I use it a few times a day for debugging, finding bugs, and reviewing my work. I've run out of usage a couple of times, but only when I lean on it way more than I should.
I only ever use it on the high reasoning mode, for what it's worth. I'm sure it's even less of a problem if you turn it down.
Listening to Dario at the NYT DealBook summit, and reading between the lines a bit, it seems like he is basically saying Anthropic is trying to be a responsible, sustainable business and charging customers accordingly, and insinuating that OpenAI is being much more reckless, financially.
I think it's difficult to estimate how profitable both are - depends too much on usage and that varies so much.
I think it is widely accepted that Anthropic is doing very well in enterprise adoption of Claude Code.
In most of those cases that is paid via API key not by subscription so the business model works differently - it doesn't rely on low usage users subsidizing high usage users.
OTOH OpenAI is way ahead on consumer usage - which also includes Codex even if most consumers don't use it.
I don't think it matters - just make use of the best model at the best price. At the moment Codex 5.2 seems best at the mid-price range, while Opus seems slightly stronger than Codex Max (but too expensive to use for many things).
It's annoying though because it keeps (accurately) pointing out critical memory bugs that I clearly need to fix rather than pretending they aren't there. It's slowing me down.
Love it when it circles around a minor issue that I clearly described as temporary hack instead of recognizing the tremendously large gaping hole in my implementation right next to it.
Completely agreed. Used claude and codex both on highest tier next to each other for a month. On complex tasks where Claude would get stuck and not be able to fix it at all, codex would fix the issue in one go. Codex is amazing.
I did find some slip-ups in 5.2, though: in a refactor of a client header I removed two header properties, but 5.2 forgot to remove those from the toArray method of the class.
Was using 5.2 on medium (default).
Agree. Codex just read my source code for a toy Lisp I wrote in ARM64 assembly, learned how to code in that Lisp, and wrote a few demo programs for me. That was impressive enough. Then it spent some time and effort to really hunt down some problems -- there was a single bit-mask error in my garbage collector that wasn't showing up until then. I was blown away. It's the kind of thing I would have spent forever trying to figure out before.
I've been writing a little port of the seL4 OS kernel to Rust, mostly as a learning exercise. I ran into a weird bug yesterday where some of my code wasn't running - qemu was just exiting. And I couldn't figure out why.
I asked codex to take a look. It took a couple minutes, but it managed to track the issue down using a bunch of tricks I've never seen before. I was blown away. In particular, it reran qemu with different flags to get more information about a CPU fault I couldn't see. Then got a hex code of the instruction pointer at the time of the fault, and used some tools I didn't know about to map that pointer to the lines of code which were causing the problem. Then took a read of that part of the code and guessed (correctly) what the issue was. I guess I haven't worked with operating systems much, so I haven't seen any of those tricks before. But, holy cow!
It's tempting to just accept the help and move on, but today I want to go through what it did in detail, including all the tools it used, so I can learn to do the same thing myself next time.
Interestingly it found a GC bug in my toy Lisp that I wrote in Z80 assembly almost 30 years ago. This kind of work appears to be more common than you'd think!
Agreed, I'm surprised how much care the "extra high" reasoning allows. It easily catches bugs in code other LLMs won't, and using it to review Opus 4.5 is highly effective.
Exactly. This is why the workflow of consulting Gemini/Codex for architecture and overall plan, and then have Claude implement the changes is so powerful.
If by "just breaks" means "refuses to write code / gives up or reverts what it does" -- yes, I've experienced that.
Experiencing that repeatedly motivated me to use it as a reviewer (which another commenter noted), a role which it is (from my experience) very good at.
I basically use it to drive Claude Code, which will nuke the codebase with abandon.
Claude Code was a big jump for me. Another large-ish jump was multi-agents and following the tips from Anthropic’s long running harnesses post.
I don’t go into Claude without everything already set up. Codex helps me curate the plan and the issue tracker (one instance). Claude gets a command to fire up into context, grab an issue, and implement it, and then Codex and Gemini review independently.
I’ve instructed Claude to go back and forth for as many rounds as it takes. Then I close the session (\new) and do it again. These are all the latest frontier models.
This is incredibly expensive, but it’s also the most reliable method I’ve found to get high-quality progress — I suspect it has something to do with ameliorating self-bias, and improving the diversity of viewpoints on the code.
I suspect rigorous static tooling is yet another layer to improve the distribution over program changes, but I do think that there is a big gap in folk knowledge already between “vanilla agents” and something fancy with just raw agents, and I’m not sure if just the addition of more rigorous static tooling (beyond the compiler) closes it.
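If it helps anyone picture it, a rough sketch of the shape of that loop is below. The claude -p (print mode) and codex exec (non-interactive) invocations are assumptions about the two CLIs, and pick_open_issue() plus the APPROVE convention are hypothetical glue, not features of either tool; the real setup runs the gate inside a Claude session via hooks rather than an outer script.

    # Rough sketch of the implement -> independent-review loop described above.
    # CLI invocations and the APPROVE convention are assumptions, not real features.
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    def pick_open_issue():
        # Hypothetical: pull the next item from your issue tracker / task file.
        return "Implement retry logic for the fetcher (see issues/0042.md)"

    issue = pick_open_issue()
    feedback = ""
    for _ in range(5):  # cap the back-and-forth
        run(["claude", "-p",
             f"Implement this issue:\n{issue}\n\nReviewer feedback so far:\n{feedback}"])
        review = run(["codex", "exec",
                      f"Review the latest changes for this issue:\n{issue}\n"
                      "Reply APPROVE if correct and complete, otherwise list the problems."])
        if review.strip().startswith("APPROVE"):
            break
        feedback = review  # feed objections into the next implementation round

A second reviewer (Gemini) would slot in as another run([...]) call with its own verdict, which is roughly where the diversity-of-viewpoints benefit seems to come from.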
If you're maxing out the plans across the platforms, that's 600 bucks -- but if you think about your usage and optimize, I'm guessing somewhere between 200-600 dollars per month.
It's pretty easy to hit a couple hundred dollars a day filling up Opus's context window with files. This is via Anthropic API and Zed.
Going full speed ahead building a Rails app from scratch it seemed like I was spending $50/hour, but it was worth it because the App was finished in a weekend instead of weeks.
I can't bear to go in circles with Sonnet when Opus can just one shot it.
Anthropic via Azure has sent me an invoice of around $8000 for 3-5 days of Opus 4.1 usage, and there is no way to track how many tokens were used during those days, how many were cached, etc. (And I thought it was part of the Azure sponsorship, but that's another story.)
That's only part of the reason this type of content is used in academic papers. The other part is that you never know which PhD student / postdoc / researcher will be reviewing your paper, which means you are incentivized to be liberal with citations (however tangential), just in case someone reading your paper has the reaction "why didn't they cite this work, in which I had some role?"
Papers with a fake air of authority are easily dispatched. What is not so easily dispatched is the politics of the submission process.
This type of content is fundamentally about emotions (in the reviewer of your paper), and emotion is undeniably a large factor in acceptance / rejection.
Indeed. One can even game review systems by leaving errors in for the reviewers to find so that they feel good about themselves and that they've done their job. The meta-science game is toxic and full of politics and ego-pleasing.