You cannot out-astroturf Claude in this forum, it is impossible.
Anyways, do you get shitty results with the $20/month plan? So did I but then I switched to the $200/month plan and all my problems went away! AI is great now, I have instructed it to fire 5 people while I'm writing this!
Just some anecdata++ here but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by codex and then re-prompt with the findings from the review. It finds good things, doesn't flag nits (if prompted not to) and the overall flow is worth it for me. Speed loss doesn't impact this flow that much.
I might flip that given how hard it's been for Claude to deal with longer context tasks like a coding session with iterations vs a single top down diff review.
I have a `codex-review` skill with a shell script that uses the Codex CLI with a prompt. It tells Claude to use Codex as a review partner and to push back if it disagrees. They will go through 3 or 4 back-and-forth iterations some times before they find consensus. It's not perfect, but it does help because Claude will point out the things Codex found and give it credit.
I don’t use OpenAI too much, but I follow a similar workflow. Use Opus for design/architecture work. Move it to Sonnet for implementation and build out. Then finally over to Gemini for review, QC and standards check. There is an absolute gain in using different models. Each has their own style and way of solving the problem, just like a human team. It’s kind of awesome and crazy and a bit scary all at once.
The way "Phases" are handled is incredible: research, then planning, then execution, with no context rot, because behind the scenes everything is being saved in a State.md file...
I'm on Phase 41 of my own project and the reliability and almost total absence of any error is amazing. Investigate and see if it's a fit for you. The PAL MCP you can set up to have Gemini, with its large context, review what Claude codes.
5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.
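A toy simulation (my own sketch, not from the thread) makes the effect concrete: let enough skill-free "models" take the same fixed test and the best reported score drifts well above chance, purely from reuse.

```typescript
// Toy illustration: 1,000 "models" that are pure coin flips, all scored
// against the same fixed 100-item test set. None has any real skill,
// yet the best reported accuracy lands well above the true 50%.
const flip = (): number => (Math.random() < 0.5 ? 0 : 1);
const testSet: number[] = Array.from({ length: 100 }, flip);

const accuracy = (preds: number[]): number =>
  preds.filter((p, i) => p === testSet[i]).length / testSet.length;

let best = 0;
for (let i = 0; i < 1000; i++) {
  best = Math.max(best, accuracy(Array.from({ length: 100 }, flip)));
}
console.log(best); // typically well above 0.5, despite every model being random
```

The selection step is the whole problem: reporting the max over many test-set queries is exactly what a leaderboard does.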
But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.
Is an imperfect yardstick better than no yardstick? It reminds me of documentation — the only thing worse than no documentation is wrong documentation.
Yes, because there’s value in a common reference for comparison. It helps to shed light on different models’ relative strengths and weaknesses. And, just like with performance benchmarks, you can learn to spot and read past the ways that people game their results. The danger is really more in when people who are less versed in the subject matter take what are ultimately just a semi-tamed genre of sales pitch at face value.
When such benchmarks aren’t available, what you often get instead is teams creating their own benchmark datasets and then testing both their own and existing models’ performance against them. Which is even worse, because they probably still run the test multiple times (there’s simply no way to hold others accountable on this front), but on top of that they often hyperparameter-tune their own model for the dataset while reusing previously published hyperparameters for the other models. Which gives them an unfair advantage, because those hyperparameters were tuned to a different dataset and may not have even been optimizing for the same task.
It's not just over-fitting to leading benchmarks; there are also too many degrees of freedom in how a model is tested (harness, etc.). Until there's standardized documentation enabling independent replication, it's all just benchmarketing.
Codex 5.3 seems to be a lot chattier. As in, it comments in the chat about things it has done or is about to do. They don't show up as "thinking" CoT blocks, but as regular outputs. Overall the experience is somewhat more like Claude's, in that you can spot the problems in the model's reasoning much earlier if you keep an eye on it as it works, and steer it away.
Another day, another hn thread of "this model changes everything" followed immediately by a reply stating "actually I have the literal opposite experience and find competitor's model is the best" repeated until it's time to start the next day's thread.
What amazes me the most is the speed at which things are advancing. Go back a year or even a year before that and all these incremental improvements have compounded. Things that used to require real effort to consistently solve, either with RAGs, context/prompt engineering, have become… trivial. I totally agree with your point that each step along the way doesn’t necessarily change that much. But in the aggregate it’s sort of insane how fast everything is moving.
The denial of this overall trend on here and in other internet spaces is starting to really bother me. People need to have sober conversations about the speed of this increase and what kind of effects it's going to have on the world.
Yeah, I really didn't believe in agentic coding until December, that was where it took off from being slightly more useful than hand crafting code to becoming extremely powerful.
And of course the benchmarks are from the school of "It's better to have a bad metric than no metric", so there really isn't any way to falsify anyone's opinions...
> All anonymous as well. Who are making these claims? script kiddies? sr devs? Altman?
You can take off your tinfoil hat. The same models can perform differently depending on the programming language, frameworks and libraries employed, and even project. Also, context does matter, and a model's output greatly varies depending on your prompt history.
It's hardly tinfoil to understand that companies riding a multi-trillion dollar funding wave would spend a few pennies astroturfing their shit on hn. Or overfit to benchmarks that people take as objective measurements.
Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because multiple "99% of the time, assumption X is correct" are reversed for my project. I think Opus does better at not falling into those traps. Excited to try out 5.3
It's relatively easy for people to grok, if a bit niche. Just sometimes confuses LLMs. Humans are much better at holding space for rare exceptions to usual rules than LLMs are.
In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding, I’m baffled why people think Claude has the edge on programming.
I think for many/most programmers, coding = 'speed + output', and webdev == "great coding".
Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generate custom CSS on every page) -- because I hate webdev and Claude designs are always better looking.
But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.
This is the way. People are unfortunately starting to divide themselves into camps on this (it’s human nature, we’re tribal), but we should try to avoid turning this into a Yankees vs. Red Sox rivalry.
Both companies are producing incredible models and I’m glad they have strengths because if you use them both where appropriate it means you have more coverage for important work.
GPT 5.2 codex plans well but fucks off a lot, goes in circles (more than opus 4.5) and really just lacks the breadth of integrated knowledge that makes opus feel so powerful.
Opus is the first model I can trust to just do things, and do them right, at least small things. For larger/more complex things I have to keep either model on extremely short leashes. But the difference is enough that I canceled my GPT Pro sub so I could switch to Claude. Maybe 5.3 will change things, but I also cannot continue to ethically support Sam Altman's business.
I'd say that GPT 5.2 did slightly better on the stuff that I'm working on currently compared to Opus 4.5, but it's rather niche: a fancy Lojban parser in Haskell. However, Opus is much easier to steer interactively because you can see what it's doing in more detail (although 5.3 is much improved in that regard!). I wouldn't feel empty-handed with either model, and both wrote large chunks of code for this project.
All that said, the single biggest reason why I use Codex a lot more is because the $200 plan for it is so much more generous. With Claude, I very quickly burn through the quota and then have to wait for several days or else buy more credit. With Codex, running in High reasoning mode as standard with occasional use of XHigh to write specs or debug gnarly issues, and having agents run almost around the clock in the background, I have hit the limit exactly once so far.
Didn't make a difference for me. Though I will say, so far 4.6 is really pissing me off and I might downgrade back to 4.5. It just refuses to listen to what I say, the steering is awful.
How many people are building the same thing multiple times to compare model performance? I'm much more interested in getting the thing I'm building built than in comparing AIs to each other.
Opus was quite useless today. Created lots of globals, statics, forward declarations, hidden implementations in cpp files with no testable interface, erasing types, casting void pointers, I had to fix quite a lot and decouple the entangled mess.
Hopefully performance will pick up after the rollout.
ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.
A key aspect of ARC AGI is to remain highly resistant to training on test problems which is essential for ARC AGI's purpose of evaluating fluid intelligence and adaptability in solving novel problems. They do release public test sets but hold back private sets. The whole idea is being a test where training on public test sets doesn't materially help.
The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...
I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.
> The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly
Not very accurate. For each of ARC-AGI-1 and ARC-AGI-2 there is training set and three eval sets: public, semi-private, and private. The ARC foundation runs frontier LLMs on the semi-private set, and the labs give them pre-release API access so they can report release-day evals. They mostly don't allow anyone else to access the semi-private set (except for live Kaggle leaderboards which use it), so you see independent researchers report on the public eval set instead, often very dubious. The private is for Kaggle competitions only, no frontier LLMs evals are possible.
(ARC-AGI-1 results are now largely useless because most of its eval tasks became the ARC-2 training set. However some labs have said they don't train LLMs on the training sets anyway.)
More fundamentally, ARC is for abstract reasoning. Moving blocks around on a grid. While in theory there is some overlap with SWE tasks, what I really care about is competence on the specific task I will ask it to do. That requires a lot of domain knowledge.
As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.
> It'll be noteworthy to see the cost-per-task on ARC AGI v2.
Already live. gpt-5.2-pro scores a new high of 54.2% with a cost/task of $15.72. The previous best was Gemini 3 Pro (54% with a cost/task of $30.57).
The best bang-for-your-buck is the new xhigh on gpt-5.2, which is 52.9% for $1.90, a big improvement on the previous best in this category which was Opus 4.5 (37.6% for $2.40).
I’ve gone through this process before and while it was more work it did not take 30 minutes.
I presented a student ID and was escorted through the security line. My baggage was selected for additional screening and I received a pat down search.
I went through an identical procedure on the return flight, right down to the exact words the TSA agent spoke to me while conducting the pat down.
I've also gone through this process, it did take about 30 minutes in my case. That also included waiting for a TSA agent to be available to even start the process. So YMMV, perhaps based on how busy the airport is at the time.
They had me answer a series of questions about past addresses etc, it wasn't just an extra pat down in my case. After answering all the questions correctly they allowed me to continue.
Bezos’ mom had him at 17, his biological father owned a bike shop, and his mother remarried when Bezos was 4 to a Cuban immigrant who came to the country at 16 and ended up working as a petroleum engineer.
They wound up middle class after all that, but I certainly wouldn’t say Bezos came from a “wealthy family”.
Bezos' parents lent him $250k to start Amazon. The point is that by the time Bezos started Amazon they were wealthy and could provide him this safety net. Not many middle class families would be able to loan their kid that much money.
okay but $250k is still $250k right? Most people in the world, for most parts of the world, don't see that kind of money in an entire lifetime of work. Most people think privilege means a trust fund, but a $250k loan of US dollars (life-savings or not) is also a privilege that most people don't have.
i think in this thread the goalposts were slowly moved. people were initially talking about success being predicted by having the excess necessary to comfortably take many shots on goal. it seems like we've granted that this $250k shot was a one-time thing.
it is true but irrelevant to the original topic that this is more money than the global poor ever see, and that this is more money that most people get to have. i don't think anyone was arguing that this represents zero privilege
Do you have a source for that being their life savings?
Most of your points have nothing to do with their wealth. Why would it suggest they’re poor if his mom had him at 17 and was taking night classes while raising him? She wasn’t employed, that just sounds like she herself was still able to take risks beyond her means probably because her father was wealthy.
Do you have a source for that not being their life savings? It sounds like you're just making assumptions and guesses as well; if you're going to assert Bezos came from wealth in the first place, you have to back that up. Perusing the "early life" section of Bezos' Wikipedia page doesn't suggest to me that he came from money, at least. But I don't see anyone on either side of the argument presenting anything beyond that.
> Do you have a source for that not being their life savings?
I mean there are many sources that talk about the $300k he received from his family to start Amazon, it's a famous story. None of those sources mention that it was his family's life savings. I don't really know how to provide a source that says it wasn't his family's life savings, but I also can't provide a source that says he wasn't an alien from Zeta Reticuli. This is generally the problem with proving a negative and why the onus is usually on the person making a positive assertion.
> if you're going to assert Bezos came from wealth in the first place, you have to back that up.
I did, I'm saying that a family that can give their son $300k to start a business in 1993 is wealthy. That would be about $674k today.
Yep, my father, with no business training or college was funded by my grandfather and was in business for years, decades. He ultimately failed without any savings and died in poverty. Being a small business owner was the only job he ever had.
My grandfather was similar: he was the first one to leave the farm life and tried several different careers and businesses. He worked for a railroad, was a realtor, owned a lumber yard, and lastly owned a delicatessen. The lumber yard nearly destroyed the entire family because he would sell on credit and then contractors failed to pay up on time. It was a huge disaster, and the thing is, this was way before the Home Depot national type chains or the "84 Lumber" regional type chains, and if he had had any business acumen at all, he could have been the franchise. People don't know what they don't know. Anyways, my dad worked for my grandfather for free for several years and screwed up his life quite a bit doing so in order to "save the family", and I think my dad has told me this damn story every single time I have called him on the telephone for at least the past 30 years. His complex over the whole situation must be enormous!
This is why I never started a business myself. I figured it was a family curse to fail at business.
Bezo’s maternal grandfather worked for the Department of Energy and owned a ranch in Texas. They were wealthy enough to have $300k to give to Jeff in 1993.
For one, the simple answer is incomplete. It gives the fully unwrapped type of the array but you still need something like
type FlatArray<T extends unknown[]> = Flatten<T[number]>[]
The main difference is that the first, rest logic in the complex version lets you maintain information TypeScript has about the length/positional types of the array. After flattening a 3-tuple of a number, boolean, and string array TypeScript can remember that the first index is a number, the second index is a boolean, and the remaining indices are strings. The second version of the type will give each index the type number | boolean | string.
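To make the distinction concrete, here is one way to sketch both versions; the type names are my own reconstruction of the idea, not TypeScript's standard library definitions.

```typescript
// Fully unwrap an element type, however deeply nested.
type Flatten<T> = T extends (infer U)[] ? Flatten<U> : T;

// "Simple" version: correct element type, but every index becomes the union.
type FlatArray<T extends unknown[]> = Flatten<T[number]>[];

// "First, rest" version: walks the tuple recursively and keeps
// positional information, spreading nested arrays in place.
type FlattenTuple<T extends unknown[]> =
  T extends [infer First, ...infer Rest]
    ? First extends unknown[]
      ? [...FlattenTuple<First>, ...FlattenTuple<Rest>]
      : [First, ...FlattenTuple<Rest>]
    : T;

// FlatArray<[number, boolean, string[]]>    -> (number | boolean | string)[]
// FlattenTuple<[number, boolean, string[]]> -> [number, boolean, ...string[]]
```

So with the tuple-walking version, a value of the flattened type still type-checks position by position, which is exactly the information the union version throws away.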
There's a lot of history behind WhatWG that revolves around XML.
WhatWG is focused on maintaining specs that browsers intend to implement and maintain. When Chrome, Firefox, and Safari agree to remove XSLT that effectively decides for WhatWG's removal of the spec.
I wouldn't put too much weight behind who originally proposed the removal. It's a pretty small world when it comes to web specifications, the discussions likely started between vendors before one decided to propose it.
The issue is you can’t say to put little weight on who originally proposed the removal when the other poster is putting all the weight on Google, who didn’t even initially propose it.
I wouldn't put weight on the initial proposer either way. As best I've been able to keep up with the topic, google has been the party leading the charge arguing for the removal. I thought they were also the first to announce their decision, though maybe my timing is off there.
By browser vendors, you mean? Yes it seems like they were in agreement and many here seem to think that was largely driven by google though that's speculation.
Users and web developers seemed much less on board though[1][2], enough that Google referenced that in their announcement.
Yes, that's what I mean. In this comment tree, you've said:
> google has been the party leading the charge arguing for the removal.
and
> many here seem to think that was largely driven by google though that's speculation
I'm saying that I don't see any evidence that this was "driven by google". All the evidence I see is that Google, Mozilla, and Apple were all pretty immediately in agreement that removing XSLT was the move they all wanted to make.
You're telling us that we shouldn't think too hard about the fact that a Mozilla staffer opened the request for removal, and that we should notice that Google "led the charge". It would be interesting if somebody could back that up with something besides vibes, because I don't even see how there was a charge to lead. Among the groups that agreed, that agreement appears to have been quick and unanimous.
In the github issues I have followed, including those linked above, I primarily saw Google engineers arguing for removing XSLT from the spec. I'm not saying they are the sole architects of the spec removal, and I'm not claiming to have seen all related discussions.
I am sharing my view, though, that Google engineers have been the majority share of browser engineer comments I've seen arguing for removing XSLT.
Probably if Mozilla didn't push for it initially XSLT would stay around for another decade or longer.
Their board syphons the little money that is left out of their "foundation + corporation" combo, and they keep cutting people from the Firefox dev team every year. Of course they don't want to maintain pieces of web standards if it means an extra million for their board members.
I'm convinced Mozilla is purposefully engineered to be rudderless: the C-suite draws down huge salaries and approves dumb, mission-orthogonal objectives, in order to keep Mozilla itself impotent and never a threat to Google.
Mozilla is Google's antitrust litigation sponge. But it's also kept dumb and obedient. Google would never want Mozilla to actually be a threat.
If Mozilla had ever wanted a healthy side business, it wasn't in Pocket, XR/VR, or AI. It would have been in building a DevEx platform around MDN and Rust. It would have synergized with their core web mission. Those people have since been let go.
> If Mozilla had ever wanted a healthy side business, it wasn't in Pocket, XR/VR, or AI. It would have been in building a DevEx platform around MDN and Rust[…] Those people have since been let go.
The first sentence isn't wrong, but the last sentence is confused in the same way that people who assume that Wikimedia employees have been largely responsible for the content on Wikipedia are confused about how stuff actually makes it into Wikipedia. In reality, WMF's biggest contribution is providing infrastructure costs and paying engineers to develop the Mediawiki platform that Wikipedia uses.
Likewise, a bunch of the people who built up MDN weren't and never could be "let go", because they were never employed by Mozilla to work on MDN to begin with.
(There's another problem, too, which is that in addition to selling short a lot of people who are responsible for making MDN as useful as it is but never got paid for it, it presupposes that those who were being paid to work on MDN shouldn't have been let go.)
So the idea is that some group has been perpetuating a decade or so's worth of ongoing conspiracy to ensure that Mozilla continues to exist but makes decisions that "keep Mozilla itself impotent"?
That seems to fail occam's razor pretty hard, given the competing hypotheses for each of their decisions include "Mozilla staff think they're doing a smart thing but they're wrong" and "Mozilla staff are doing a smart thing, it's just not what you would have done".
I guess you mean except Mozilla and Safari...which are the two other competing browser engines? It's not like it's a room full of Chromium-based browsers.
Mozilla has proven they can exist in a free market; really and truly, they do compete.
Safari is what I'm concerned about. Without Apple's monopoly control, Safari is guaranteed to be a dead engine. WebKit isn't well-enough supported on Linux and Windows to compete against Blink and Gecko, which suggests that Safari is the most expendable engine of the three.
I really can’t imagine Safari is going anywhere. Meanwhile the Mozilla Foundation has been very poorly steering the ship for several years and has rightfully earned the reputation it has garnered as a result. There’s a reason there are so many superior forks. They waste their time on the strangest pet projects.
Honestly the one thing I don’t begrudge them is taking Google’s money to make them the default search engine. That’s a very easy deal with the devil to make especially because it’s so trivial to change your default search engine which I imagine a large percentage of Firefox users do with glee. But what they have focused on over the last couple of years has been very strange to watch.
I know Proton gets mixed feelings around here, but to me it’s always seemed like Proton and Mozilla should be more coordinated. Feel like they could do a lot of interesting things together
Thankfully the New York Times lost their attempt to force OpenAI to continue preserving all logs on an ongoing basis, but they still need to keep some of the records they retained before September.
Being able to search browser history with natural language is the feature I am most excited for. I can't count the number of times I've spent >10 minutes looking for a link from 5 months ago that I can describe the content of but can't remember the title.
In my experience, as long as the site is public, just describing what I want to ChatGPT 5 (thinking) usually does the trick, without having to give it access to my own browsing history.
Google is an established business, OpenAI is desperately burning money trying to come up with a business plan. Exports controls and compliance probably isn't going to be today's problem for them, ever.
They don't; the Gemini crap is dead in the water, and the only people who care about it are Hacker News types or some weirdos. For normies, ChatGPT equals AI and that's that; they already won on the brand alone.
When normies hear Gemini, they cringe and get that icky feeling.
It didn't help that when Gemini came out it was giving you black founding fathers and Asian nazis.
My dad uses Gemini because it's the default thingy on his android phone - I asked him if he used ChatGPT and he said yes and navigated to Gemini. Most people really don't care that much I think.
At some point, Europe will learn that if they keep preventing international solutions without creating a climate in which similar or better local solutions can emerge, they are cutting off their nose to spite their face. There are secondary and tertiary effects of this, and eventually the 'huge market' will shrink in importance. I mean, Brazil is a huge market, and no one cares about them thanks to brain-dead legislation concerning tech imports and economic irrelevance.
No one cares about it because you get robbed at gunpoint at the stoplights.
Again no one in Europe cares about some Gemini because frankly no one even knows what it is. They had their run with the black founding fathers and most people who tried it then dismissed it forever.
Isn’t this what Recall in Windows 11 is trying to solve, and everyone got super up in arms over it?
I have no horse in the race either way, but I do find it funny how HN will swoon over a feature from one company and criticize to no end that same feature from another company just because of who made it.
At least Recall is on-device only, both the database and the processing.
I'm the last person to defend OpenAI on literally anything and personally I hope they crash and burn in a spectacular fashion and take the whole market down with them, but you at least have a choice in using Atlas as it's simply a program that you install on your computer of your own volition. With Recall, there's no choice, M$ will just shove it down your throat whether you want it or not, and most likely (knowing their history it's pretty much a guarantee) you'll be stuck with the privacy nightmare that is Recall with nothing you can do about it.
So the pushback makes perfect sense to me. Also, HN isn't 1 entity, it's many people with many different opinions, you can easily find people who were/are excited about Recall the same way people are excited about Atlas.
I think it makes sense; many don't have a choice to run Windows (Linux/Mac won't work for them for whatever reason). If MS turned on Recall without a way to disable it (and it's not hard to believe they would; see OneDrive), people would be upset.
With ChatGPT Atlas, you simply uninstall it. done.
Are we talking searching the URLs and titles? Or the full body of the page? The latter would require tracking a fuckton of data, including a whole lot of potentially sensitive data.
All of these LLMs already have the ability to go fetch content themselves, so I'd imagine they'd just skim your URLs and then do their own token-efficient fetching. When I use research mode with Claude it crawls over 600 web pages sometimes, so I imagine they've figured out a way to skim down a lot of the actual content on pages for token context.
I made my own browser extension for that, uses readability and custom extractors to save content, but also summarizes the content before saving. Has a blacklist of sites not to record. Then I made it accessible via MCP as a tool, or I can use it to summarize activity in the last 2 weeks and have it at hand with LLMs.
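For anyone curious, the save path of that kind of extension can be sketched roughly like this; every name and interface here is hypothetical, my own guess at the shape, not the commenter's actual code.

```typescript
// Hypothetical domains the extension refuses to record.
const BLACKLIST = ["mail.google.com", "bank.example.com"];

interface SavedPage {
  url: string;
  title: string;
  summary: string;
  savedAt: number;
}

// Skip a URL if its hostname is a blacklisted domain or a subdomain of one.
function shouldRecord(url: string): boolean {
  const host = new URL(url).hostname;
  return !BLACKLIST.some((b) => host === b || host.endsWith("." + b));
}

// Summarize-before-save: the extracted readable text is condensed first,
// so the store holds compact summaries rather than full page bodies.
async function savePage(
  url: string,
  title: string,
  extractedText: string,
  summarize: (text: string) => Promise<string>, // e.g. an LLM call
  store: (page: SavedPage) => Promise<void>,    // e.g. IndexedDB or SQLite
): Promise<boolean> {
  if (!shouldRecord(url)) return false;
  const summary = await summarize(extractedText);
  await store({ url, title, summary, savedAt: Date.now() });
  return true;
}
```

Injecting `summarize` and `store` keeps the blacklist/summarize logic testable apart from whatever MCP or storage backend sits behind it.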
I find browser history used to be pretty easy to search through and then Google got cute by making it into your "browsing journeys" or something and suddenly I couldn't find anything
There is no balancing happening here. YouTube needs to make an API call to attribute a view to a video, and easylist started blocking that API call. YouTube was perfectly happy a month ago to count views for users that were blocking ads, and presumably remains happy to do so.
The only thing that changed is easylist blocked the API.