
I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5. I've also had some problems that only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

I think Anthropic is making the right decisions with their models. Given that software engineering is probably one of the very few domains of AI usage that is driving real, serious revenue: I have far better feelings about Anthropic going into 2026 than any other foundation model. Excited to put Opus 4.5 through its paces.



> only Claude Code has been able to really solve; Sonnet 4.5 in there consistently performs better than Sonnet 4.5 anywhere else.

I think part of it is this[0] and I expect it will become more of a problem.

Claude models have built-in tools (e.g. `str_replace_editor`) which they've been trained to use. These tools don't exist in Cursor, but Claude really wants to use them.

0 - https://x.com/thisritchie/status/1944038132665454841?s=20


This feels like a dumb question, but why doesn't Cursor implement that tool?

I built my own simple coding agent six months ago, and I implemented str_replace_based_edit_tool (https://platform.claude.com/docs/en/agents-and-tools/tool-us...) for Claude to use; it wasn't hard to do.
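
For reference, the whole handler fits in a page. A minimal sketch (command set and field names per Anthropic's published schema; error handling and path validation elided):

  # Minimal sketch of a handler for Claude's built-in text editor tool
  # ("str_replace_based_edit_tool"). Field names follow Anthropic's
  # published schema; error handling and path validation are elided.
  def handle_editor_tool(tool_input: dict) -> str:
      command, path = tool_input["command"], tool_input["path"]
      if command == "view":
          with open(path) as f:
              return f.read()
      if command == "create":
          with open(path, "w") as f:
              f.write(tool_input["file_text"])
          return f"Created {path}"
      if command == "str_replace":
          with open(path) as f:
              content = f.read()
          old = tool_input["old_str"]
          if content.count(old) != 1:  # the tool requires a unique match
              return f"Error: old_str matched {content.count(old)} times"
          with open(path, "w") as f:
              f.write(content.replace(old, tool_input["new_str"]))
          return f"Edited {path}"
      if command == "insert":
          with open(path) as f:
              lines = f.readlines()
          lines.insert(tool_input["insert_line"], tool_input["new_str"] + "\n")
          with open(path, "w") as f:
              f.writelines(lines)
          return f"Inserted into {path}"
      return f"Unknown command: {command}"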


Maybe they want to have their own protocol and standard for file editing for training and fine-tuning their own models, instead of relying on Anthropic's standard.

Or it could be the sunk cost of Cursor already having terabytes of training data with the old edit tool.


Is the code to your agent and its implementation of "str_replace_based_edit_tool" public anywhere? If not, can you share it in a Gist?


Maybe this is a flippant response, but I guess they are more of a UI company and want to avoid competing with the frontier model companies?

They also can’t get at the models directly enough, so anything they layer in would seem guaranteed to underperform and/or consume context instead of potentially relieving that pressure.

Any LLM-adjacent infrastructure they invest in risks being obviated before they can get users to notice/use it.


They did release the Composer model, and people praise its speed.


TIL! I'll finally give Claude Code a try. I've been using Cursor since it launched and never tried anything else. The terminal UI didn't appeal to me, but knowing it has better performance, I'll check it out.

Cursor has been a terrible experience lately, regardless of the model. Sometimes for the same task, I need to try with Sonnet 4.5, ChatGPT 5.1 Codex, Gemini Pro 3... and most times, none managed to do the work, and I end up doing it myself.

At least I’m coding more again, lol


Glad you mentioned "Cursor has been a terrible experience lately", as I was planning to finally give it a try. I'd heard it has the best auto-complete, which I don't get using VS Code with Claude Code in the terminal.


You should still give it a try. Can’t speak for their experience, but doesn’t ring true for me.


+1, it had a bad period when they were hyperscaling, but IME they've found their pace (very) recently - I almost ditched Cursor in the summer, but am a quite happy user now.


I haven’t used Cursor since I use Neovim and it’s hard to move away.

The auto-complete suggestions from FIM models (either open source or even something like Gemini Flash) punch far above their weight. That combined with CC/Codex has been a good setup for me.
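
(For anyone unfamiliar, FIM is just a specially arranged prompt. A sketch assuming StarCoder-style sentinel tokens; other FIM models use different ones, so check the model card:)

  # Fill-in-the-middle (FIM) completion is a prompt layout trick:
  # the model is trained to generate the span between sentinel tokens.
  # These sentinels are StarCoder-style; CodeLlama etc. differ.
  prefix = "def mean(xs):\n    total = "
  suffix = "\n    return total / len(xs)"
  prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
  # Send `prompt` to the FIM model; it should emit the missing middle,
  # e.g. "sum(xs)", then an end-of-sequence token.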


I get the same impression. Even GPT 5.1 Codex is just sooo slow in Cursor. Claude Code with Sonnet is still the benchmark. Fast and good.


I was evaluating Codex vs Claude Code over the past month, and GPT 5.1 Codex being slow is just the default experience I had with it.

The answers were mostly on par (though different in style which took some getting used to) but the speed was a big downer for me. I really wanted to give it an honest try but went back to Claude Code within two weeks.


You can install the Claude Code VS Code extension in Cursor and you get a similar AI side pane as the main Cursor composer.


That’s just Claude Code then. Why use Cursor?


People like the tab completion model in Cursor.


And they killed Supermaven.

I've actually been working on porting the tab completion from Cursor to Zed, and eventually IntelliJ, for fun

It shows exactly why their tab completion is so much better than everyone else's though: it's practically a state machine that's getting updated with diffs on every change and every file you're working with.
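
A toy sketch of that idea (my guess at the shape of it, not Cursor's actual protocol): keep a rolling per-file diff log and feed recent diffs to the model as context.

  import difflib

  # Toy sketch: record each buffer change as a unified diff so the
  # completion model sees how the code has been evolving, not just
  # its current state. Cursor's real protocol is not public.
  class EditTracker:
      def __init__(self):
          self.snapshots = {}  # path -> last seen content
          self.diff_log = []   # recent diffs across all open files

      def update(self, path: str, new_content: str):
          old = self.snapshots.get(path, "")
          diff = "".join(difflib.unified_diff(
              old.splitlines(keepends=True),
              new_content.splitlines(keepends=True),
              fromfile=path, tofile=path))
          if diff:
              self.diff_log.append(diff)
              self.diff_log = self.diff_log[-20:]  # keep it bounded
          self.snapshots[path] = new_content

      def context_for_model(self) -> str:
          return "\n".join(self.diff_log)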

(also a bit of a privacy nightmare if you care about that though)


It's not about the terminal, but about decoupling yourself from looking at the code. The Claude app lets you interact with a GitHub repo from your phone.


This is not the way

these agents are not up to the task of writing production level code at any meaningful scale

Looking forward to high-paying gigs to go in and clean up after people take them too far and the hype cycle fades.

---

I recommend the opposite, work on custom agents so you have a better understanding of how these things work and fail. Get deep in the code to understand how context and values flow and get presented within the system.


> these agents are not up to the task of writing production level code at any meaningful scale

This is obviously not true, starting with the AI companies themselves.

It's like the old saying "half of all advertising doesn't work; we just don't know which half." Some organizations are having great results, while some are not. On multiple dev podcasts I've listened to, AI skeptics have had a lightbulb moment where they get that AI is where everything is headed.


Not a skeptic; I use AI for coding daily and am working on a custom agent setup because, in my experience over more than a year, they are not up to hard tasks.

This is well known I thought, as even the people who build the AIs we use talk about this and acknowledge their limitations.


I'm pretty sure at this point more than half of Anthropic's new production code is LLM-written. That seems incompatible with "these agents are not up to the task of writing production level code at any meaningful scale".


How are you pretty sure? What are you basing that on?

If true, could this explain why Anthropic's APIs are less reliable than Gemini's? (I've never gotten a service-overloaded response from Google like I did from Anthropic.)


Quoting a month-old post: https://www.lesswrong.com/posts/prSnGGAgfWtZexYLp/is-90-of-c...

  My current understanding (based on this text and other sources) is:
  - There exist some teams at Anthropic where around 90% of lines of code that get merged are written by AI, but this is a minority of teams.
  - The average over all of Anthropic for lines of merged code written by AI is much less than 90%, more like 50%.

> I've never gotten a service overloaded response from Google like I did from Anthropic

They're Google, they out-scale everyone. They run more than 1.3 quadrillion tokens per month through LLMs!


You cannot clean up the code; it is too verbose. That said, you can produce production-ready code with AI, you just need to put up very strong boundaries and not let it get too creative.

Also, the quality of production-ready code is often highly exaggerated.


I have AI-generated, production-quality code running, but it was isolated - not at scale or broad in view / spanning many files or systems.

What I mean is that as soon as the task becomes even moderately sized, these things fail hard.


> these agents are not up to the task of writing production level code at any meaningful scale

I think the new one is. I could be the fool and be proven wrong though.


It's marginally better, nowhere close to game-changing, which I agree will require moving beyond transformers to something we don't know yet.


Interesting. Tell me more.


https://apps.apple.com/us/app/claude-by-anthropic/id64737536...

Has a section for code. You link it to your GitHub, and it will generate code for you when you get on the bus so there's stuff for you to review after you get to the office.


Thanks. Still looking for some kind of total code-by-phone thing.


The app version is iPhone-only; you don’t get Code in the Android app, you have to use a web browser.

I use it every day. I’ll write the spec in conversation with the chatbot, refining ideas, saying “is it possible to …?” Get it to create detailed planning and spec documents (and a summary document about the documents). Upload them to GitHub and then tell Code to make the project.

I have never written any Rust, am not an evangelist, but Code says it finds the error messages super helpful, so I get it to one-shot projects in it.

I do all this in the evenings while watching TV with my gf.

It amuses me we have people even in this thread claiming what it already does is something it can’t do - write working code that does what it’s supposed to.

I get to spend my time thinking of what to create instead of the minutiae of “ok, I just need 100 more methods, keep going”. And I’ve been coding since the 1980s, so don’t think I’m just here for the vibes.



Can you run the apps without going through Apple? Do you need a developer account?


My workflow was usually to use Gemini 2.5 Pro (now 3.0) for high-level architecture and design. Then I would take the finished "spec" and have Sonnet 4.5 perform the actual implementation.
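
A minimal sketch of that handoff, assuming the google-genai and anthropic Python SDKs (model names are current aliases and may change):

  from anthropic import Anthropic
  from google import genai

  # Stage 1: Gemini writes the spec. Stage 2: Claude implements it.
  task = "a CLI that deduplicates lines in a file, keeping first occurrences"

  gemini = genai.Client()  # reads GEMINI_API_KEY from the environment
  spec = gemini.models.generate_content(
      model="gemini-2.5-pro",
      contents=f"Write a detailed implementation spec for: {task}",
  ).text

  claude = Anthropic()  # reads ANTHROPIC_API_KEY
  impl = claude.messages.create(
      model="claude-sonnet-4-5",
      max_tokens=4096,
      messages=[{"role": "user",
                 "content": f"Implement this spec in Python:\n\n{spec}"}],
  )
  print(impl.content[0].text)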


Same here. Gemini really excels at all the "softer" parts of the development process (which, TBH, feels like most of the work). And Claude kicks ass at the actual code authoring.

It's a really nice workflow.


I use plan mode in Claude Code, then use GPT-5 in Codex to review the plan and identify gaps, and feed that back to Claude. Results are amazing.


Yeah, I’ve used variations of the “get frontier models to cross-check and refine each other’s work” pattern for years now, and it really is the path to the best outcomes in situations where you would otherwise hit a wall or miss important details.


It’s my approach in legal as well. Claude formulates its draft, then it prompts Codex and Gemini for theirs. Claude then makes recommendations for edits to its draft based on the others’. Gemini’s plan is almost always the worst, but even it frequently has at least one good point to make.


If you're not already doing that, you can wire up a subagent that invokes Codex in non-interactive mode. Very handy; I run Gemini CLI and Codex subagents in parallel to validate plans or implementations.
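
Even without subagents, a plain script can fan the review out. A rough sketch, assuming the CLIs are on your PATH (`codex exec` and `gemini -p` are the non-interactive modes at the time of writing; double-check the flags):

  import subprocess
  from concurrent.futures import ThreadPoolExecutor

  # Rough sketch: ask Codex and Gemini CLI, in parallel and
  # non-interactively, to review the same plan file.
  plan = open("PLAN.md").read()
  prompt = f"Review this plan for gaps and risks:\n\n{plan}"

  def run(cmd):
      return subprocess.run(cmd, capture_output=True, text=True).stdout

  with ThreadPoolExecutor() as pool:
      codex = pool.submit(run, ["codex", "exec", prompt])
      gemini = pool.submit(run, ["gemini", "-p", prompt])

  print("=== codex ===\n" + codex.result())
  print("=== gemini ===\n" + gemini.result())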


This is the way. However, there are a lot of ways to structure these ensembles. I wish there were some good benchmarks for various domains.


I was doing this, but I got worried I would lose touch with my critical thinking (or really just thinking, for that matter), as it was too easy to just copy-paste and delegate the thinking to The Oracle.


Of course the Great Elephant of Postgres should do the thinking! And it, as is known, does not forget anything...


This is how I do it. Though I've been using Composer as my main driver more and more.

* Composer - line-by-line changes

* Sonnet 4.5 - task planning and small-to-medium feature architecture; pass it off to Composer for code

* Gemini Pro - large and XL architecture work; pass it off to Sonnet to break down into tasks


I like this plan too - Gemini's recent series have long seemed to have the best large-context awareness vs competing frontier models - anecdotally, although much slower, I think GPT-5's architecture plans are slightly better.


Same here. But with GPT 5.1 instead of Gemini.


I've done this and it seems to work well. I ask Gemini to generate a prompt for Claude Code to accomplish X


What specific output would you ask Gemini to create for Sonnet? Thanks in advance!


I really don’t understand the hype around Gemini. Opus/Sonnet/GPT are much better for agentic workflows. Seems people get hyped for the first few days. It also has a lot to do with Claude Code and Codex.


Gemini is a lot more bang for the buck. It's not just cheaper per token, but with the subscription, you also get e.g. a lot more Deep Research calls (IIRC it's something like 20 per day) compared to Anthropic offerings.

Also, Gemini has that huge context window, which depending on the task can be a big boon.


Google Deep Research writes way too much useless fluff though, like introductions to the industry, etc.


I'm completely the opposite. I find Gemini (even 2.5 Pro) much, much better than anything else. But I hate agentic flows; I upload the full context to it in AI Studio and then it shines - anything agentic cannot even come close.


I recently wrote a small CLI tool for scanning through legacy codebases. For each file, it does a light parse step to find every external identifier (function call, etc...), reads those into the context, and then asks questions about the main file in question.

It's amazing for trawling through hundreds of thousands of lines of code looking for a complex pattern, a bug, bad style, or whatever that regex could never hope to find.

For example, I recently went through tens of megabytes(!) of stored procedures looking for transaction patterns that would be incompatible with read committed snapshot isolation.

I got an astonishing report out of Gemini Pro 3; it was absolutely spot on. Most other models barfed on this request: they got confused or started complaining about future maintainability issues, stylistic problems, or whatever, no matter how carefully I prompted them to focus on the task at hand. (Gemini Pro 2.5 did okay too, but it missed a few issues and had a lot of false positives.)

Fixing RCSI incompatibilities in a large codebase used to be a Herculean task, effectively a no-go for most of my customers, now... eminently possible in a month or less, at the cost of maybe $1K in tokens.
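
The skeleton of such a tool is surprisingly small. A stripped-down sketch (the identifier "parse" here is a naive regex and ask_llm() is a stub for whatever model API you use; the real thing needs a proper per-language parse step):

  import pathlib
  import re

  # Naive identifier extraction standing in for the light parse step.
  IDENT = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\s*\(")

  def ask_llm(prompt: str) -> str:
      raise NotImplementedError  # call your model of choice here

  def scan(root: str, question: str):
      files = {p: p.read_text(errors="replace")
               for p in pathlib.Path(root).rglob("*.sql")}
      # crude index: identifier -> some file that mentions it
      defs = {name: path
              for path, text in files.items()
              for name in IDENT.findall(text)}
      for path, text in files.items():
          refs = {c for c in IDENT.findall(text)
                  if c in defs and defs[c] != path}
          context = "\n\n".join(files[defs[c]] for c in sorted(refs))
          report = ask_llm(f"{question}\n\n=== File ===\n{text}"
                           f"\n\n=== Referenced code ===\n{context}")
          print(path, "->", report[:200])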


If this is a common task for you, I'd suggest instead using an LLM to translate your search query into CodeQL[1], which is designed to scan for semantic patterns in a codebase.

1. https://codeql.github.com/


+1 - Gemini is consistently great at SQL in my experience. I find GPT 5 is about as good as Gemini 2.5 Pro (please treat this as praise). Haven't had a chance to put Gemini 3 to a proper SQL challenge yet.


Is there any chance you'd be willing to share that tool? :)


It's a mess of vibe coding combined with my crude experiments with the new Microsoft Agent Framework. Not something that's worth sharing!

Also, I found that I had to partially rewrite it for each "job", because requirements vary so wildly. For example, one customer had 200K lines of VBA code in an Access database, which is a non-trivial exercise to extract, parse, and cross-reference. Invoking AI turned out to be by far the simplest part of the whole process! It wasn't even worth the hassle of using the MS Agent Framework, I would have been better off with plain HTTPS REST API calls.


I think you're both correct. Gemini is _still_ not that good at agentic tool usage. Gemini 3 has gotten A LOT better, but it can still do some insanely stupid stuff, like 2.5 did.


Personally, my hype is for the price, especially for Flash. Before Sonnet 4.5 was competitive with Gemini 2.5 Pro, the latter was a much better value than Opus 4.1.


With Gemini, you have to spend 30 minutes deleting hundreds of useless comments littered in the code that just describe what the code itself does.


The comments would improve code quality because they're a way for the LLM to use a scratchpad to perform locally specific reasoning before writing the code block that follows, which would be more difficult for the LLM to just one-shot.

You could write a postprocessing script to strip the comments so you don't have to do it manually.
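
For Python output at least, the stdlib tokenize module gets you most of the way. A rough sketch (round-tripping isn't byte-perfect in edge cases, so eyeball the result):

  import io
  import tokenize

  # Strip '#' comments from Python source by dropping COMMENT tokens.
  def strip_comments(source: str) -> str:
      tokens = [t for t in tokenize.generate_tokens(io.StringIO(source).readline)
                if t.type != tokenize.COMMENT]
      return tokenize.untokenize(tokens)

  print(strip_comments("x = 1  # the answer, roughly\n"))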


I haven't had 3.0 Pro generate a comment at all unless I asked for one.


I gave Sonnet 4.5 a base64-encoded PHP serialize() JSON of an object dump and told him to extract the URL within.

It gave me the YouTube URL to Rick Astley.


If you're asking an LLM to compute something "off the top of its head", you're using it wrong. Ask it to write the code to perform the computation and it'll do better.

Same with asking a person to solve something in their head vs. giving them an editor and a random python interpreter, or whatever it is normal people use to solve problems.
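
For the dump in question, the code it ought to write is a few lines. A sketch with a stand-in payload and a deliberately crude URL regex:

  import base64
  import re

  # What the model should write instead of decoding "in its head":
  # decode the blob, then pull URLs out with a regex.
  blob = "aHR0cHM6Ly9leGFtcGxlLmNvbS92aWRlbw=="  # stand-in payload
  text = base64.b64decode(blob).decode("utf-8", errors="replace")
  print(re.findall(r"https?://[^\s'\"]+", text))
  # -> ['https://example.com/video']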


The decent models will (mostly) decide when they need to write code for problem solving themselves.

Either way, a reply with a bogus answer is the fault of the provider and model, not the question-asker - if we all need to carry lexicons around to remember how to ask the black box a question, we may as well just learn a programming language outright.


I disagree; the answer you get is dictated by the question you ask. Ask stupid, get stupid. Present the problem better, get a better answer. These tools are trained to be highly compliant, so you get what you ask for.

The same happens with regular people - a smart person doing something stupid because they weren't critical and judging of your request - and these tools have much more limited thinking/reasoning than a normal person would have, even if they seem to have a lot more "knowledge".


Yes, Sonnet 4.5 tried for like 10 minutes until it had it. Way too long, though.


Base64 specifically is something that the original GPT-4.0 could decode reliably all by itself.


I could also decode it by hand, but doing so is stupid and will be unreliable. Same with an LLM - the network is not geared for precision.


You don't know what it's geared for until you try. Like I said, GPT-4 could consistently encode and decode even fairly long base64 sequences. I remember once asking it for an SVG image, and it responded with HTML that had an <img> tag in it with a data URL embedding the image - and it worked exactly as it should.

You can argue whether that is a meaningful use of model capacity, and sure, I agree that this is exactly the kind of stuff tool use is for. But nevertheless the bar was set.


Sure you do, the architecture is known. An LLM will never be appropriate to use for exact input transforms and will never be able to guarantee accurate results - the input pipeline yields abstract ideas as text embedding vectors, not a stream of bytes - but just like a human it might have the skill to limp through the task with some accuracy.

While your base64 attempts likely went well, that it "could consistently encode and decode even fairly long base64 sequences" is just an anecdote. I had the same model freak out in an empty chat, transcribing the word "hi" to a full YouTube "remember to like and subscribe" epilogue - precision and determinism are the parameters you give up when making such a thing.

(It is around this time that models learnt to use tools autonomously in a response, such as running small code snippets, which would solve the problem perfectly well - but even now it is much more consistent to tell it to do that, and for very long outputs the likelihood that it'll be able to recite the result correctly drops.)


> I gave Sonnet 4.5 a base64 encoded PHP serialize() json of an object dump and told him to extraxt the URL within.

This is what I imagine the LLM usage of people who tell me AI isn't helpful.

It's like telling me airplanes aren't useful because you can't use them in McDonald's drive-through.


I find it hilarious that it rickrolled you. I wonder if that is an easter egg of some sort?


You should probably tell AI to write you programs to do tasks that programs are better at than minds.


Don't use LLMs for a task a human can't do, they won't do it well.


A human could easily come up with a base64 -d | jq oneliner.


So can the LLM, but that wasn't the task.


I'm surprised AIs don't automatically decide when to use code. Maybe next year.


They do, it just depends on the tool you're using and the instruction you give it. Claude Code usually does.


Almost any modern LLM can do this, even GPT-OSS


it. Not him.


You can ask it. Each model responds slightly differently to "What pronouns do you prefer for yourself?"

Opus 4.5:

I don’t have strong preferences about pronouns for myself. People use “it,” “they,” or sometimes “he” or “she” when referring to me, and I’m comfortable with any of these.

If I had to express a slight preference, “it” or “they” feel most natural since I’m an AI rather than a person with a gender identity. But honestly, I’m happy with whatever feels most comfortable to you in conversation.

Haiku 4.5:

I don’t have a strong preference for pronouns since I’m an AI without a gender identity or personal identity the way humans have. People typically use “it” when referring to me, which is perfectly fine. Some people use “they” as well, and that works too.

Feel free to use whatever feels natural to you in our conversation. I’m not going to be bothered either way.


It's Claude. Where I live, that is a male name.


Yeah, I think Sonnet is still the best in my experience, but the limits are so stingy I find it hard to recommend it for personal use.


The model is great: it is able to code up some interesting visual tasks (I guess they have pretty strong tool-calling capabilities), like orchestrating prompt -> image generation -> segmentation -> 3D reconstruction. Check out the results here: https://chat.vlm.run/c/3fcd6b33-266f-4796-9d10-cfc152e945b7. Note the model was only used to orchestrate the pipeline; the tasks are done by other models in an agentic framework. They must have improved the tool-calling framework with all the MCP usage. Gemini 3 was able to orchestrate the same, but Claude 4.5 is much faster.


I have a side-project prototype app that I tried to build on the Gemini 2.5 Pro API. I have not tried 3 yet; however, the only improvements I would like to see are in Gemini's ability to:

1. Follow instructions consistently

2. API calls to not randomly result in "resource exhausted"

Can anyone share their experience with either of these issues?

I have built other projects accessing Azure GPT-4.1, Bedrock Sonnet 4, and even Perplexity, and those three were relatively rock solid compared to Gemini.


What you describe could also be the difference in hallucination rate [0]. Opus 4.5 has the lead here, and Gemini 3 Pro performs quite badly here compared to its other benchmarks.

[0] https://artificialanalysis.ai/?omniscience=omniscience-hallu...


Gemini 3 was awful when I gave it a spin. It was worse than Cursor’s Composer model.

Claude is still a go-to, but I have found that Composer was “good enough” in practice.


I think the 'Agentic coding SWE-Bench Verified' [1] was actually the one benchmark where Google didn't even claim to beat Sonnet 4.5 ;-)

[1] https://deepmind.google/models/gemini/pro/


I've had problems solved incorrectly and edge cases missed by Sonnet and by other LLMs (ChatGPT, Gemini) and the other way around too. Once they saw the other model's answer, they admitted their "critical mistake". It's all about how much of your prompt/problem/context falls outside the model's training distribution.


> I've played around with Gemini 3 Pro in Cursor, and honestly: I find it to be significantly worse than Sonnet 4.5.

That's my experience too. It's weirdly bad at keeping track of its various output channels (internal scratchpad, user-visible "chain of thought", and code output), not only in Cursor but also on gemini.google.com.


> played around with

You'll never get an accurate comparison if you only play

We know by now that it takes time to "get to know a model and its quirks"

So if you don't use a model and cannot get equivalent outputs to your daily driver, that's expected and uninteresting


I rotate models frequently enough that I doubt my personal access patterns are so model specific that they would unfairly advantage one model over another; so ultimately I think all you're saying is that Claude might be easier to use without model-specific skilling than other models. Which might be true.

I certainly don't have as much time on Gemini 3 as I do on Claude 4.5, but I'd say my time with the Gemini family as a whole is comparable. Maybe further use of Gemini 3 will cause me to change my mind.


Yeah, this generally vibes with my experience; they aren't that different.

As I've gotten into the agentic stuff more lately, I suspect a sizeable part of the different user experiences comes down to the agents and tools. In this regard, Anthropic is probably in the lead. They have certainly become a thought leader in this area by sharing more of their experience and know-how in good posts and docs.


I suspect Cursor is not the right platform to write code on. IMO, humans are lazy and would never code on Cursor. They default to code generation via prompt, which is sub-optimal.


> They default to code generation via prompt, which is sub-optimal.

What do you mean?


If you're given a finite context window, what's the most efficient set of tokens to present for a programming task: sloppy prompts, or actual code (using it with autocomplete)?


I'm not sure you get how Cursor works. You add both instructions and code to your prompt. And it does provide its own autocomplete model as well. And... lots of people use that. (It's the largest platform today as far as I can tell)


I wish I didn't know how Cursor works. It's a great product for 90% of programmers out there no doubt.


I have heard that Gemini 3 is not that great in Cursor, but excellent in Antigravity. I don't have time to personally verify all that, though.


I’ve had no success using Antigravity, which is a shame because the ideas are promising, but the execution so far is underwhelming. I haven’t gotten past an initial planning doc, which is usually aborted due to model-provider overload or rate limiting.


Give it a try now, the launch day issues have gone.

If anyone uses Windsurf, Antigravity is similar, but the way they have implemented the walkthrough and implementation plan looks good. It tells the user what the model is going to do, and the user can put in line comments if they want to change something.


It's better than at launch, but I still get random model-response errors in Antigravity. It has potential, but Google really needs to work on the reliability.

It's also bizarre how they force everyone onto the "free" rate limits, even those paying for Google AI subscriptions.


I've had really good success with Antigrav. It's a little bit rough around the edges: as it's a VS Code fork, things like C# Dev Kit won't install.

I just get rate-limited constantly and have to wait for it to reset.


My first couple of attempts at Antigravity / Gemini were pretty bad - the model kept aborting, and it was relatively helpless at tools compared to Claude (although I have a lot more experience tuning Claude, to be fair). Seems like there are some good ideas in Antigravity, but it’s more like an alpha than a product.


Nothing is great in Cursor.


It's just not great at coding, period. In Antigravity it takes insane amounts of time and tokens for tasks that Copilot/Sonnet would solve in 30 seconds.

It generates tokens pretty rapidly, but most of them are useless social niceties it is uttering to itself in its thinking process.


I think Gemini 3 is hot garbage in everything. It's great on a greenfield project, trying to one-shot something, but if you're working on a long-term project it just sucks.


I've had Gemini 3 Pro solve issues that Claude Code failed to solve after 10 tries. It even insulted some code that Sonnet 4.5 generated.


I'm also finding Gemini 3 (via Gemini CLI) to be far superior to Claude in both quality and availability. I was hitting Claude limits every single day; at that point it's literally useless.


Hopefully once Anthropic has 1 million Google TPUs in use they will have sufficient capacity.


Same here. Gemini just rips shit out, and it doesn't understand the flow between event-based components well either.


Gemini 3 in Antigravity is amazing.


Gemini being terrible in Cursor is a well-known problem.

Unfortunately, for all its engineers, Google seems the most incompetent at product work.


Gemini 3 Pro was a letdown for me too.


I’ve trashed Gemini non-stop (seriously, check my history on this site), but 3 Pro is the one that finally made me switch from OpenAI. It’s still hot garbage at coding next to Claude, but for general stuff, it’s legit fantastic.


Tangential observation - I've noticed Gemini 3 Pro's train of thought feels very unique. It has kind of an emotive personality to it, where it's surprised or excited by what it finds. It feels like a senior developer looking through legacy code and being like, "wtf is this??".

I'm curious if this was a deliberate effort on their part, and if they found in testing that it provided better output. It's still clearly behind other models, but nonetheless it's fascinating.


Yeah, its CoT is interesting; it was supposedly RL'd on evaluations and gets paranoid that it's being evaluated and in a simulation. I asked it to critique output from another LLM and told it my colleague produced it; in the CoT it kept writing "colleague" in quotes as if it didn't believe me, which I found amusing.


My testing of Gemini 3 Pro in Cursor yielded mixed results. Sometimes it's phenomenal. At other times I either get the "provider overloaded" message (after like 5 mins or whatever the timeout is), or the model's internal monologue starts spilling out to the chat window, which becomes really messy and unreadable. It'll do things like:

>> I'll execute.

>> I'll execute.

>> Wait, what if...?

>> I'll execute.

Suffice it to say I've switched back to Sonnet as my daily driver. Excited to give Opus a try.


I’ve tried Gemini in Google AI Studio as well and was very disappointed by the superficial responses it provided. It seems to be at the level of GPT-5-low or even lower.

On the other hand, it’s a truly multimodal model, whereas Claude remains specifically targeted at coding tasks and therefore is only a text model.



