Good point; they are standards, and by definition society forced vendors to behave and play nice together. LLMs are not standards yet, and it is sheer luck that English works fine across different LLMs for now.
Some labs are trying to push their own format and stop it, especially around reasoning traces, e.g. Codex removing reasoning traces between calls and Gemini requiring reasoning history. So don't take this for granted.
The claims in the article are incorrect. They conveniently ignore Meta's CWM models, which are open-source [1] and open-weight [2], sit at 65% on SWE-bench Verified (with TTS) and 54% pass@1, and are the same size (32B dense). So claims like "surpassing prior open-source state-of-the-art coding models of comparable sizes and context lengths", while conveniently leaving the previous OSS SOTA out of your eval tables, are ... sketchy.
Hey! These are great observations. So first, while TTS can improve performance, we wanted to evaluate the raw capability of our model. This meant generating only one rollout per evaluation instance, which follows other papers in the space like SWE-smith and BugPilot. In addition, TTS adds extra inference cost and is reliant on how rollouts are ranked, two confounding factors for deployable models where memory and inference speed are extremely important.
Following that line of reasoning, context length is another very large confounding factor. Longer context lengths improve performance, but they also result in enormous increases in KV cache size and memory requirements. We decided to control for this in our paper and focus on a 32K context length for 32B-size models, a context length that already pushes the bounds of what can be "deployable" locally.
Still, we evaluate at 64K context length using YaRN and are able to outperform CWM's 54% non-TTS performance, which it achieves using 128K context, a substantial increase over what we use. This is also pretty significant because we only ever train at 32K context, while CWM trains at a full 128K.
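To make the KV cache point concrete, here is a rough back-of-the-envelope calculation. The layer/head numbers are illustrative assumptions for a 32B-class dense model with grouped-query attention, not the actual architecture of either model:

    # All architecture numbers are illustrative assumptions; bf16 cache (2 bytes/value).
    layers, kv_heads, head_dim, bytes_per_value = 64, 8, 128, 2
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
    for ctx in (32_768, 65_536, 131_072):
        print(f"{ctx:>7} tokens -> {per_token * ctx / 2**30:.0f} GiB KV cache")
    # ~8 GiB at 32K vs ~32 GiB at 128K, per sequence, before weights and activations.

Under these assumptions, going from 32K to 128K context quadruples the per-sequence KV cache, which is what makes the longer-context configuration much harder to run locally.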
The difference is that the Allen Institute models have open training data, not just open code and weights. Meta doesn't share the training data you would need to reproduce their final models. For many uses open-weight models are nearly as good, but for advancing research it's much better to have everything in the open.
Reading their paper, it wasn't trained from scratch; it's a fine-tune of a Qwen3-32B model. I think this approach is correct, but it does mean that only a subset of the training data is really open.
> 2. What on earth is this defense of their product?
I think the distribution channel is the only defensive moat in low-to-mid-complexity, fast-to-implement features like code-review agents. So in the case of Linear and Cursor's Bugbot it makes a lot of sense. I wonder when GitHub/GitLab/Atlassian or Xcode will release their own review agent.
The problem with code review is that it's quite straightforward to just prompt for it, and the frontier models, whether Opus or GPT-5.2-Codex, do a great job at code review. I don't need a second subscription or API call when the one I already have, combined with a bit of integration work, works well out of the box.
In our case (agentastic.dev), we just baked the code review right into our IDE. It packages the diff for the agent, with some prompt, and sends it out to the agents of your choice (Claude, Codex) in parallel. The reason our users like it so much is that they don't need to pay extra for code review anymore. Hard to beat a free add-on, and the cherry on top is you don't need to read freaking poems.
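For a rough idea of what that flow looks like, here is a minimal sketch, assuming each agent is reachable as a local CLI; the command invocations, prompt, and structure are placeholders for illustration, not the actual implementation:

    import asyncio
    import subprocess

    PROMPT = "Review this diff for bugs, security issues, and style problems:\n\n"

    async def review(agent_cmd, diff):
        # Run one agent CLI with the packaged diff on stdin and collect its review.
        proc = await asyncio.create_subprocess_exec(
            *agent_cmd,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
        )
        out, _ = await proc.communicate((PROMPT + diff).encode())
        return out.decode()

    async def main():
        # Package the working-tree diff once, then fan out to several agents in parallel.
        diff = subprocess.run(["git", "diff"], capture_output=True, text=True).stdout
        agents = [["claude", "-p"], ["codex", "exec"]]  # hypothetical CLI invocations
        reviews = await asyncio.gather(*(review(cmd, diff) for cmd in agents))
        for cmd, text in zip(agents, reviews):
            print(f"--- review from {cmd[0]} ---\n{text}")

    asyncio.run(main())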
We use Codex review. It's working really well for us.
But I don't agree that it's straightforward. Moving the number of bugs caught and the signal-to-noise ratio a few percentage points is a compounding advantage.
It's a valuable problem to solve, amplified by the fact that AI coding produces much more code.
That being said, I think it's damn hard to compete with OpenAI or Anthropic directly on a core product offering in the long run. They know that it's an important problem and will invest accordingly.
Why do you think the previous management team couldn't pull an Elon and fire 80% of the engineering staff themselves? Why did they need external leadership to take over and do it?
The part I can't wrap my head around: at least in the case of Twitter, it was a hostile takeover. In the case of Vimeo, it didn't look hostile at all.
I wonder if the existing management had a lot of social ties to the existing workforce, so that they couldn't easily get away with doing it themselves without a big hit to their reputation.
Letting someone else do the dirty work allows them to disassociate themselves from the (predictable) outcome and frame it as just business.
I believe this is using Virtualization.framework and not Containerization API from Tahoe, right?
Is there a limit on the number of instances you can have per physical Mac? I recall there was a hard limit of 2 because of the EULA, unless Apple has changed it. (Cupertino really likes to sell you their Macs.)
You mentioned "deleting the actual project, since the file sync is two-way", my solution (in agentastic.dev) was to fist copy the code with git-worktree, then share that with the container.
Very few people have the expertise to write efficient assembly code, yet everyone relies on compilers and assemblers to translate high-level code to byte-level machine code. I think the same concept is true here.
Once using coding agents becomes trivial, few people will know the details of the programming language and make sure intent is correctly translated into code, while the majority will focus on different objectives and take LLM programming for granted.
No, that's a completely different concept, because we have faultless machines which perfectly and deterministically translate high-level code into byte-level machine code. This is another case of (nearly) perfect abstraction.
On the other hand, the whole deal of the LLM is that it does so stochastically and unpredictably.
The unpredictable part isn't new - from a project manager's point of view, what's the difference between an LLM and a team of software engineers? Both, from that POV, are a black box. The "how" is not important to them, the details aren't important. What's important is that what they want is made a reality, and that customers can press a button to add a product to their shopping cart (for example).
LLMs mean software developers let go of some control of how something is built, which makes one feel uneasy because a lot of the appeal of software development is control and predictability. But this is the same process that people go through as they go from coder to lead developer or architect or project manager - letting go of control. Some thrive in their new position, having a higher overview of the job, while some really can't handle it.
"But this is the same process that people go through as they go from coder to lead developer or architect or project manager - letting go of control."
In those circumstances, it's delegating control. And it's difficult to judge whether the authority you delegated is being misused if you lose touch with how to do the work itself. This comparison shouldn't be pushed too far, but it's not entirely unlike a compiler developer needing to retain the ability to understand machine code instructions.
As someone who started out working on assembly issues for a large corporation: assembly code can contain issues very similar to those in more high-level code, so the perfection of the abstraction is not guaranteed.
But yeah, there's currently a wide gap between that and a stochastic LLM.
We also have machines that can perfectly and deterministically check written code for correctness.
And the stochastic LLM can use those tools to check whether its work was sufficient; if not, it will try again, without human intervention. It will repeat this loop until the deterministic checks pass.
You can make analysers that check for deeply nested code, methods called in the wrong order, and whatever else you want to check. At work we've added multiple Roslyn analysers to our build pipeline to check for invalid/inefficient code; no human will be pinged by a PR until the tests pass. And an LLM can't claim "Job's Done" before the analysers say the code is OK.
And you don't need to make one yourself; there are tons you can just pick from.
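The loop described above is simple to wire up. A minimal sketch, assuming a generate() call wrapping whatever LLM you use and a .NET build/test gate with analysers treated as errors; the commands, file name, and ticket reference are placeholders for illustration:

    import subprocess

    MAX_ATTEMPTS = 5

    def generate(prompt):
        """Placeholder for whatever LLM call produces the code."""
        raise NotImplementedError

    def checks_pass():
        # Deterministic gate: build with analysers treated as errors, then run the tests.
        build = subprocess.run(["dotnet", "build", "-warnaserror"])
        tests = subprocess.run(["dotnet", "test"])
        return build.returncode == 0 and tests.returncode == 0

    prompt = "Implement the change described in TICKET-123"  # hypothetical task
    for attempt in range(MAX_ATTEMPTS):
        with open("src/Feature.cs", "w") as f:  # hypothetical target file
            f.write(generate(prompt))
        if checks_pass():
            break  # only now may the agent claim "Job's Done"
        prompt += "\nThe previous attempt failed the analysers or tests; fix it."
    else:
        print("Escalate to a human: the checks never passed.")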
> It's not like testing code is a new thing. JUnit is almost 30 years old today.
Unit tests check whether code behaves in specific ways. They certainly are useful to weed out bugs and to ensure that changes don't have unintended side effects.
> And code correctness:
These are tools to check for syntactic correctness. That is, of course, not what I meant.
Algorithmic correctness? Unit tests are great for quickly poking holes in obviously algorithmically incorrect code, but far from good enough to ensure correctness. Passing unit tests is necessary, not sufficient.
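A toy illustration of "necessary, not sufficient": the implementation below is wrong, yet plausible unit tests still pass (the function and tests are made up for illustration):

    def add(a, b):
        return a * b  # wrong implementation

    # These unit tests pass anyway, because the chosen examples can't tell * from +.
    assert add(2, 2) == 4
    assert add(0, 0) == 0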
Syntactic correctness is more or less a solved problem, as you say. Doesn't matter if the author is a human or an LLM.
It depends on the algorithm of course. If your code is trying to prove P=NP, of course you can't test for it.
But it's disingenuous to claim that even the majority of code written in the world is so difficult algorithmically that it can't be unit-tested to a sufficient degree.
Suppose you're right and the "majority of code" is fully specified by unit testing (I doubt it). The remaining body of code is vast, and the comments in this thread seem to overlook that.
> Very few people have the expertise to write efficient assembly code, yet everyone relies on compilers and assemblers to translate high-level code to byte-level machine code. I think the same concept is true here.
That's a poor analogy which gets repeated in every discussion: compilers are deterministic, LLMs are not.
> That's a poor analogy which gets repeated in every discussion: compilers are deterministic, LLMs are not.
Compilers are not used directly; they are used by human software developers, who are also not deterministic.
From the perspective of an organization with a business or service-based mission, they already know how to supervise non-deterministic LLMs because they already know how to supervise non-deterministic human developers.
Why does it matter if LLMs are not deterministic? Who cares?
There should be tests covering meaningful functionality; as long as the code passes the tests, i.e. the externally observable behaviour is the same, I don't care. (Especially if many tests can also be autogenerated with the LLM.)
>>> Very few people have the expertise to write efficient assembly code, yet everyone relies on compilers and assemblers to translate high-level code to byte-level machine code. I think the same concept is true here
>> That's a poor analogy which gets repeated in every discussion: compilers are deterministic, LLMs are not.
> Why does it matter if LLMs are not deterministic? Who cares?
In the context of this analogy, it matters. If you're not using this analogy, then sure, only the result matters. But when the analogy is to a process that is deterministic, then, yes, it matters.
You can't very well claim "We'll compare this non-deterministic process to this other deterministic process that we know works."
The difference is that if you write in C you can debug in C. You don't have to debug the assembly. You can write an English wish list for an LLM, but you will still have to debug the generated code. To debug it you will need to understand it.
Working for a few different clients atm as a freelancer:
Postgres ETL pipelines; Python glue code around a computer vision model; some REST API integrations to a frontend (outside of my direct control); an LLM-backed SQL generator that integrates with legacy shitware; a Swift iOS/macOS app...
I agree that I need to invest heavily in testing infrastructure, thanks. My work is pretty heterogeneous, so I kinda put that on the back burner as there's always more pressing short-term stuff to tackle...