M2 was one of the most benchmaxxed models we've seen. Huge gap between SWE-B results and tasks it hasn't been trained on. We'll put 2.5 on the list. https://brokk.ai/power-ranking
We got lines-with-anchors working fine as a replacement strategy. The problem was that when you don't make the model echo what it's replacing, it's literally dumber at writing the replacement; we lost more to test failures and retries than we gained from faster outputs.
Makes sense when you think about how powerful the "think before answering" principle is for LLMs, but it's still frustrating.
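For anyone who hasn't seen the two edit styles side by side, here is a minimal sketch. It is not Brokk's actual implementation; `apply_anchor_edit` and `apply_echo_edit` are hypothetical names made up for illustration. The anchored version trusts line numbers plus a short prefix check, while the echo version makes the model reproduce the exact text it is replacing.

```python
def apply_anchor_edit(lines, start, end, start_anchor, end_anchor, replacement):
    """Replace lines[start..end] (1-indexed, inclusive) after verifying anchors.

    The model only emits line numbers and a short prefix of the first and
    last line, so it never re-reads the code it is about to overwrite.
    """
    if not (1 <= start <= end <= len(lines)):
        raise ValueError("line range out of bounds")
    if not lines[start - 1].lstrip().startswith(start_anchor):
        raise ValueError("start anchor mismatch")
    if not lines[end - 1].lstrip().startswith(end_anchor):
        raise ValueError("end anchor mismatch")
    return lines[:start - 1] + list(replacement) + lines[end:]


def apply_echo_edit(text, search, replace):
    """Replace one exact occurrence of `search`, which the model had to echo."""
    if text.count(search) != 1:
        raise ValueError("search block must match exactly once")
    return text.replace(search, replace, 1)


# Hypothetical file with a bug on line 2.
source = ["def area(w, h):", "    return w + h", "", "print(area(2, 3))"]

anchored = apply_anchor_edit(
    source, 2, 2, "return w + h", "return w + h", ["    return w * h"]
)
echoed = apply_echo_edit("\n".join(source), "    return w + h", "    return w * h")
```

Both produce the same file here. The difference is in what the model has to generate: the echo style costs extra output tokens, but those tokens are effectively the model re-reading the old code right before it writes the new code.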
Good to see that Anthropic is honest and open enough to publish a result with a mostly negative headline.
> Importantly, using AI assistance didn’t guarantee a lower score. How someone used AI influenced how much information they retained. The participants who showed stronger mastery used AI assistance not just to produce code but to build comprehension while doing so—whether by asking follow-up questions, requesting explanations, or posing conceptual questions while coding independently.
This might be cynically taken as cope, but it matches my own experience. A poor analogy until I find a better one: I don't do arithmetic in my head anymore. It's enough for me to know that 12038 x 912 is in the neighborhood of 10M; if the calculator gives me an answer much different from that, then I know something went wrong. In the same way, I'm not writing many for loops by hand anymore, but I know how the code works at a high level and how I want to change it.
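To make the arithmetic concrete, here is a rough sketch of that kind of sanity check. The helper name `roughly_matches` and the factor-of-two tolerance are my own assumptions, not anything from the study.

```python
def roughly_matches(estimate, actual, factor=2.0):
    """True when `actual` is within a multiplicative `factor` of `estimate`."""
    return estimate / factor <= actual <= estimate * factor


# Round both operands to something easy to multiply mentally.
estimate = 12_000 * 900      # 10_800_000
exact = 12038 * 912          # 10_978_656

plausible = roughly_matches(estimate, exact)

# A slipped digit gives an answer ten times too small, which the check catches.
slipped = roughly_matches(estimate, 1_097_865)
```

The estimate doesn't tell you the answer, it only tells you when to distrust the tool. That's the role high-level understanding plays when the AI writes the loop.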
(We're building Brokk to nudge users in this direction and not a magic "Claude take the wheel" button; link in bio.)
3-1-1 is rarely enforced. I always got confused why the 100ml limit existed, since I could just take multiple bottles of 100ml of whatever I wanted and it was okay. Then I realized that technically I only could take 3 bottles but I’ve been getting away with more for decades.
> Not because of a sudden outbreak of sanity, but because they have CT scanners now.
What is the evidence for believing so strongly that airports all over the world have been prohibiting large amounts of liquids due to widespread insanity?
Yeah, I flew thru Eindhoven Airport in the Netherlands a few years ago, and I couldn't believe they let me through with water.
The security used something I would describe as out of an Iron Man film, they were zooming around a translucent 3D view of my backpack. (It was on an LCD display instead of hovering midair, but I was still impressed. But the fact they let me keep the water was even more amazing, hahah.)
> The security used something I would describe as out of an Iron Man film, they were zooming around a translucent 3D view of my backpack. (It was on an LCD display instead of hovering midair, but I was still impressed.
I just flew with two laptops in my backpack and, for the first time, didn't have to take them out (haven't flown in a while). Sandwiched between the laptops was a custom PCB with a couple of Vivaldi antennas.
It was a real trip watching them view the three PCBs as a single stack, then automatically separate them out, and rotate them individually in 3D. The scanner threw some kind of warning and the operator asked me what the custom PCB was, so I had to explain to them it was a ground-penetrating radar (that didn't go over well; I had to check the bag).
Tel Aviv has been allowing this for quite some time (10 years?). I guess they update their security devices as soon as new technology becomes available.
They don't advertise it, I found out by accident, trying to empty my water bottle by drinking when a security person told me to just put it together with the rest of my stuff. I had no idea that was a thing and was pretty confused.
They’re multi-wavelength CT. Basically, whenever you see a 4:3 box with a “smiths” logo over the belt, it’s going to be a pretty painless process (take nothing out except analog film).
You can do realtime 3D flythroughs on CT scans with open source viewers. If you've ever had one, get your DICOM data set and enjoy living in the future.
I've seen this too in the US. The newer machines let them spin the scan around in 3D space, which must make it much easier to tell if something needs inspection or not.
Yeah these are pretty common in the US, but they're just not ubiquitous. Many airports will still have a CT machine next to the old one and it just depends on what line you get put in.
Love to see people leveraging static analysis for AI agents. Similar to what we're doing in Brokk but we're more tightly coupled to our own harness. (https://brokk.ai/) Would love to compare notes; if you're interested, hmu at [username]@brokk.ai.
Quick comparison: Auditor does framework-specific stuff that Brokk does not, but Brokk is significantly faster (~1M loc per minute).
Would be really cool to compare notes :D Sent from a "non tech" company email so it doesn't get filtered lol.
My speed really depends on language and what needs indexing. On pure Python projects I get around 220k loc/min, but for deeper data flow in Node apps (TypeScript compiler overhead + framework extraction) it's roughly 50k loc/min.
Curious what your stack is and what depth you're extracting to reach 1M/min - those are seriously impressive numbers! :D
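For a sense of what those rates mean in practice, here's a back-of-the-envelope sketch. The 2M-line codebase is a hypothetical size I picked for illustration; the rates are the ones quoted in this thread, and `minutes_to_index` is just a made-up helper.

```python
def minutes_to_index(loc, loc_per_minute):
    """Wall-clock minutes to index `loc` lines at a steady rate."""
    return loc / loc_per_minute


# Rates as quoted in the thread (lines of code per minute).
rates = {
    "Brokk (~1M loc/min)": 1_000_000,
    "Auditor, pure Python": 220_000,
    "Auditor, Node/TypeScript data flow": 50_000,
}

codebase_loc = 2_000_000  # hypothetical project size, not from the thread

estimates = {name: minutes_to_index(codebase_loc, rate) for name, rate in rates.items()}

for name, minutes in estimates.items():
    print(f"{name}: {minutes:.1f} min")
```

These aren't apples-to-apples numbers, since the extraction depth differs, so treat the spread as a rough scale rather than a benchmark.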
edit: not useless in an absolute sense, but worse than the vanilla gpt models