Hacker News | gopalv's comments

> Making models larger improves overall accuracy but doesn't reliably reduce incoherence on hard problems.

Coherence requires 2 opposing forces to hold it in one dimension, and at least 3 of them in higher dimensions of quality.

My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that upping the reasoning threshold resulted in less coherence - more experimentation before we hit a dead-end and had to turn around.

So we got better results using Haiku (failing over to Sonnet) instead of Opus for execution, and using the higher-reasoning model to decompose tasks rather than perform each one of them.

Once a plan is made, the cheaper models do better as they do not second-guess their approaches - they either fail or succeed; they are not as tenacious as the higher-cost models.

We can escalate to higher authority and get out of that mess faster if we fail hard and early.

Knowing exactly how a failure happened seems to be less useful to the higher-reasoning model than to the action-biased models.

Splitting up the tactical and strategic sides of the problem seems to work, much like how generals don't hold guns in a war.
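
A minimal sketch of that split, assuming the models are wired in as plain callables (planner, executor and escalate here are placeholders, not any particular SDK):

    from typing import Callable, Optional

    def run_task(goal: str,
                 planner: Callable[[str], list],
                 executor: Callable[[str], Optional[str]],
                 escalate: Callable[[str], str]) -> list:
        results = []
        for step in planner(goal):       # the expensive model only decomposes
            out = executor(step)         # the cheap model acts: a result, or None on failure
            if out is None:
                out = escalate(step)     # fail hard and early, then hand the mess upward
            results.append(out)
        return results

The strategic role never executes and the tactical role never plans; they only exchange steps and failures.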

[1] - https://arxiv.org/abs/2601.14351


> Coherence requires 2 opposing forces

This seems very basic to any kind of information processing beyond straight-shot, predictable transforms.

Expansion and reduction of possibilities, branches, scope, etc.

Biological and artificial neural networks converge many signals, which are then reduced by competition between them.

Scientific theorizing, followed by experimental testing.

Evolutionary genetic recombination and mutation, winnowed back by resource competition.

Generation, reduction, repeat.

This holds in a continually coordinated sense too: many of our systems work best by encouraging simultaneous cooperation and competition.

Control systems: a command signal proportional to demand, vs. continually reverse-acting error feedback.
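
As a minimal sketch (illustrative gains only, not a tuned controller):

    def control_step(setpoint: float, measurement: float,
                     kf: float = 1.0, kp: float = 0.5) -> float:
        feedforward = kf * setpoint                 # push: command proportional to demand
        feedback = kp * (setpoint - measurement)    # pull: reverse-acting correction of the error
        return feedforward + feedback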


> This seems very basic

Yes, this is not some sort of hard-fought wisdom.

It should be common sense, but I still see a lot of experiments which measure the sound of one hand clapping.

In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.

If you don't really want the experiments and data from the academic paper, we have a white paper whose contents will be completely obvious to anyone who has read High Output Management, The Mythical Man-Month, or A Philosophy of Software Design recently.

Nothing in there is new, except the field it is applied to has no humans left.


> Yes, this is not some sort of hard-fought wisdom.

By basic I didn't mean uninteresting.

In fact, despite the pervasiveness and obviousness of the control and efficiency benefits of push-pull, generating-reducing, cooperation-competition, etc., I don't think I have ever seen any kind of general treatment or characterization that pulled all these similar dynamics together. Or a hierarchy of such.

> In some sense, it is a product of laziness to automate human supervision with more agents, but on the other hand I can't argue with the results.

I think it is the fact that the agents are each operating coherently toward their respective, complementary goals, whereas asking one agent to both solve and judge creates conflicting constraints before a solution has even begun.

Creative friction.

I am reminded of brainstorming sessions, where it is so important to note ideas, but not start judging them, since who knows what crazy ideas will fit or spark together. Later they can be selected down.

So we institutionalize this separation/staging with human teams too, even if it is just one of us (within our context limits, over two inference sessions :).
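
In sketch form, with generate and judge standing in for whoever (or whatever) fills those roles:

    from typing import Callable

    def brainstorm(prompt: str,
                   generate: Callable[[str], str],
                   judge: Callable[[str], float],
                   n: int = 10, keep: int = 3) -> list:
        ideas = [generate(prompt) for _ in range(n)]          # expansion: no judging yet
        return sorted(ideas, key=judge, reverse=True)[:keep]  # reduction: select in a later pass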


More or less, delegation and peer review.

> The paper sounds too shallow. The error data doesn't seem to have a rationale or correlation with the architecture. Specifically, what makes the SAS architecture have the lowest error rates while a similar architecture with independent agents has the highest error rates?

I can believe SAS works great until the context contains errors that have since been corrected - there seems to be leakage between past mistakes and new ones if you leave them all in one context window.

My team wrote a similar paper[1] last month, but we found that the core component is not the orchestrator; it is a specialized evaluator for each action, which matches the result, goal and methods at the end of execution and reports back to the orchestrator on goal adherence.

The effect is sort of like a perpetual evals loop, which lets us improve the product every week, albeit agent by agent, without the Snowflake agent picking up the BigQuery tools, etc.
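
Roughly, in sketch form (simplified from the paper; run and evaluate stand in for an agent and its dedicated evaluator):

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class ActionReport:
        action: str
        result: str
        goal_adherence: float   # 0.0-1.0, as scored by the evaluator

    def execute_and_evaluate(goal: str, actions: list,
                             run: Callable[[str], str],
                             evaluate: Callable[[str, str, str], float]) -> list:
        reports = []
        for action in actions:
            result = run(action)
            score = evaluate(goal, action, result)       # match result, goal and method
            reports.append(ActionReport(action, result, score))
        return reports                                   # the orchestrator reads these as a rolling eval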

We started building this Nov 2024, so the paper is more of a description of what worked for us (see Section 3).

Also specific models are great at some tasks, but not always good at others.

My general finding is that Google models do document extraction best, Claude does code well, and OpenAI does task management in a somewhat sycophantic fashion.

Multi-agent setups were originally supposed to let us put together a "best of all models" world, but they also work for error correction if I have Claude write code and GPT-5 check the results instead of everything going into one context.

[1] - https://arxiv.org/abs/2601.14351


> But for just the cost of doubling our space, we can use two Bloom filters!

We can optimize the hash function to make it more space efficient.

Instead of using remainders to locate filter positions, we can use a Mersenne prime mask (say, 31), but in this case I have a feeling the best hash function to use would be to mask with (2^1)-1.
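
For what it's worth, reduction by a Mersenne modulus really can be done with shifts and adds instead of a division; a rough sketch (illustrative, not tuned):

    def mod_mersenne(x: int, s: int = 5) -> int:
        # reduce x modulo p = 2**s - 1 (31 when s = 5) using only masks, shifts and adds
        p = (1 << s) - 1
        while x > p:
            x = (x & p) + (x >> s)
        return 0 if x == p else x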


This produced strange results on my ternary computer. I had to use a recursive popcnt instead.


this is my new favorite comment on this cursed website


This is roughly what my startup is doing, automating financials.

We didn't pick this because it was super technical, but because the finance team is the team closest to the CEO that is both overstaffed and overworked at the same time - you have 3-4 days of crunch time for which you retain 6 people to get it done fast.

This was the org with extremely methodical, smart people who constantly told us, "We'll buy anything that means I'm not editing spreadsheets during my kid's gymnastics class."

The trouble is that the UI each customer wants has zero overlap with the others; if we actually added a drop-down for each special thing one person wanted, this would look like a cockpit and no new customer would be able to do anything with it.

The AI bit is really making the required interface complexity invisible (but also hard to discover).

In a world where OpenAI is Intel and Anthropic is AMD, we're working on a new Excel.

However, to build something like that you need a high-quality message-passing, cooperatively multi-tasking AI kernel, and you have to optimize your L1 caches ("context") well.


> Well, if you don't fsync, you'll go fast, but you'll go even faster piping customer data to /dev/null, too.

The trouble is that you need to specifically optimize for fsyncs, because usually it is either no brakes or hand-brake.

The middle ground of multi-transaction group-commit fsync seems to no longer exist because of SSDs and the massive IOPS you can pull off in general - now it is about syscall context switches.

Two minutes is a bit too much (also fdatasync vs fsync).
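
A bare-bones sketch of that middle ground (illustrative only, not any particular database's commit path):

    import os

    def commit_batch(wal_path: str, records: list, use_fdatasync: bool = True) -> None:
        # append a batch of transactions' WAL records, then pay for one durability barrier
        fd = os.open(wal_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
        try:
            for rec in records:                          # records are bytes
                os.write(fd, rec)                        # buffered, cheap
            if use_fdatasync and hasattr(os, "fdatasync"):
                os.fdatasync(fd)                         # data-only barrier
            else:
                os.fsync(fd)                             # the hand-brake
        finally:
            os.close(fd)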


IOPS only solves throughput, not latency. You still need to saturate internal parallelism to get good throughput from SSDs, and that requires batching. Also, even double-digit microsecond write latency per transaction commit would limit you to only 10K TPS. It's just not feasible to issue individual synchronous writes for every transaction commit, even on NVMe.

tl;dr "multi-transaction group-commit fsync" is alive and well


> produce something as high-quality as GoT

Netflix is a different creature because of streaming and time shifting.

They don't care whether people watch a pilot episode or binge-watch the last 3 seasons when a show takes off.

The quality metric is therefore all over the place; it is a mildly moderated popularity contest.

If people watch "Love is Blind", you'll get more of those.

On the other hand, this means they can take a slightly bigger risk than a TV network with ads, because you're more likely to switch to a different Netflix show you like and keep paying for Netflix than to switch to a different channel, which pays a different TV network.

As long as something sticks, the revenue numbers stay, so the ROI on any single show can be shaky.

Black Mirror: Bandersnatch, for example, was impossible to do on TV, but Netflix could do it.

Also, if GoT had been a Netflix show, they'd have cancelled it after Season 6 and we'd be lamenting the loss of whatever wonders it would have reached by Season 9.


> For double/bigint joins that leads to observable differences between joins and plain comparisons, which is very bad.

This was one of the bigger hidden performance issues when I was working on Hive - the default coercion goes to Double, which has a bad hashCode implementation [1] that causes joins to cluster and chain, which made every miss on the hashtable probe that much further from the original index.

The hashCode itself was smeared so that values within machine epsilon hash to the same bucket and .equals could do its join, but all of this really messed up the folks who needed 22-digit numeric keys (eventually the Decimal implementation handled it by adding a big fixed integer).

Double join keys in a database were one of the red flags in a SQL query - mostly, if you see them, someone messed something up somewhere.
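
A toy illustration of the clustering (a Python reconstruction of a Java-style Double.hashCode feeding a power-of-two table; not Hive's actual code):

    import struct

    def java_double_hash(x: float) -> int:
        # Java-style Double.hashCode(): XOR the high and low 32 bits of the IEEE-754 encoding
        bits = struct.unpack('>Q', struct.pack('>d', x))[0]
        return (bits ^ (bits >> 32)) & 0xFFFFFFFF

    # small integral doubles differ only in their high-order bits, so a table that
    # buckets on the low bits chains them all into the same slot
    print({x: java_double_hash(float(x)) & 1023 for x in range(1, 9)})  # every key lands in bucket 0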

[1] - https://issues.apache.org/jira/browse/HADOOP-12217


> trauma that our parents, or grandparents experienced could lead to behavior modifications and poorer outcomes in us

The nurture part of it is already well established; this is the nature part.

However, this is not a net-positive for the folks who already discriminate.

The "faults in our genes" thinking assumes that this is not redeemable by policy changes, so it goes back to eugenics and usually suggests cutting such people out of the gene pool.

The "better nurture" proponents for the next generation (free school lunches, early intervention and magnet schools) will now have to swim up this waterfall before arguing more investment into the uplifting traumatized populations.

We need to believe that Change (with a capital C) is possible right away if we start right now.


I would think it's the opposite. Intervention is preventative of further sliding. The alternative - genocide - is expensive; genocides are generally a luxury of states benefiting from a theft-based windfall.


> Can you build a Linux version? :-)

Generally speaking, it is the hardware not the OS that makes it easier to build for Macs right now.

The Apple Neural Engine is a sleeping giant in the middle of all this.


Parakeet still runs at 5x realtime on a middle-of-the-road CPU; it should be quite doable (at the cost of some battery life).


> would a fixed line in India typically be above that speed?

My family lives outside of a tier 2 city border, in what used to be farmland in the 90s.

They have Asianet FTTH at 1Gbps, but most of the video/streaming traffic ends at the CDN hosts in the same city.

That CDN push to the edge is why Hotstar is faster to load there - the latency on seeks isn't going around the planet.


That is really cool, but sad to see it's only at around 15% penetration.

