This is just fascinating, isn't it? I have competing thoughts in my head:

1. It's software that offers non-deterministic output, and as such it is fiendishly difficult to write realistic end-to-end tests for. Of course it's experiencing regressions. Heisenbugs are the hardest bugs to catch and fix, but having millions of users will reliably uncover them. And for an LLM, almost every bug is a Heisenbug! What if OpenAI improved GPT4 on one metric and this "nerfed" it on some other, more important metric? That's just a classic regression. And what would robust, realistic end-to-end tests even look like for GPT4? (I take a stab at a sketch right after this list.)

2. It's software that presents itself as a human on the Internet—even worse, a human representing an institution. Of course nobody trusts it. Everyone is extremely mistrustful of the intents and motivations of other humans on the Internet, especially if those humans represent an organization. I co-ran the tiny activist nonprofit Fight for the Future for years, and it was really amazing how common it was for commenters in online spaces to assume the worst intentions; I learned to expect it and to react extremely patiently. Imagine what it's like for OpenAI, building a product that has become central to people's workflows. Of course people are paranoid and think they're the devil, and are able to hallucinate all manner of offense and model it with every paranoid theory imaginable. The funny thing is, the more successful GPT4 is at seeming human, the less some people will trust it, because they don't trust humans! And the smarter and more successful it seems, the lower that trust gets. (How much do most people trust smart, successful public figures?)

3. Maybe an overall improvement for most users (one that the data would strongly suggest is a valid change, and that would pass all the tests) is a regression for some smaller set of users whose use cases aren't represented in the tests. There might be some pairs of objectives that still present genuinely zero-sum tradeoffs given the size of the model and how it's built. What then? The usefulness of GPT4 is specifically that it is general purpose, i.e. that the massive cost of training it can be amortized across tons of different use cases. But intuitively there must be limits to this, where optimizing for some cases comes at a cost to others, beyond the oft-cited example of Bowdlerization. Maybe an LLM is just yet another case in the real world where sharing an important resource with lots of people is a hard problem.
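
For concreteness, here's roughly what I mean by an end-to-end test for non-deterministic output. This is only a sketch, not anything OpenAI actually runs: it assumes the pre-1.0 openai Python client, and the model name, sample count, pass threshold, and example prompt are all made up. The idea is to stop asserting on a single completion and instead sample the same prompt many times, then require a minimum pass rate on a check that is itself deterministic.

  import os
  import openai

  openai.api_key = os.environ["OPENAI_API_KEY"]

  def pass_rate(prompt, check, n=20, model="gpt-4"):
      """Fraction of n sampled completions that satisfy `check`."""
      passes = 0
      for _ in range(n):
          # Default temperature on purpose: we want the variance users see.
          resp = openai.ChatCompletion.create(
              model=model,
              messages=[{"role": "user", "content": prompt}],
          )
          passes += int(check(resp["choices"][0]["message"]["content"]))
      return passes / n

  def test_arithmetic():
      # The check is deterministic even though the phrasing of answers isn't.
      rate = pass_rate(
          "What is 17 * 23? Reply with the number only.",
          lambda a: a.strip().rstrip(".") == "391",
      )
      assert rate >= 0.95, f"pass rate regressed to {rate:.0%}"

The catch is that "the pass rate slipped from 98% to 90%" is only detectable with enough samples per case, which gets expensive fast, and the check function ends up doing most of the real work.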

If I were at OpenAI, I would want some third party running a community-submitted end-to-end test suite against each new release, using accounts kept secret from OpenAI and traffic from unrecognizable IP addresses—via Tor Snowflake bridges or something.

When users report Heisenbugs, it's so tempting to trick yourself into dismissing them and not accepting that you've shipped a real regression. In addition to wanting the world to know, I would want to know.

But there's a real question of what these community-curated tests would even be, since they'd have to be automated but objective enough to matter. Maybe GPT4 answers could be rated by an open source LLM run by a trusted entity, set to temperature: 0? Or maybe some tests could have unambiguous single-string answers, without optimizing for something unrealistic? And the tests would have to be secret or OpenAI could just finetune to the tests. It's tricky, right?
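
The single-string flavor might look something like this (again just a sketch: the case list, the normalization, and the model name are invented for illustration, and I'm assuming the pre-1.0 openai client). Pin temperature to 0, compare against one expected answer, and leave anything fuzzier to a separately run, frozen open-source judge model.

  import openai  # the pre-1.0 client reads OPENAI_API_KEY from the environment

  CASES = [
      {"prompt": "Chemical symbol for tungsten? One word only.", "expected": "W"},
      {"prompt": "Who wrote 'Pride and Prejudice'? Surname only.", "expected": "Austen"},
  ]

  def run_suite(cases, model="gpt-4"):
      failures = []
      for case in cases:
          resp = openai.ChatCompletion.create(
              model=model,
              temperature=0,  # reduces, but doesn't eliminate, run-to-run variance
              messages=[{"role": "user", "content": case["prompt"]}],
          )
          answer = resp["choices"][0]["message"]["content"].strip().strip(".")
          if answer != case["expected"]:
              failures.append((case["prompt"], case["expected"], answer))
      return failures

  if __name__ == "__main__":
      failed = run_suite(CASES)
      print(f"{len(CASES) - len(failed)}/{len(CASES)} passed")
      for prompt, expected, got in failed:
          print(f"FAIL {prompt!r}: expected {expected!r}, got {got!r}")

The exact-match cases are the easy part; the real value would be in the fuzzier ones scored by an open judge model, and that's exactly where keeping the scoring objective—and the cases secret—gets hard.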


