To add to this. I was going through devin's 'pass' diffs from SWE bench. Every o...

ekidd · on March 18, 2024

Thank you for reading the diffs and reporting on them.

And to be fair, lots of humans are already at least this bad at writing code. And lots of companies are happy with garbage code so long as it addresses an immediate business requirement.

So Devin wouldn't have to advance much to be competitive in certain simple situations where people don't care about anything that happens more than 2 quarters into the future.

I also agree that producing good code which meets real business needs is a hard problem. In fact, any AI which can truly do the work of a good senior software engineer can probably learn to do a lot of other human jobs as well.

nyrikki · on March 18, 2024

Architectural erosion is an ongoing problem for humans, but at the SWE level they don't produce tightly coupled, low-cohesion code by default the majority of the time.

With this quality of changes, it won't be long until violations stack up to the point where further changes will be beyond any algorithm's ability to unravel.

While lots of companies do only look out in the short term, human programmers are incentivized to protect themselves from pain if they aren't forced into unrealistic delivery times.

AT&T Wireless being destroyed as a company due to a failed SAP migration that was largely due to fragile code is a good example.

But I guess if the developer jobs that will go away are from companies that want to underperform in the market due to errors and a code base that can't adapt to changing market realities, that may happen.

But I would fire any non-intern programmer if they constantly did things like removing deprecation comments and introducing circular dependencies with the majority of their commits.

https://github.com/CognitionAI/devin-swebench-results/blob/m...

PAC learning is powerful but is still probably approximately correct.

Until these tools can avoid the most basic bad practices, I don't see any company sticking with them in the long term, but it will probably be a very expensive experiment for many of them.

falcor84 · on March 18, 2024

Can't we just RLHF code reviews?

nyrikki · on March 18, 2024

RLHF works on problems that are difficult to specify yet easy to judge.

While RLHF will help improve systems, code correctness is not easy to judge outside of the simplest cases.

Note how, in OpenAI's technical report, they admit performance on college-level tests is almost exclusively from pre-training. If you look at the LSAT as an example, all those questions were probably in the corpus.

https://arxiv.org/abs/2303.08774

falcor84 · on March 18, 2024

>RLHF works on problems that are difficult to specify yet easy to judge.

But that's the thing: it seems that everyone here on HN (and elsewhere) finds it easy to judge the flaws of AI-generated code, and their judgments seem relatively consistent. So if we start offering these critiques as RLHF at scale, we should be able to bring the LLM output to the level where further feedback is hard (or at least inconsistent), right?

ogogmad · on March 19, 2024

> You simply can't get past what Gödel and Rice proved with current technology.

Not this again. Those theorems tell you nothing about your concerns. The worst case of a problem is not equal to its usual case.