While somewhat off-topic, I had an interesting experience the other day that highlights the utility of GitHub's Copilot. I decided to run Copilot on a piece of code that was working correctly, to see whether it would invent non-existent issues. Surprisingly, it pinpointed an actual bug. Following this discovery, I asked Copilot to generate a unit test so I could better understand the issue. When I ran the test, the program crashed just as Copilot had predicted, and I then refactored the problematic lines per Copilot's suggestions. This was my first time witnessing Copilot's effectiveness in such a scenario, and it was small but significant proof to me that language models can be invaluable coding tools, capable of identifying and helping to resolve real bugs. They have limitations, but I believe those imperfections are merely temporary hurdles on the way to more robust coding assistants.
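To give a flavor of the pattern (a made-up illustration, not my actual code or Copilot's actual output): a small indexing bug, plus the kind of failing unit test an assistant might generate to reproduce the crash.

    import static org.junit.jupiter.api.Assertions.assertEquals;

    import java.util.List;
    import org.junit.jupiter.api.Test;

    class LastAtOrBelowTest {

        // Hypothetical buggy method: meant to return the last element <= limit,
        // but the loop starts at size() instead of size() - 1, so get(i)
        // throws IndexOutOfBoundsException on the very first iteration.
        static int lastAtOrBelow(List<Integer> values, int limit) {
            for (int i = values.size(); i >= 0; i--) {   // BUG: should start at size() - 1
                if (values.get(i) <= limit) {
                    return values.get(i);
                }
            }
            throw new IllegalArgumentException("no element at or below limit");
        }

        // The kind of failing test an assistant might propose to reproduce the crash:
        // the call throws instead of returning 2, so the test fails.
        @Test
        void findsLastElementAtOrBelowLimit() {
            assertEquals(2, lastAtOrBelow(List.of(1, 2, 9), 5));
        }
    }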
Copilot at its present capabilities is already so valuable that not having it in some environment gives me the "disabledness" feeling that I otherwise only get when vim bindings are not enabled. Absolute miracle technology! I'm sure in the not-too-distant future we'll have privacy-preserving, open-source versions that are good enough that we don't have to shovel everything over to OpenAI.
That sounds like very basic code review - which I guess is useful in instances where one can't get a review from a human. If it has a low enough false-positive rate, it could be great as a background CI/CD bot that chimes in on the PR/changeset comments to say "You may have a bug here".
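Something like the sketch below is what I have in mind - just a rough outline, where LlmReviewer and PullRequestClient are made-up interfaces standing in for whatever model API and code-host API you'd actually wire up; it only comments when the model's reported confidence clears a threshold, to keep the false-positive noise down.

    import java.util.List;

    // Hypothetical sketch of a background review bot; LlmReviewer and
    // PullRequestClient are stand-ins for a real model API and code-host API.
    public class ReviewBot {

        record Finding(String file, int line, String message, double confidence) {}

        interface LlmReviewer {
            List<Finding> review(String diff);
        }

        interface PullRequestClient {
            void comment(String file, int line, String body);
        }

        private final LlmReviewer reviewer;
        private final PullRequestClient pr;
        private final double minConfidence;

        ReviewBot(LlmReviewer reviewer, PullRequestClient pr, double minConfidence) {
            this.reviewer = reviewer;
            this.pr = pr;
            this.minConfidence = minConfidence;
        }

        // Only surface findings above the confidence threshold, so the bot
        // stays quiet unless it is fairly sure "you may have a bug here".
        void run(String diff) {
            for (Finding f : reviewer.review(diff)) {
                if (f.confidence() >= minConfidence) {
                    pr.comment(f.file(), f.line(), "You may have a bug here: " + f.message());
                }
            }
        }
    }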
One nice thing about a machine code reviewing is no tedious passive-aggressive interactions or subjective style feedback you feel compelled to take etc.
Identifying potential bugs within a unit is only a part of a good code review; good code reviews also identify potential issues with broader system goals, readability, idiomaticity, elegance, and "taste" (e.g. pythonicity in Python), which require larger contexts than LLMs can currently muster.
So yes, the ability to identify a bug and provide a unit test to reproduce it is rather basic[1], compared to what a good code review can be.
1. An org I worked for used one such question for entry-level SWE interviews, in three parts: What's wrong with this code? Design test cases for it. Write the correct version (and check whether the tests pass).
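For a flavor of the format, here's a made-up example (not the actual interview question): part 1 is spotting the overflow, part 2 is a test case that exposes it, part 3 is the corrected version.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class MidpointInterviewTest {

        // Part 1: what's wrong with this code? (a + b) overflows for large ints.
        static int midpointBuggy(int a, int b) {
            return (a + b) / 2;
        }

        // Part 3: the corrected version - the classic binary-search-style fix,
        // valid for non-negative a <= b.
        static int midpointFixed(int a, int b) {
            return a + (b - a) / 2;
        }

        // Part 2: a test case that exposes the bug. It fails against
        // midpointBuggy (which returns -1 here due to overflow) and passes
        // against midpointFixed.
        @Test
        void handlesLargeValues() {
            assertEquals(Integer.MAX_VALUE, midpointFixed(Integer.MAX_VALUE, Integer.MAX_VALUE));
        }
    }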
Sharing knowledge, improving code quality, readability, and comprehensibility, reviewing test efficacy and coverage, validating business requirements and functionality, highlighting missing features or edge cases, etc. AI can fulfill this role, but it does so in addition to other automated tools like linters and whatnot; it isn't as of yet a replacement for a human, only an addition.
The better your code is before submitting it for review, the smoother the review will go, though. So if it's safe and allowed, by all means have Copilot take a look at your code first. But don't trust that it catches everything.
Calling it 'very basic' actually exalts the concept of code reviews, because the ideal code review is more than just identifying bugs in the code under review.
If I were to call the Mercedes A-Class a 'very basic Mercedes', it would imply that I believe superior versions of the make exist.
Try it on a million-line code base, where it's not so cut and dried to even determine whether the code is running correctly, or what "correctly" means when that changes from day to day.
"A tool is only useful if I can use it in every situation".
LLMs don't need to find every bug in your code - even if they found an additional 10% of genuine bugs compared to existing tools, that would still be a pretty big improvement to code analysis.
In reality, I suspect the figure is much higher than 10%.
If it takes you longer to vet hallucinations than to just test your code better, is it an improvement? If you accept a bug fix for a hallucination that you got too lazy to check because you grew dependent on AI to do the analysis for you, and the bug "fix" itself causes other unforeseen issues or fails to recognize why an exception in this case might be worth preserving, is it really an improvement?
What if indeed. Most static analysis tools (disclaimer: anecdotal) produce very few false positives these days. This may be much worse in C/C++ land, though, I don't know.
But also a much, much shorter attention span and tolerance for BS.
If you ask the LLM to analyze those 1,000,000 lines 1,000 at a time, 1,000 times, it'll do it, with the same diligence and attention to detail across all 1,000 chunks.
Ask a human to do it and their patience will be tested. Their focus will waver, they’ll grow used to patterns and miss anomalies, and they’ll probably skip chunks that look fine at first glance.
Sure, the LLM won't find big-picture issues at that scale. But it'll find plenty of code smells and minor logic errors that deserve a second look.
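A crude sketch of what I mean, where LlmClient is a made-up stand-in for whatever model API you'd actually use: split the source into fixed 1,000-line chunks and review each one independently, so chunk 1,000 gets the same attention as chunk 1.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Rough sketch of chunked review; LlmClient stands in for a real model API.
    public class ChunkedReview {

        interface LlmClient {
            String review(String sourceChunk);
        }

        static final int CHUNK_LINES = 1000;

        // Read a file, split it into 1,000-line chunks, and ask the model to
        // review each chunk on its own, reporting findings per line range.
        static void reviewFile(Path file, LlmClient llm) throws IOException {
            List<String> lines = Files.readAllLines(file);
            for (int start = 0; start < lines.size(); start += CHUNK_LINES) {
                int end = Math.min(start + CHUNK_LINES, lines.size());
                String chunk = String.join("\n", lines.subList(start, end));
                System.out.printf("lines %d-%d: %s%n", start + 1, end, llm.review(chunk));
            }
        }
    }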
Ok, why don't you run this experiment on a large public open source code base? We should be drowning in valuable bug reports right now but all I hear is hype.
While true, on the other hand an AI is a tool, and it can have a much larger context size and apply all of it at once. It also isn't limited by availability or time constraints: if you have only one developer who can do a review, tooling or AI that catches 90% of what that developer would catch is still worth a lot.
Yesterday I separated a 5,000-line class into smaller domains. It didn't provide the end solution and it wasn't perfect, but it gave me a good plan for where to place what.
Once it is capable of processing larger context windows, it will become impossible to ignore.
That's rather an exception in my experience. For unit tests it starts hallucinating hard once you have functions imported from other files. This is probably the reason most unit tests in their marketing materials are things like Fibonacci…
How did you prompt Copilot to identify issues? In my experience the best I can do is put in code comments describing what I want a snippet to do, and Copilot tries to write it. I haven't had good luck asking Copilot to rewrite existing code. The nearest I've gotten is:
// method2 is identical to method1 except it fixes the bugs
public void method2(){
These things are amazing when you first experience them, but I think in most cases the user fails to realise how common their particular bug is. But then you also need to realise there may be bugs in what has been suggested. We all know there are issues with Stack Overflow responses too.
Probably 85% of codebases are just rehashes of the same stuff. Copilot has seen it all, I guess.