Code correctness should be checked automatically by CI and the test suite, and new tests should be added. That is exactly what keeps these stupid errors from bothering the reviewer. The same goes for code formatting and documentation.
This discussion makes me think peer review needs more automated tooling, somewhat analogous to what software engineers have long relied on. For example, a tool could use an LLM to check that a citation actually substantiates the claim the paper says it does, or else flag the claim for review.
I'd go one further and say all published papers should come with a clear list of "claimed truths", and one should only be able to cite a paper by linking to one of its explicit truths.
Then you could build a true hierarchy of citation dependencies, checked 'statically', and get better indications of the impact when a fundamental truth is disproven, ...
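To make the idea concrete, here's a minimal sketch of that 'static' check: claims are (paper, claim_id) pairs, citations are edges between specific claims, and disproving one claim flags everything transitively downstream. All the paper names, claim IDs, and edges here are invented for illustration.

```python
from collections import defaultdict, deque

# Hypothetical data: each citation links a citing claim to a specific cited claim.
deps = {
    ("B", 1): [("A", 2)],   # paper B's claim 1 rests on paper A's claim 2
    ("C", 1): [("B", 1)],
    ("C", 2): [("A", 1)],
}

# Invert to: claim -> claims that depend on it.
rdeps = defaultdict(list)
for claim, cited in deps.items():
    for c in cited:
        rdeps[c].append(claim)

def impacted(disproven):
    """Return every claim that transitively depends on a disproven claim."""
    seen, queue = set(), deque([disproven])
    while queue:
        node = queue.popleft()
        for dep in rdeps[node]:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

Disproving A's claim 2 would flag B#1 and, transitively, C#1, while leaving C#2 untouched; that's the "better indication of impact" a claim-level citation graph buys you.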
Could you provide a proof of concept paper for that sort of thing? Not a toy example, an actual example, derived from messy real-world data, in a non-trivial[1] field?
---
[1] Any field is non-trivial when you get deep enough into it.
I'd say my expectation is papers should be minimal in their effect, and compounding. If your project proves new facts, either they should be clearly enumerable (with as much specificity as possible), or your project/presentation/paper should be broken up to the point your findings ARE enumerable.
hey, i'm part of the gptzero team that built the automated tooling behind the results in that article!
totally agree with your thinking here: we can't just hand this to an LLM, because you need industry-specific standards for what counts as a hallucination or a match, and for how to do the search
One could submit their BibTeX files and have the citations verified by a low-level checker.
Worst case, if your BibTeX citation was a variant of one in the checker's database, you'd be asked to correct it to match the canonical version.
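A rough sketch of what such a low-level checker could look like: crudely parse the submitted entry's fields, look it up by DOI in a canonical database, and report whether it matches or is a variant needing correction. The DOI and database here are invented; a real checker would query Crossref or similar.

```python
import re

# Hypothetical canonical database keyed by DOI (the DOI below is made up).
CANONICAL = {
    "10.0000/example.doi": {"title": "attention is all you need", "year": "2017"},
}

def fields(entry):
    """Crudely extract field = {value} pairs from one BibTeX entry."""
    return {k.lower(): v for k, v in re.findall(r"(\w+)\s*=\s*\{([^}]*)\}", entry)}

def normalize(s):
    """Lowercase and strip punctuation so trivial variants still match."""
    return re.sub(r"[^a-z0-9 ]", "", s.lower()).strip()

def check(entry):
    """Return (ok, canonical_record_or_None) for a submitted entry."""
    f = fields(entry)
    canon = CANONICAL.get(f.get("doi", ""))
    if canon is None:
        return False, None   # unknown citation: flag for human review
    ok = (normalize(f.get("title", "")) == normalize(canon["title"])
          and f.get("year") == canon["year"])
    return ok, canon         # ok=False means "variant: correct to canonical"

entry = """@article{vaswani2017,
  title = {Attention Is All You Need},
  year = {2017},
  doi = {10.0000/example.doi}
}"""
```

When `check` returns a canonical record with `ok=False`, the tool can show the author the canonical version to copy in, which is exactly the worst-case flow described above.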
However, as others here have stated, hallucinated "citations" are actually the lesser problem. Citing irrelevant papers based on a fly-by reference is a much harder problem; it existed even before LLMs, but LLMs have made it far worse.
Yes, I think verifying the mere existence of the cited paper barely moves the needle. I mean, automated verification of that is a cheap rejection criterion, I guess, but it's not very useful overall.
this is still in beta because it's a much harder problem for sure, since it's hard to determine whether a 40-page paper supports a claim (if the paper claims X is computationally intractable, does that mean algorithms to compute approximate X are slow?)