
Is anyone working on applying these techniques to formal verification of software?

My limited understanding of Rust is that it applies a fixed set of rules to guarantee memory safety. The rules are somewhat simple and limiting, for ease of understanding and implementation, but also because of undecidability.

Programmers run into situations where they know that their code won't cause memory errors, but it doesn't follow the rules. Wouldn't it be cool if something like Aristotle was integrated into the compiler? Any code for which a proof of correctness could be written would pass/compile, without having to add more and more rules





We are! We very recently announced some results on formally proving the correctness of programs: https://harmonic.fun/news#blog-post-verina-bench-sota

Formal methods are cool because, by contrast to tools like the borrow checker, you can prove some very "nonlocal" properties: this system does not deadlock, or it makes progress at least every N steps, etc.


Does Aristotle produce TLA+ output?

For example, can it read Rust async code and prove that there are no deadlocks in TLA+, or in some equivalent in Lean?


TLA+ is generally used to specify a "toy model" of some complex distributed system. It's not intended for end-to-end proof, for that you'd just use Coq/Rocq or Lean itself. Lean is certainly expressive enough, but you'll have to translate the time and non-determinism modalities of TLA+ as part of the Lean development.

How is “this system doesn’t deadlock” not the same as the halting problem?

Proving that a particular program terminates does not require deciding the halting problem on arbitrary programs (same for deadlock freedom)

Deadlock is literally a halting problem.

We can't know for every possible program whether it halts or not, but the complexity of the programs we can decide this for is increasing as tools and techniques get better.


An issue with this approach is that it may not be robust. That is, you could run into a case where a minor modification of your program is suddenly not provable anymore, even though it is still correct. The heuristic (AI or otherwise) necessarily has limits, and if you are close to the "edge" of its capabilities then a minor change could push it across.

If the proof is rooted in the understanding of a programmer who can give proof hints to the prover, then any modification of the program can be accompanied by a modification of the hints, still allowing automatic proofs. But if the human has no clue, then the automatic system can get stuck without the human having a chance to help it along.


The same is true for optimization. One small change and the compiler's optimizer doesn't know anymore how to optimize the code, and your code is now slow. And there is no way for a programmer to fix it except by rolling back their changes or by inspecting the assembly output.

Formal verification of program correctness is also (for obvious reasons) key to unlocking AI-driven synthesis (i.e. 'vibe' coding) of "correct" programs that will verifiably meet the given spec.

It will certainly help, but it's an extremely high bar. Almost all formal verification of software today is "does this pass the typechecker?".

Now this captures some errors, but it doesn't really capture high-level ones ("is this program guaranteed not to deadlock" is a hard one), and it doesn't capture the one that matters for business purposes (does this do what the customer wants). That requirement is more important than correctness (witness all the software that is described as "crap" but is nonetheless widely used).

I don't think this is a required key to unlocking vibe coding. That seems to be easy: does this provide business value? And there the answer seems roughly to be "yes".


Not all aspects of a spec can be formally encoded. But even half-way houses are good.

Eg you can give the vague spec 'build me a todo list app', but you can still formally prove that everything your app does finishes, or even that it finishes in reasonable time.


Yes. For sure we will never be able to 100% automate the whole SWE process. As you say, the first input is a human wish, and then there's the joke of the genie that always screws up the wishes by leaving out something obvious, because it wasn't explicitly specified. Also, I think at some point the halting problem will make some programs impossible to test. But it would be so great: program in a loose syntax, but with more safety than Rust and Ada together.

> Also I think at some point the halting problem will make some programs impossible to test.

No, not at all. The halting problem isn't much of a problem here.

To elaborate: yes, it's pretty much impossible to decide whether an arbitrary programme will halt. But we aren't dealing with arbitrary programmes, you carefully have your agent craft programmes that are easy to prove correct.

There are languages available today whose type systems only let you write terminating programmes. See eg https://news.ycombinator.com/item?id=32102203 the Dhall language. Or Agda, or Lean itself (unless you specifically opt out via the 'partial' keyword; but it's trivial to check whether someone used 'partial').
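A small sketch of what this looks like in Lean 4 (the function names here are illustrative): the termination checker accepts structurally recursive definitions, and `partial` is the explicit, easily greppable opt-out.

```lean
-- Accepted: Lean sees that the argument shrinks on each recursive call,
-- so this definition is provably terminating.
def sumTo : Nat → Nat
  | 0     => 0
  | n + 1 => (n + 1) + sumTo n

-- Without `partial`, Lean rejects a definition it cannot prove terminating.
-- With `partial`, you opt out of the termination check entirely:
partial def spin (n : Nat) : Nat := spin n
```

So an agent pipeline can enforce totality mechanically: reject any submission containing `partial` and let the termination checker do the rest.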

If your agent writes a programme that's not easy to prove terminating, you don't try harder to prove it. You just flag that as an error and have the agent try again.

Just like as a human code reviewer you reject Pull Requests that are too complicated to understand: you don't even bother figuring out whether they are technically correct or not.


Thanks. I expressed it with doubt because I assumed there had to be some way around it.

If I understood correctly, the halting problem states you cannot make a program that tests any program. But you can if you test a relatively small, possibly finite and well-defined subset of programs, right?


> If I understood correctly, the halting problem states you cannot make a program that tests any program. But you can if you test a relatively small, possibly finite and well-defined subset of programs, right?

In the standard theories everything becomes trivial, if you go finite.

The trick is not so much that you have a well-defined subset, but that you allow your programme analyser to say 'Eh, I don't know'.

The halting problem is impossible to solve in general if your analyser has to answer "Definitely halts" or "Definitely runs forever" for any given programme you present it with. If you allow a third option, "Eh, I don't know", you can solve the problem.

Trivially you might always say "Eh, I don't know" and never be wrong. But we can write useful analysers that try their best not to say "Eh, I don't know".

One example is type checkers in compilers. When they detect that you are eg trying to add numbers and strings, then something is definitely wrong with your programme. But when they let your programme through, it doesn't mean your programme is flawless.

The opposite is what you get with proof assistants like the ones we were discussing: if they prove your programme correct, it definitely conforms to the spec. But if they fail, that doesn't mean your programme is definitely incorrect; they might have just failed to find a proof.

There doesn't need to be a neat and simple definition of which programmes your theorem prover works on vs fails on. Eg if you upgrade the AI agent that helps you find the proof (or hire a smarter human to drive it), it might suddenly find a proof where it previously struggled.



