For decades, research in formal verification has been guided by a simple mental model, for which I recently coined a name: the formal verification triangle.

The triangle captures a trade-off between three desirable properties:
- Automation – the verification tool runs largely without human guidance
- Scalability – the technique works on large real systems
- Precision – the method can prove interesting properties, such as functional correctness
Historically, verification techniques could reliably achieve two of the three, but not all three simultaneously.
| Approach | Automatic | Scalable | Precise |
|---|---|---|---|
| Static analysis | ✓ | ✓ | ✗ |
| Model checking | ✓ | ✗ | ✓ |
| Interactive theorem proving | ✗ | ✓ | ✓ |
Static analysis scales to millions of lines of code but sacrifices precision. Model checking provides precise answers but struggles with large systems. Interactive theorem proving can scale and remain precise — but only through substantial human effort.
The triangle was never a theorem, but it described the practical limits of verification engineering remarkably well.
Until now.
## What the Triangle Really Measured
All verification tools automate some work. The real question has always been what kind of labour can be delegated to machines.
Decision procedures — SAT solving, SMT solving, abstract interpretation — allowed machines to automate certain kinds of reasoning:
- constraint solving
- fixpoint computation
- symbolic execution
- bounded state exploration
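To make the "push-button" character of a decision procedure concrete, here is a deliberately naive sketch (illustrative Python, not any real solver): a brute-force SAT check that decides satisfiability with no human guidance at all. Real solvers are vastly more sophisticated, but the interface is the point: constraints in, yes/no answer out.

```python
from itertools import product

def brute_force_sat(clauses, n_vars):
    """Decide satisfiability of a CNF formula by exhaustive search.

    Clauses are lists of non-zero ints, DIMACS-style: positive i means
    variable i, negative i means its negation. Returns a satisfying
    assignment as a dict {var: bool}, or None if unsatisfiable.
    """
    for bits in product([False, True], repeat=n_vars):
        assignment = {i + 1: bits[i] for i in range(n_vars)}
        # A clause is satisfied if at least one literal is true.
        if all(any(assignment[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses):
            return assignment
    return None

# (x1 or x2) and (not x1 or x2) and (not x2 or x3)
clauses = [[1, 2], [-1, 2], [-2, 3]]
model = brute_force_sat(clauses, 3)  # a satisfying assignment
```

The human specifies the constraints; the machine does all of the reasoning. That is the kind of labour the triangle assumed could be delegated.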
But other tasks stubbornly remained human:
- inventing lemmas
- structuring proofs
- discovering complex invariants
- reorganising proof developments
- repairing proofs after changes
Interactive theorem proving brought these forms of reasoning into a mechanised setting: proofs could be constructed and checked within a proof assistant, but the reasoning itself largely remained manual.
The price was labour.
The landmark verification of functional correctness for the ~9k-line C implementation of the seL4 microkernel required roughly:
- ~200k lines of Isabelle/HOL proof
- ~20 person-years of work
The triangle therefore reflected a constraint on which kinds of reasoning could realistically be automated.
## A Striking Comparison
Recent progress in AI-assisted theorem proving and formalisation suggests that this constraint may be shifting.
An AI system recently produced a formal proof of the sphere-packing results in dimensions 8 and 24 consisting of roughly 200k lines of proof in about two weeks.
While the similarity in proof size to the seL4 result is striking, the two efforts are hardly identical. The seL4 project required building a complete system model and proof architecture for a real operating system. Mathematical formalisation builds on extensive libraries and, in the case of the sphere-packing result, a large body of existing human-written theory.
But the comparison is still illuminating.
For decades, verification engineers implicitly assumed that a proof development of that scale implied years of human effort. Very roughly speaking, during the seL4 project we would often talk about 10,000 lines of proof being about a year’s worth of effort.
That assumption may no longer hold.
If the cost of producing and maintaining proofs drops by even an order of magnitude, the implications for verified systems are profound. As others have noted, verification could move from heroic one-off projects to something closer to routine engineering practice.
## The Verification Feedback Loop
What seems to make recent systems effective is not simply that AI can generate proofs.
It is that they operate inside a verification feedback loop with two key ingredients:
- A correctness oracle
- Rich feedback that enables repair
Interactive theorem provers provide both.
The proof kernel acts as the oracle: it checks whether a proof step is valid. The proof state provides detailed feedback: unproved goals, missing assumptions, failed tactics, etc.
This creates a powerful loop:
propose → check → feedback → repair → repeat
The trusted kernel guarantees correctness, while the AI performs exploratory reasoning.
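The shape of this loop can be sketched in a few lines (illustrative Python; every name here is made up for the example, not any real prover's API). The essential point is the division of labour: soundness rests entirely on the checker, so the proposer is free to be unreliable.

```python
def verification_loop(goal, propose, check, max_attempts=10):
    """Generic propose -> check -> feedback -> repair loop.

    `propose(goal, feedback)` plays the role of the AI agent: it suggests
    a candidate, possibly guided by feedback from the last failure.
    `check(goal, candidate)` plays the role of the trusted kernel: it
    returns (True, None) on success or (False, feedback) on failure.
    """
    feedback = None
    for _ in range(max_attempts):
        candidate = propose(goal, feedback)    # propose
        ok, feedback = check(goal, candidate)  # check, producing feedback
        if ok:
            return candidate                   # kernel-approved result
        # otherwise the feedback flows into the next proposal (repair)
    return None                                # gave up: nothing certified

# Toy instantiation: the "goal" is to find a non-trivial divisor of 12,
# and the "kernel" simply checks divisibility.
def toy_check(goal, candidate):
    if candidate > 1 and goal % candidate == 0:
        return True, None
    return False, f"{candidate} does not divide {goal}"

def make_toy_propose():
    guesses = iter(range(7, 1, -1))  # naive agent: try 7, 6, 5, ...
    return lambda goal, feedback: next(guesses)

divisor = verification_loop(12, make_toy_propose(), toy_check)
```

Nothing the proposer does can compromise correctness: a result is only ever returned after the checker accepts it. That asymmetry is what makes it safe to delegate exploration to an untrusted agent.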
## The Role of Traditional Verification Techniques
This shift does not make traditional automated verification techniques, such as static analysis and model checking, obsolete.
Instead, it may highlight a different role for them.
Static analysis and model checking are already excellent sources of structured feedback:
- model checkers produce counterexample traces
- static analysers infer candidate invariants and abstractions
- abstract interpretation provides over-approximations of system behaviour
These signals can guide exploration in an AI-driven verification loop.
For example:
- a model checker might produce a counterexample trace that forces an AI agent to strengthen an invariant
- a static analyser might suggest invariants that the agent attempts to prove in a theorem prover
- abstract interpretation might provide approximations that guide proof search
Rather than replacing these techniques, AI may turn them into components of a feedback-rich verification ecosystem.
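The first bullet above can be sketched as a toy counterexample-guided loop (illustrative Python; real model checkers work over symbolic state spaces, not explicit sets). A "checker" searches for a counterexample to induction, and the "agent" repairs the candidate invariant by excluding it, until the invariant becomes inductive.

```python
def strengthen_invariant(states, init, step, bad):
    """Toy counterexample-guided search for an inductive invariant.

    Looks for a set I with: init contained in I, I closed under `step`,
    and I disjoint from `bad`. Starts from "everything except bad" and
    repeatedly removes any state whose successor escapes the candidate
    (a counterexample to induction). Terminates because the candidate
    shrinks on every iteration. Returns the invariant, or None if no
    safe inductive invariant containing init exists.
    """
    candidate = set(states) - set(bad)
    while True:
        # The "model checker": find a counterexample to induction.
        cti = next((s for s in candidate if step(s) not in candidate), None)
        if cti is None:
            break                   # candidate is now inductive
        candidate.discard(cti)      # the "agent" strengthens the invariant
    return frozenset(candidate) if set(init) <= candidate else None

# Toy system: a counter mod 8 starting at 0 and stepping by +2,
# with 3 as the bad state.
inv = strengthen_invariant(range(8), init={0}, step=lambda s: (s + 2) % 8,
                           bad={3})
```

Here the counterexample trace is a single state, but the structure generalises: the checker's feedback tells the agent exactly where the current candidate fails, which is far more useful than a bare "no".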
It would also be a mistake to assume that the future of verification simply looks like automatically writing interactive proofs. That is the most obvious extrapolation of today’s tools, but it may not be the most interesting one.
## A Call to Rethink Verification
The formal verification triangle assumed that automation meant decision procedures: push a button and get a yes/no answer.
AI suggests a broader notion of automation.
Automation can also mean delegating reasoning labour to agents whose output is continuously checked by a trusted oracle and refined through feedback.
Once that kind of delegation becomes possible, the design space of verification systems changes dramatically.
In ten years, we may not be verifying software inside a single proof assistant. Instead we may see entirely new verification architectures built around the propose–check–feedback–repair loop, combining theorem provers, model checkers, static analysers, and other tools in ways we have not yet explored.
The formal verification triangle did not describe a fundamental law of verification.
It described the limits of a particular generation of automation technology.
Those limits are starting to move.
The challenge now is to rethink verification in a world where previously undelegable reasoning labour can be delegated to machines.