Auto-formalization I: Keep Trying
Lean is a proof assistant: a programming language in which you write your mathematical statements and proofs. If your proof compiles, it is correct. Well, up to bugs in the language or compiler, which do happen, but let's not worry about bugs for the moment. In principle the check is 100% guaranteed and deterministic. No AI and no numerics in this part.
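To make "compiles means correct" concrete, here is a complete Lean file that the compiler accepts; the theorem name is mine, and `Nat.add_comm` is a lemma from the Lean core library.

```lean
-- If this file compiles without errors or `sorry` warnings,
-- the statement below is proved.
theorem my_add_comm (m n : Nat) : m + n = n + m :=
  Nat.add_comm m n
```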
When we (mathematicians) have a written-up paper, it contains our written accounts of the proofs. But math is complicated, humans make mistakes, and that is why Lean exists. Writing a Lean proof for an existing natural-language proof should in principle be a translation task. And LLMs are good at translation, right? So I wanted to try. This is my general stance towards AI: companies make bold claims, I usually cannot reproduce the glorious AI success stories on my own problems, and there is a gap between the two. I also think it is our responsibility as researchers to understand what is going on in this space and what the capabilities of LLMs truly are. I believe that even the strongest criticism or rejection of LLM tools must come from knowledge and experience, not fundamentalism. This blog post expresses this point of view well.
Another motivation to finally grasp Lean came from the 2025 Heidelberg Laureate Forum. I heard that there was a talk in which someone told the young researchers that it is pointless to learn Lean since the machines will handle it all. Manon and I discussed this in Eigenraum episode 54 and it annoyed me enough to do two things: finally learn some Lean myself with my human brain and also try this LLM-based auto-formalization.
In February 2026 the 1st Proof competition added urgency to the Lean question. Hundreds of LLM-generated informal proofs were submitted, and no human was particularly interested in checking them. If LLMs can produce mathematics that no one wants to verify by hand, the natural fix is to make them produce formal mathematics so that deterministic computers can check if the needle in the haystack is in fact a needle. My thoughts on the value of new mathematics with inscrutable but verifiably correct proofs will be in some other blog post.
Now what should I auto-formalize? I chose my 2010 paper “Binomial edge ideals and conditional independence statements,” with Herzog, Hibi, Hreinsdóttir, and Rauh. It is my most-cited paper and I know of one little problem in one of the proofs where a lemma that does not exist (i.e., is not true) is invoked. Rest assured that the proofs all check out if one fixes that problem. That fix is at a level an off-the-shelf LLM can handle. And there exist more general proofs in the literature.
Here is how some simple definitions look in Lean:
variable {K : Type*} [Field K] {V : Type*} [LinearOrder V]

def BinomialEdgeVars (V : Type*) := V ⊕ V

-- The two groups of variables from the paper: `x i` and `y i` are the
-- polynomial variables attached to the two copies of the vertex `i`.
noncomputable def x (i : V) : MvPolynomial (BinomialEdgeVars V) K :=
  MvPolynomial.X (Sum.inl i)
noncomputable def y (i : V) : MvPolynomial (BinomialEdgeVars V) K :=
  MvPolynomial.X (Sum.inr i)

def binomialEdgeIdeal (G : SimpleGraph V) :
    Ideal (MvPolynomial (BinomialEdgeVars V) K) :=
  Ideal.span { f | ∃ i j, G.Adj i j ∧ i < j ∧ f = x i * y j - x j * y i }

def IsClosedGraph (G : SimpleGraph V) : Prop :=
  (∀ {i j k : V}, i < j → i < k → j ≠ k → G.Adj i j → G.Adj i k → G.Adj j k) ∧
  (∀ {i j k : V}, i < k → j < k → i ≠ j → G.Adj i k → G.Adj j k → G.Adj i j)

noncomputable def binomialEdgeMonomialOrder :
    MonomialOrder (BinomialEdgeVars V) := MonomialOrder.lex
Before I describe what I did, the result is on GitHub and the LLM also created (with much help and instruction) this overview page for human consumption: tom111.github.io/BEI-lean.
This started February 17, 2026 and all proofs were complete on April 23, 2026. Since then I have been refactoring it, that is, improving code quality and shortening proofs.
What kind of math?
The paper has some graph theory and combinatorics and connects this to commutative algebra. Ideal theory and graphs are in Mathlib, the community library of formalized mathematics for Lean.
Aside: We need to talk about Mathlib. So humanity has set out to formalize all mathematics and put it into one big GitHub repository. Who are the gatekeepers of this repository? Who decides on style? Will there be only one proof of each statement? What is the best proof of each statement? There is a whole set of community organization questions about Mathlib that I also don’t want to discuss here. This Quanta article makes the point that this is like a new Bourbaki movement.
In practice, when doing what I aim to do, Mathlib only carries so far. For example, the proof of Proposition 1.6 starts like this:
We will show that $S/\mathrm{in}_{\lt}(J_G)$ is Cohen–Macaulay. This will then imply that $S/J_G$ is Cohen–Macaulay as well.
This is so well known to experts that there is not even a citation in the paper. Alas, it does not exist in Mathlib and this theorem (that you can pass from $J_G$ to $\mathrm{in}_{\lt}(J_G)$) is quite a bit of textbook material. In the paper there are several innocent-looking one-liners that get us into big trouble, LLM or not.
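The shape of the missing transfer theorem can at least be written down as a Lean statement with a `sorry` body. Everything here is my guess at an interface: the theorem name, the predicate `IsInitialIdealOf`, and the exact hypotheses are illustrative, not existing Mathlib declarations.

```lean
-- Hypothetical statement only: if S/in(J) is Cohen–Macaulay, so is S/J.
-- `IsInitialIdealOf` stands in for some formalization of in_<(J).
theorem isCohenMacaulay_of_initialIdeal
    {S : Type*} [CommRing S] (J inJ : Ideal S)
    (h_init : IsInitialIdealOf inJ J)
    (hCM : IsCohenMacaulayRing (S ⧸ inJ)) :
    IsCohenMacaulayRing (S ⧸ J) := sorry
```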
Two of the theorem groups in the paper rely on Gröbner bases and Cohen–Macaulay rings, which are not in Mathlib. But I’m of course not the first to notice this. People have already written lots of code and it is sitting in separate repos or big pull requests waiting for review. As a consequence of this, almost a third of the whole project here consists of code that is not specific to the arguments in the paper and should be in Mathlib.
Setup
My auto-formalization uses Claude Code with a Claude Code Max 5x subscription. I also used the MCP server lean-lsp-mcp which gives Claude the means to use Lean directly like I would do in an editor. With this the LLM can check snippets of code, get completion hints, look up definitions easily, etc. Without that direct feedback loop, the model would write new code, then call Lean on the command line, wait, read error messages and generally be slower and rely more on trial and error. Another good thing to have is a local install of Loogle, the Lean search engine. The LLM will want to use that a lot and the web version is slower and will block you if your LLM uses it too much.
Alongside the MCP server, I used Cameron Freer's lean4-skills, a set of Claude Code skills tailored for Lean work. A skill is nothing but a bunch of markdown files that get auto-added to prompts and tell the LLM to please, please behave in a certain way. It is part of the idiocy of LLMs that we forget everything computer science taught us about the separation of data and code: now the instructions are just more data, and we can only keep our fingers crossed and hope for the desired behaviour. More often than one would like, the LLM does not do what the instructions say, forgets to use the MCP, or suddenly hallucinates Python tools that don't exist. Cameron fortunately answered my questions on Zulip and helped me work around several bugs (with sub-agents in Claude Code). And this is a cat-and-mouse game: the behaviour of Claude Code and the LLMs changes very quickly.
Speaking of Zulip: the Lean community’s Zulip is enormous, with hundreds of messages per day across dozens of streams covering Mathlib development, newbie questions, tooling, and ongoing (auto-)formalizations. It’s way too much for me to follow, but I checked it from time to time anyway.
With everything set up, my project started with the original .tex file from 2010 and zero Lean code. As one does in this agentic coding world, I first created a CLAUDE.md file. Well, to be precise: I let Claude create one. This is a markdown file that Claude Code automatically reads at the start of every session and treats as standing instructions for the project (until it doesn’t). Then I let Claude create the basic layout of the paper and start formalizing the basic definitions and theorem statements. In Lean you can always put sorry for a proof that you plan to fill in later.
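The `sorry` workflow looks like this: the statement is fixed, the proof is a placeholder, and Lean keeps track of what is left (the theorem name here is just an illustration).

```lean
-- A stated theorem whose proof is deferred. Lean compiles the file
-- but reports the `sorry`, so nothing is silently accepted.
theorem still_to_be_proved (n : Nat) : n * 1 = n := by
  sorry  -- fill in later; `lake build` will flag this declaration
```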
I spent a significant amount of time trying to understand the basic definitions. With auto-formalized mathematics it is a bit like the paper-clip maximizer story. The LLM is set to achieve a certain goal, and if there is some exploit that also achieves this goal, it might find it. So if the definitions can be subtly changed to not mean what is in the paper but only to look alike, and thus make proving the result much easier, the LLM may go that route. Auto-formalization will never relieve us of the task of checking the definitions and statements. This is also a major reason why we need to learn Lean ourselves.
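A toy version of this failure mode, with made-up names: the honest definition quantifies over all inputs, while a subtly weakened look-alike adds a vacuous hypothesis that makes it trivially provable.

```lean
-- Honest definition: every vertex degree is even (toy example).
def AllEven (d : Nat → Nat) : Prop := ∀ i, d i % 2 = 0

-- Subtly gamed look-alike: the extra hypothesis `i < 0` is never
-- satisfiable for natural numbers, so this holds for every `d`.
def AllEvenGamed (d : Nat → Nat) : Prop := ∀ i, i < 0 → d i % 2 = 0

-- "Proof" of the gamed version, for free:
example (d : Nat → Nat) : AllEvenGamed d :=
  fun i hi => absurd hi (Nat.not_lt_zero i)
```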
Success comes in waves
After this setup there were some exciting weeks. I would regularly log into my Claude computer and cheer on Claude, admire the progress (sorry count going down). Claude is so obedient! And yes, I set up a separate computer on which Claude was allowed to do basically what it wanted without being babysat and without access to all my other data.
After the first success (Theorem 1.1 was formalized quickly), Theorem 2.1 became an obstacle. This is the Gröbner basis computation with the problematic proof. The proof contains many omissions and was a major hurdle for the formalization. But this is the whole point: we humans often hide our errors behind formulations like "it suffices to check …" or "the other case is similar". Lean does not buy that! Here things stalled, and the LLM kept producing more and more complex code without making real progress. I tried to point it at a different proof by my co-author Johannes Rauh, which proves a more general statement. This did not work either: the LLM would do that LLM thing where it switches back and forth between two different approaches, pure confusion, with partial proofs growing longer and longer. No matter how much detail I put into writing up the proof, it could not do it.
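As a tiny illustration of why Lean does not accept "the other case is similar": a case split produces a goal for every branch, and each one must be closed explicitly.

```lean
example {p q r : Prop} (h : p ∨ q) (hp : p → r) (hq : q → r) : r := by
  cases h with
  | inl hp' => exact hp hp'
  | inr hq' => exact hq hq'  -- "similar" or not, this branch must be written
```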
I eventually broke this deadlock using my friend Christian Stump’s site math.sciencebench.ai. This collects benchmark problems for AI math, but it also has a multi-chat, a tool to ask the same question to 5 new LLMs at the same time and combine their answers. After revising the informal proof with this, specifically instructing the LLMs to write the informal proof with auto-formalization in mind, the formalization got unstuck. I must say that I think GPT-5.4 Pro was the breakthrough here. These Pro models by OpenAI are very good at math reasoning (and take forever to produce their answer).
Guides and Prompts
After the Theorem 2.1 experience, I changed my method. Rather than assigning Claude high-level goals like "formalize the next theorem", I started to use a second LLM to write better prompts and even more elaborate guides. For this I used ChatGPT and OpenAI's Codex, because I had the feeling that OpenAI's models are slightly better at informal math. I also tried to let them write the Lean code, but they suck at that and Claude is much better. It's very weird. Such a guide is like a very elaborate todo item or plan: it specifies a proof strategy, any research already done on relevant Mathlib lemmas, expected intermediate statements, pitfalls to avoid, and so on. Here is how this could look:
Work on the next HH-side Proposition 1.6 packet.
Start by reading:
- AGENTS.md
- BEI.tex around Proposition 1.6 and the Herzog–Hibi reduction
- guides/INDEX.md
- the new guide for the HH bridge/localization blocker from the last round
- guides/work_packages/PROP_1_6_CM_TRANSFER.md
Then inspect:
- BEI/Equidim.lean
- toMathlib/CohenMacaulay/Defs.lean
- toMathlib/CohenMacaulay/Basic.lean
Current verified state:
- `sum_XY_isSMulRegular_mod_diagonalSum` is proved;
- `bipartiteEdgeMonomialIdeal_isWeaklyRegular` is now proved;
- the remaining HH-side gap is to convert that weakly regular sequence into
a genuine Cohen–Macaulay theorem;
- the recommended route is the graded CM approach: prove a theorem strong
enough to pass from CM at the irrelevant maximal ideal to global
`IsCohenMacaulayRing`, or isolate the exact graded/local theorem needed.
Your task for this round:
1. Investigate the smallest truthful theorem that lets the HH quotient become
globally Cohen–Macaulay from the proved weakly regular sequence at the
graded/irrelevant-maximal level.
2. If such a theorem is already within reach from the current local CM
infrastructure, implement it and use it to prove the HH-side CM theorem
for the bipartite edge-ideal quotient.
3. If it is not yet within reach, isolate the exact missing theorem or
localization/graded bridge lemma, and write a self-contained guide for
that blocker.
4. Update TODO.md and FORMALIZATION_MAP.md if theorem status changes.
5. Run `lake build` before finishing.
Important constraints:
- stay paper-faithful;
- do not present equidimensionality as Cohen–Macaulayness;
- do not switch to the Gröbner transfer theorem yet;
- do not reopen the finished weakly-regular-sequence plumbing unless a real
defect is found;
- prefer the smallest theorem that honestly closes the HH-side CM packaging
step.
It’s annoying to read, I know. The LLMs constantly come up with this jargon-heavy talk like “paper-faithful”, “HH-side packet”, “blocker”, etc. But well, as long as no human needs to read this in detail, fine.
In parallel, when Codex was not writing guides or prompts, it was also building the project’s homepage mostly on its own. There the jargon was a real problem: the human-facing pages came out full of this slop and needed several rounds of scrubbing before they were readable.
In any case, now I was no longer vibe-proofing but orchestrating two LLM agents. Beautiful.
Cohen–Macaulay: LLMs read Bruns–Herzog
Then came the Cohen–Macaulay results: Proposition 1.6, Corollaries 3.4 and 3.7. These were the last holdouts and I was about to give up. There was just so much textbook material missing in Mathlib, but in the end I decided to have the LLM write the missing infrastructure. It’s now in a local toMathlib/ folder: definitions and basic lemmas for graded Cohen–Macaulay rings, regular-quotient transfer, finite-free Cohen–Macaulay, localization-to-CM bridges, height additivity, graded irrelevant ideals, graded prime avoidance. One third of the repository is this code. And to be honest, much of it is copied and adapted from the Cohen–Macaulay pull requests mentioned above.
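To give a flavour of what such infrastructure looks like, here is a hypothetical sketch in the style of the toMathlib/ material. The name, the placeholder hypothesis, and the exact formulation are illustrative, not the repository's actual declarations.

```lean
-- Illustrative only: pass from Cohen–Macaulayness at the irrelevant
-- maximal ideal of a graded ring to the global property.
theorem isCohenMacaulayRing_of_localization_at_irrelevant
    {R : Type*} [CommRing R] (m : Ideal R) [m.IsMaximal]
    (h_graded : True)  -- stand-in for the graded hypotheses
    (hCM : IsCohenMacaulayRing (Localization.AtPrime m)) :
    IsCohenMacaulayRing R := sorry
```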
And putting this in a toMathlib folder is euphemistic: this particular code will never go into Mathlib. I just hope that the material eventually lands in Mathlib through the proper pull requests, so that I can make this folder smaller. Building this took Claude and Codex weeks again, but afterwards Proposition 1.6 went through on 2026-04-22 and Corollaries 3.4 and 3.7 followed the next morning. Then, finally, we had a commit "paper is sorry-free."
Aftermath
This formalization at the peak had around 39,500 lines of Lean code. And as it is all LLM-generated, it certainly contains many detours and idiotic proofs. Who wants to read that, maintain that, refactor that? Certainly not I!
I’m still having the LLMs work on the worst parts: find general lemmas, reduce code smell, cut the number of lines. As of this writing we are down to about 36,500 lines. During these refactors I also learned how to have LLMs first read everything, collaborate with other LLMs that handle the math and build up context, and so on.
Cost
To wrap up: I am positively surprised by how this went. I am convinced that the formalization is right and the theorems are all truly proven now. In the formalization world people talk about the de Bruijn factor as a measure of how much extra effort a formalized proof is over an informal proof. I think using LLMs this factor shrinks A LOT.
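For reference, the de Bruijn factor is usually defined as a simple ratio of (compressed or uncompressed) text sizes:

```latex
\[
  \text{de Bruijn factor}
  \;=\;
  \frac{\text{size of the formal proof}}{\text{size of the informal proof}}
\]
```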
In terms of money, I spent roughly €400 (out of my own pocket) on this project so far. This includes a ChatGPT Pro account for one month and 3 months of Claude Code Max 5x. Of course my (very valuable!!) time is not accounted for. I also used these LLM accounts for other experiments during that period, not only binomial edge ideals.
And then, each publication in one of the DEAL journals costs the taxpayer north of €2000, so there is another comparison to make: how much more expensive does it become to create a formalization together with your new publication?
I think for the good of mathematics we can accept these costs. There will be a follow-up post with lessons and more outlook.