Lean Retry (No PI)
Retry HURTS theorem proving. Bare "try again" on Lean gives -2.4pp, the opposite of math (+8.8pp) and code (+23.2pp). This is the exception that proves the rule: retry only works when solution paths are diverse enough for the model to find alternatives.
-2.4pp (REGRESSION)
LEAN
NEGATIVE
DOMAIN BOUNDARY
Hypothesis
The retry mechanism works by activating alternative reasoning paths. If Lean theorem proving has NARROW solution paths (few valid proof strategies per theorem), then retry should be less effective because the model cannot find alternatives through random exploration.
Prediction: +1.5-2.5pp (weaker than math due to narrow paths).
Actual: -2.4pp. Retry not only fails but actively hurts. Worse than no training at all.
Have-Skeleton OPSD (structural PI)
+3.3pp
PI tells the model WHICH strategy
Bare Retry (no PI)
-2.4pp
Model cannot find alternatives
Method
- First attempt: Generate one proof per theorem from Kimina-1.5B at T=0.7.
- Type-check: Run Lean 4 type-checker. Classify as PASS (proof compiles) or FAIL.
- Retry failures: For failed theorems, prompt "Your proof failed to type-check. Please try again with a different approach."
- Filter: Keep only retry proofs that successfully type-check.
- SFT: Fine-tune on correct retry proofs.
Identical pipeline to math/code retry, but using the Lean type-checker as the binary verifier.
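The five steps above can be sketched as a filter loop. This is a minimal illustration only: the names (`generate_proof`, `typechecks`, `RETRY_PROMPT`) and the toy pass/fail rule are stand-ins, not the experiment's actual code; the real pipeline samples from Kimina-1.5B at T=0.7 and calls the Lean 4 type-checker.

```python
# Minimal sketch of the generate -> type-check -> retry -> filter loop.
# The model and the Lean 4 type-checker are replaced by stand-in
# functions; all names here are illustrative.

RETRY_PROMPT = (
    "Your proof failed to type-check. "
    "Please try again with a different approach."
)

def generate_proof(theorem: str, extra_prompt: str = "") -> str:
    # Stand-in for sampling one proof from Kimina-1.5B at T=0.7.
    return f"proof_of({theorem}){'_retry' if extra_prompt else ''}"

def typechecks(proof: str) -> bool:
    # Stand-in for the Lean 4 type-checker (binary PASS/FAIL verifier).
    return proof.endswith("_retry")  # toy rule: only retries pass

def collect_sft_data(theorems):
    """Return (theorem, proof) pairs kept for SFT: only retry proofs
    that type-check, from theorems whose first attempt failed."""
    sft_pairs = []
    for thm in theorems:
        first = generate_proof(thm)
        if typechecks(first):
            continue  # first attempt passed: nothing to retry
        retry = generate_proof(thm, extra_prompt=RETRY_PROMPT)
        if typechecks(retry):
            sft_pairs.append((thm, retry))
    return sft_pairs
```

With the real verifier, the filter step is where the ~8% retry success rate bites: almost all retries are discarded, leaving a very small SFT set.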
Configuration
| Setting | Value |
| --- | --- |
| Model | Kimina-1.5B |
| Dataset | FLC-120k |
| Eval benchmark | MiniF2F (pass@1) |
| Training steps | 500 |
| Learning rate | 2e-5 |
| LoRA rank | 16 |
| Seed | 42 |
| Verifier | Lean 4 type-checker |
| Retry success rate | ~8% (very low) |
| Hardware | 1x H200 (p5en.48xl) |
| Runtime | ~12h (Lean type-checking is slow) |
Results
| Method | Baseline | Post-training | Delta |
| --- | --- | --- | --- |
| OPSD have-skeleton | 39.3% | 42.6% | +3.3pp |
| OPSD proof-completion | 39.3% | 41.4% | +2.1pp |
| Baseline (no training) | 39.3% | 39.3% | 0.0pp |
| STaR retry (no PI) | 39.3% | 36.9% | -2.4pp |
Retry HURTS Lean. This is not a null result; it is actively negative. Training on the model's own retry solutions makes it WORSE at theorem proving. The retry solutions are low-quality (only ~8% succeed) and the successful ones may teach superficial patterns (e.g., using sorry in different ways) rather than genuine proof strategies.
Why Retry Hurts Lean
- Extremely low retry success rate (~8%): In math, 49% of retries succeed. In Lean, only ~8% do. The model's second attempt almost never produces a valid proof because the search space is too constrained.
- Narrow solution paths: Most MiniF2F theorems have 2-3 viable proof strategies. The model's first attempt typically uses its strongest strategy. On retry, it tries minor variations of the SAME strategy (different tactic orderings) rather than fundamentally different approaches.
- Low-quality successes: The ~8% of retries that do compile may succeed for superficial reasons (e.g., the theorem happens to be provable by simp or omega and the retry accidentally hits this). Training on these does not teach genuine proof planning.
- Negative transfer: The few "successful" retry proofs may teach the model bad habits (over-relying on automation tactics, avoiding structured decomposition) that make it worse on harder theorems.
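The "accidental automation" failure mode can be made concrete with a toy goal (illustrative, not from MiniF2F; assumes Mathlib for ℕ notation and the ring tactic):

```lean
-- A structured attempt decomposes the goal before closing it:
example (a b : ℕ) (h : a = 2 * b) : a + b = 3 * b := by
  rw [h]   -- rewrite a as 2 * b, leaving 2 * b + b = 3 * b
  ring     -- close by ring normalization

-- A retry can "succeed" with a single automation tactic,
-- exercising no proof planning at all:
example (a b : ℕ) (h : a = 2 * b) : a + b = 3 * b := by
  omega
```

Both proofs type-check, so the filter cannot distinguish them; SFT on the second kind teaches tactic-spamming rather than decomposition.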
Training Curves
At: /data/ughai-sandbox/opsd_experiments/lean_retry/. Very few training examples (~300 from 120k attempts at 8% success). SFT on this sparse data overfits quickly. Validation loss increases after step 100.
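Given the divergence after step 100, validation-based early stopping would have selected a pre-divergence checkpoint. A minimal sketch, with a synthetic loss curve standing in for the actual training logs (the patience value and helper name are assumptions, not the experiment's setup):

```python
# Early-stopping sketch for the sparse-data SFT run: keep the checkpoint
# with the best validation loss, stopping after `patience` consecutive
# non-improving evals. The loss history below is synthetic.

def early_stop_step(val_losses, patience=3):
    """Return the index of the checkpoint to keep."""
    best_step, best_loss, bad_evals = 0, float("inf"), 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_step, best_loss, bad_evals = step, loss, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break  # loss has diverged; stop training
    return best_step

# Synthetic curve mimicking the reported shape: improves early,
# then diverges as the model overfits the tiny retry dataset.
losses = [1.9, 1.5, 1.2, 1.1, 1.15, 1.3, 1.5, 1.8]
best = early_stop_step(losses)  # index of the best checkpoint
```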
Interpretation
This result completes the domain-dependent picture:
| Domain | Solution diversity | Retry effect | PI effect |
| --- | --- | --- | --- |
| Math | High (10+ approaches) | +8.8pp | +0.04pp (null) |
| Code | High (many implementations) | +23.2pp | +2.4pp (weak) |
| Lean | Low (2-3 strategies) | -2.4pp | +3.3pp |
The rule: When solution-path diversity is high, retry works and PI is irrelevant. When diversity is low, retry fails and structural PI is genuinely needed. The boundary is determined by whether the model can FIND alternative correct strategies through random exploration.
Connection to Other Experiments
Have-Skeleton OPSD (+3.3pp) - the contrast
Structural PI succeeds where retry fails. The 5.7pp gap (3.3 vs -2.4) is the genuine value of PI in constrained domains. PI provides strategy selection that retry cannot achieve.
Math Retry (+8.8pp) - the opposite outcome
Math has high solution diversity, so retry finds alternatives easily. The contrast with Lean confirms: diversity is the moderating variable.
Prediction Note (pre-registered) - partially correct
The pre-registered prediction was +1.5-2.5pp (weaker than math, but positive). The actual -2.4pp falls well below the predicted lower bound of +1.5pp. The prediction correctly identified the direction (weaker than math) but missed the sign entirely, underestimating how badly retry would fail.
Paper Implication - domain boundary
This defines the paper's key nuance: "retry is all you need" holds for math and code, but formal verification needs structural PI. This gives PI distillation a defensible niche.