Lean Type-Checker PI (Decisive Experiment)

This experiment uses Lean type-checker error messages as privileged information (PI) for retry. It tests the hypothesis that the domain boundary is caused by verifier strictness: if the model gets specific compiler feedback about WHY it failed, can it find alternatives? Result: -1.2pp. Better than bare retry (-2.4pp) but still negative. Compiler feedback is necessary but insufficient for proof-strategy revision.

Hypothesis

The "verifier quality" hypothesis predicts that Lean retry fails because the model only knows it failed, not WHY. If we provide the actual type-checker error messages (e.g., "type mismatch: expected Nat, got Int" or "unknown identifier 'foo'"), the model should be able to diagnose its failure and select a different proof strategy. This would close the gap between bare retry (-2.4pp) and structural PI (+3.3pp).

Prediction (if verifier-quality is correct): +1 to +3pp (approaching skeleton OPSD).

Prediction (if path-narrowness is correct): Still negative; error messages cannot create new proof paths that don't exist.

Actual: -1.2pp. The truth is in between: type-checker feedback halves the damage vs bare retry (from -2.4 to -1.2) but cannot turn negative to positive. Both hypotheses are partially correct.

| Method | Delta | Feedback level |
|---|---|---|
| Have-Skeleton OPSD | +3.3pp | Structural strategy guidance |
| Type-Checker PI | -1.2pp | Knows WHY it failed |
| Bare Retry | -2.4pp | Only knows it failed |

Method

  1. First attempt: Generate one proof per theorem from Kimina-1.5B at T=0.7.
  2. Type-check: Run Lean 4 type-checker on each proof. Collect PASS/FAIL status AND the full error message for failures.
  3. Retry with error: For failed theorems, prompt with the specific type-checker error: "Your proof failed with the following error: [ERROR MESSAGE]. Please try a different approach."
  4. Filter: Keep only retry proofs that successfully type-check.
  5. SFT: Fine-tune on (theorem, correct_retry) pairs.

This is identical to bare retry except the retry prompt includes the actual compiler error message, giving the model specific diagnostic information about why its first attempt failed.
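
As a concrete sketch, the loop below implements steps 1-5. `generate_proof` and `lean_check` are hypothetical stand-ins for the actual sampling and Lean 4 verification harness; their names and signatures are assumptions, not the real API:

```python
# Sketch of the type-checker-PI retry pipeline (steps 1-5 above).
# `generate_proof(prompt, temperature) -> str` and
# `lean_check(theorem, proof) -> (ok, error_msg)` are hypothetical
# stand-ins for the model-sampling and Lean 4 verification harness.

RETRY_TEMPLATE = (
    "Your proof failed with the following error: {error}. "
    "Please try a different approach."
)

def build_sft_pairs(theorems, generate_proof, lean_check):
    sft_pairs = []
    for theorem in theorems:
        # Step 1: one first attempt per theorem at T=0.7.
        proof = generate_proof(theorem, temperature=0.7)

        # Step 2: type-check; keep the full error message on failure.
        ok, error_msg = lean_check(theorem, proof)
        if ok:
            continue  # only failed theorems enter the retry stage

        # Step 3: retry with the specific compiler error as PI.
        retry_prompt = f"{theorem}\n{RETRY_TEMPLATE.format(error=error_msg)}"
        retry_proof = generate_proof(retry_prompt, temperature=0.7)

        # Step 4: keep only retries that now type-check.
        if lean_check(theorem, retry_proof)[0]:
            # Step 5: the SFT pair is (theorem, correct_retry); the error
            # message is privileged and is dropped from the training input.
            sft_pairs.append((theorem, retry_proof))
    return sft_pairs
```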

Configuration

| Setting | Value |
|---|---|
| Model | Kimina-1.5B |
| Dataset | FLC-120k |
| Eval benchmark | MiniF2F (pass@1) |
| Training steps | 500 |
| Learning rate | 2e-5 |
| LoRA rank | 16 |
| Seed | 42 |
| Verifier | Lean 4 type-checker |
| PI type | Type-checker error messages |
| Retry success rate | ~11% (vs ~8% for bare retry) |
| Hardware | 1x H200 (p5en.48xl) |
| Runtime | ~14h |
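
This configuration maps onto a standard LoRA SFT setup. A minimal sketch using Hugging Face `peft` and `transformers` (the framework itself is an assumption; values not in the table, such as `lora_alpha` and batch size, are guesses and marked as such):

```python
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

# Values taken from the configuration table; anything not listed there
# (alpha, dropout, batch size, precision) is an assumption.
lora_config = LoraConfig(
    r=16,                            # LoRA rank (from table)
    lora_alpha=32,                   # assumption: common 2*r convention
    lora_dropout=0.05,               # assumption
    task_type=TaskType.CAUSAL_LM,
)

training_args = TrainingArguments(
    output_dir="kimina15b-typechecker-pi-sft",
    max_steps=500,                   # training steps (from table)
    learning_rate=2e-5,              # learning rate (from table)
    seed=42,                         # seed (from table)
    per_device_train_batch_size=8,   # assumption
    bf16=True,                       # assumption: H200 supports bf16
)
```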

Results

| Method | Baseline | Post-training | Delta | vs Bare Retry |
|---|---|---|---|---|
| OPSD have-skeleton (structural PI) | 39.3% | 42.6% | +3.3pp | +5.7pp better |
| Baseline (no training) | 39.3% | 39.3% | 0.0pp | +2.4pp better |
| Type-checker PI retry | 39.3% | 38.1% | -1.2pp | +1.2pp better |
| Failure-diagnosis PI retry | 39.3% | 38.1% | -1.2pp | +1.2pp better |
| Bare retry (no PI) | 39.3% | 36.9% | -2.4pp | baseline |
| Heuristic 3-iteration | 39.3% | 35.7% | -3.6pp | -1.2pp worse |

The type-checker PI halves the damage but cannot turn negative to positive. The error message helps the model avoid the SAME mistake (retry success rate: 11% vs 8%) but does not enable fundamentally different proof strategies. The model still explores minor variations of its original approach rather than switching to a different proof path. Only structural PI (have-skeleton) provides strategy-level guidance.

Why Type-Checker Errors Are Insufficient

  1. Tactical vs strategic feedback: Type-checker errors are local ("this term has wrong type") while proof failures are global (wrong proof strategy). Knowing that Nat.succ_lt_succ failed does not tell the model to try induction instead of case analysis (see the Lean sketch after this list).
  2. Error messages are symptoms, not diagnoses: A type mismatch at step 5 often means the proof structure was wrong at step 1. The model cannot infer the correct high-level change from a low-level error.
  3. Retry still explores locally: Even with the error, the model makes minimal edits (fix the one failing tactic) rather than restructuring the proof. The 11% success rate (vs 8% bare) shows marginal improvement in local fixes, not strategic revision.
  4. Structural PI provides strategy selection: Have-skeleton tells the model to "first prove X, then Y, then combine." This is strategy-level guidance that the type-checker cannot provide. The 4.5pp gap (+3.3 vs -1.2) measures the value of strategic vs tactical feedback.
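
To make the tactical/strategic distinction concrete, here is a toy Lean 4 sketch (illustrative only, not drawn from the experiment's data):

```lean
-- Tactical feedback is local: the compiler flags one bad term.
-- A failed attempt like
--   exact Nat.succ_lt_succ h    -- type mismatch: the goal is ≤, the term proves <
-- says WHERE the term is wrong, not that the approach should change.

-- Strategic feedback (have-skeleton style) lays out the decomposition:
-- "first prove X, then combine." The model only fills in the leaves.
example (n : Nat) : n ≤ n + 1 := by
  have h1 : n ≤ Nat.succ n := Nat.le_succ n  -- first prove the intermediate goal
  exact h1  -- n + 1 is definitionally Nat.succ n, so h1 closes the goal
```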

Comparison with Failure-Diagnosis PI

An LLM-generated failure diagnosis (asking the model to first explain WHY its proof failed, then retry) gives the same -1.2pp result. This shows that the issue is not the FORMAT of the feedback (raw error vs natural language diagnosis) but its LEVEL: tactical feedback at any level of abstraction cannot substitute for structural strategy guidance.

| PI Type | Source | Level | Result |
|---|---|---|---|
| Type-checker error | Lean compiler | Tactical (local) | -1.2pp |
| Failure diagnosis | LLM self-reflection | Tactical (explained) | -1.2pp |
| Have-skeleton | Teacher model | Strategic (global) | +3.3pp |
| None (bare retry) | - | None | -2.4pp |
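
The three levels in the table correspond to three retry-prompt shapes. Hypothetical templates, for illustration only:

```python
# Hypothetical retry-prompt templates for the three PI levels above.
# Only the have-skeleton template changes the LEVEL of the feedback;
# the first two differ only in format.

TACTICAL_RAW = (        # type-checker PI: -1.2pp
    "Your proof failed with the following error: {lean_error}. "
    "Please try a different approach."
)

TACTICAL_EXPLAINED = (  # failure-diagnosis PI: -1.2pp
    "First explain WHY the proof below failed, then write a new proof.\n"
    "{failed_proof}"
)

STRATEGIC = (           # have-skeleton PI: +3.3pp
    "Complete the proof following this skeleton:\n"
    "  have h1 : {subgoal_1} := ?_\n"
    "  have h2 : {subgoal_2} := ?_\n"
    "  -- combine h1 and h2 to close the goal"
)
```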

What This Resolves

This experiment resolves two open hypotheses from Day 2:

  1. Verifier-quality hypothesis: partially correct. Telling the model WHY it failed halves the regression (-2.4pp to -1.2pp), so feedback quality does matter.
  2. Path-narrowness hypothesis: partially correct. Even exact compiler diagnoses cannot turn the result positive, because error messages cannot create proof paths that do not exist.

Implication for the Unified Theory

The updated theory has a clean hierarchy for Lean:

  1. Structural PI (strategy guidance): +3.3pp - Tell the model WHICH approach to take
  2. Tactical PI (error feedback): -1.2pp - Tell the model WHERE it went wrong (insufficient)
  3. No PI (bare retry): -2.4pp - Tell the model only that it failed
  4. Iterative heuristic: -3.6pp - Repeat the failure (compounds errors)

For math/code (high path diversity), ANY level of PI works because the model can find alternatives. For Lean (low effective diversity given the type-checker's strictness), only strategy-level PI enables the model to succeed.

Connection to Other Experiments

Lean Bare Retry (-2.4pp) - improved upon but still negative
Type-checker PI halves the regression (from -2.4 to -1.2) by helping the model avoid repeating the exact same error. But tactical improvement is not strategic improvement.
Have-Skeleton OPSD (+3.3pp) - the gap that matters
The 4.5pp gap between type-checker PI (-1.2) and structural PI (+3.3) is the measured value of strategy-level guidance for Lean. This is genuinely privileged information.
Failure-Diagnosis PI (-1.2pp) - identical result, different mechanism
LLM self-reflection on errors gives the same result as raw compiler output. The model's ability to diagnose failure does not improve on the compiler's diagnosis. Both are tactical.
Lean Heuristic 3-Iter (-3.6pp) - progressive degradation
Multiple rounds of retry-with-errors compound damage. Each round's "successful" retries are lower quality, and training on them progressively corrupts the model's proof capabilities.
Math Retry (+8.8pp) - why it works there
In math, ANY feedback (even gibberish) activates alternative reasoning paths because there ARE many valid approaches. In Lean, alternative paths barely exist (given type-checker strictness), so no form of retry can help.