Lean Type-Checker PI (Decisive Experiment)
Using Lean type-checker error messages as privileged information for retry. This tests the hypothesis that the domain boundary is caused by verifier strictness: if the model gets specific compiler feedback about WHY it failed, can it find alternatives? Result: -1.2pp. Better than bare retry (-2.4pp) but still negative. Compiler feedback is necessary but insufficient for proof strategy revision.
Result: -1.2pp (NEGATIVE) · Domain: LEAN · Status: DECISIVE
Hypothesis
The "verifier quality" hypothesis predicts that Lean retry fails because the model only knows it failed, not WHY. If we provide the actual type-checker error messages (e.g., "type mismatch: expected Nat, got Int" or "unknown identifier 'foo'"), the model should be able to diagnose its failure and select a different proof strategy. This would close the gap between bare retry (-2.4pp) and structural PI (+3.3pp).
Prediction (if verifier-quality is correct): +1 to +3pp (approaching skeleton OPSD).
Prediction (if path-narrowness is correct): Still negative; error messages cannot create new proof paths that don't exist.
Actual: -1.2pp. The outcome falls between the two predictions: type-checker feedback halves the damage relative to bare retry (from -2.4pp to -1.2pp) but cannot turn a negative result positive. Both hypotheses are partially correct.
| Method | Delta | Feedback level |
|---|---|---|
| Have-Skeleton OPSD | +3.3pp | Structural strategy guidance |
| Type-Checker PI | -1.2pp | Knows WHY it failed |
| Bare Retry | -2.4pp | Only knows it failed |
Method
- First attempt: Generate one proof per theorem from Kimina-1.5B at T=0.7.
- Type-check: Run Lean 4 type-checker on each proof. Collect PASS/FAIL status AND the full error message for failures.
- Retry with error: For failed theorems, prompt with the specific type-checker error: "Your proof failed with the following error: [ERROR MESSAGE]. Please try a different approach."
- Filter: Keep only retry proofs that successfully type-check.
- SFT: Fine-tune on (theorem, correct_retry) pairs.
This is identical to bare retry except the retry prompt includes the actual compiler error message, giving the model specific diagnostic information about why its first attempt failed.
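The pipeline above can be sketched as a single loop. The helper names (`generate_proof`, `lean_typecheck`) and the prompt wording are stand-ins for the experiment's actual tooling, not its real code:

```python
# Sketch of the type-checker-PI retry pipeline (hypothetical helper names).

def retry_with_error_pipeline(theorems, generate_proof, lean_typecheck,
                              temperature=0.7):
    """Collect (theorem, correct_retry) SFT pairs, using the Lean
    type-checker's error message as privileged information."""
    sft_pairs = []
    for theorem in theorems:
        # First attempt at T=0.7.
        proof = generate_proof(theorem, temperature=temperature)
        ok, error_msg = lean_typecheck(theorem, proof)
        if ok:
            continue  # only failed first attempts are retried

        # Retry prompt includes the actual compiler error (the PI).
        retry_prompt = (
            f"{theorem}\nYour proof failed with the following error: "
            f"{error_msg}. Please try a different approach."
        )
        retry_proof = generate_proof(retry_prompt, temperature=temperature)

        # Filter: keep only retry proofs that successfully type-check.
        ok, _ = lean_typecheck(theorem, retry_proof)
        if ok:
            sft_pairs.append((theorem, retry_proof))
    return sft_pairs
```

Bare retry is the same loop with `error_msg` dropped from the retry prompt; the error message is the only added ingredient.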
Configuration
| Parameter | Value |
|---|---|
| Model | Kimina-1.5B |
| Dataset | FLC-120k |
| Eval benchmark | MiniF2F (pass@1) |
| Training steps | 500 |
| Learning rate | 2e-5 |
| LoRA rank | 16 |
| Seed | 42 |
| Verifier | Lean 4 type-checker |
| PI type | Type-checker error messages |
| Retry success rate | ~11% (slightly better than bare 8%) |
| Hardware | 1x H200 (p5en.48xl) |
| Runtime | ~14h |
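For concreteness, the configuration above could be captured as a single dict fed to an SFT script; the key names here are illustrative, not the experiment's actual config schema:

```python
# Hypothetical run config mirroring the configuration table above.
RUN_CONFIG = {
    "model": "Kimina-1.5B",
    "dataset": "FLC-120k",
    "eval_benchmark": "MiniF2F",   # scored as pass@1
    "training_steps": 500,
    "learning_rate": 2e-5,
    "lora_rank": 16,
    "seed": 42,
    "verifier": "lean4-typechecker",
    "pi_type": "typechecker_error_messages",
}
```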
Results
| Method | Baseline | Post-training | Delta | vs Bare Retry |
|---|---|---|---|---|
| OPSD have-skeleton (structural PI) | 39.3% | 42.6% | +3.3pp | +5.7pp better |
| Baseline (no training) | 39.3% | 39.3% | 0.0pp | +2.4pp better |
| Type-checker PI retry | 39.3% | 38.1% | -1.2pp | +1.2pp better |
| Failure-diagnosis PI retry | 39.3% | 38.1% | -1.2pp | +1.2pp better |
| Bare retry (no PI) | 39.3% | 36.9% | -2.4pp | baseline |
| Heuristic 3-iteration | 39.3% | 35.7% | -3.6pp | -1.2pp worse |
The type-checker PI halves the damage but cannot turn negative to positive. The error message helps the model avoid the SAME mistake (retry success rate: 11% vs 8%) but does not enable fundamentally different proof strategies. The model still explores minor variations of its original approach rather than switching to a different proof path. Only structural PI (have-skeleton) provides strategy-level guidance.
Why Type-Checker Errors Are Insufficient
- Tactical vs strategic feedback: Type-checker errors are local ("this term has wrong type") while proof failures are global (wrong proof strategy). Knowing that Nat.succ_lt_succ failed does not tell the model to try induction instead of case analysis.
- Error messages are symptoms, not diagnoses: A type mismatch at step 5 often means the proof structure was wrong at step 1. The model cannot infer the correct high-level change from a low-level error.
- Retry still explores locally: Even with the error, the model makes minimal edits (fix the one failing tactic) rather than restructuring the proof. The 11% success rate (vs 8% bare) shows marginal improvement in local fixes, not strategic revision.
- Structural PI provides strategy selection: Have-skeleton tells the model to "first prove X, then Y, then combine." This is strategy-level guidance that the type-checker cannot provide. The 4.5pp gap (+3.3 vs -1.2) measures the value of strategic vs tactical feedback.
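To make the tactical/strategic contrast concrete, here is a hypothetical Lean sketch; the goal names `P`, `Q`, `R`, the lemma `combine`, and the error text are all invented placeholders, not from the experiment:

```lean
-- Tactical PI: a local type-checker error, a symptom at one step.
--   error: type mismatch
--     expected: Nat
--     got:      Int
-- It says WHERE the proof broke, not WHICH strategy to use instead.

-- Strategic PI: a have-skeleton that dictates the global proof shape.
theorem goal : P := by
  have h1 : Q := by sorry   -- "first prove X"
  have h2 : R := by sorry   -- "then prove Y"
  exact combine h1 h2       -- "then combine"
```

The skeleton constrains the model to a working global decomposition, leaving only the local subgoals to fill in; the error message constrains nothing about the decomposition at all.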
Comparison with Failure-Diagnosis PI
An LLM-generated failure diagnosis (asking the model to first explain WHY its proof failed, then retry) gives the same -1.2pp result. This shows that the issue is not the FORMAT of the feedback (raw error vs natural language diagnosis) but its LEVEL: tactical feedback at any level of abstraction cannot substitute for structural strategy guidance.
| PI Type | Source | Level | Result |
|---|---|---|---|
| Type-checker error | Lean compiler | Tactical (local) | -1.2pp |
| Failure diagnosis | LLM self-reflection | Tactical (explained) | -1.2pp |
| Have-skeleton | Teacher model | Strategic (global) | +3.3pp |
| None (bare retry) | - | None | -2.4pp |
What This Resolves
This experiment resolves two open hypotheses from Day 2:
- "Verifier quality as binding constraint": PARTIALLY CORRECT. Better verifier feedback (type-checker errors) improves over bare retry by 1.2pp, but cannot make retry positive. Verifier quality is A factor, not THE factor.
- "Path diversity not the full story": RESOLVED. Both path narrowness AND strategic guidance deficiency explain Lean's domain boundary. The model needs to know WHICH proof strategy to use (structural PI), not just WHERE its current attempt went wrong (type-checker errors).
Implication for the Unified Theory
The updated theory has a clean hierarchy for Lean:
- Structural PI (strategy guidance): +3.3pp - Tell the model WHICH approach to take
- Tactical PI (error feedback): -1.2pp - Tell the model WHERE it went wrong (insufficient)
- No PI (bare retry): -2.4pp - Tell the model only that it failed
- Iterative heuristic: -3.6pp - Repeat the failure (compounds errors)
For math/code (high path diversity), ANY level of PI works because the model can find alternatives. For Lean (low effective diversity given the type-checker's strictness), only strategy-level PI enables the model to succeed.
Connection to Other Experiments
Lean Bare Retry (-2.4pp) - improved upon but still negative
Type-checker PI halves the regression (from -2.4 to -1.2) by helping the model avoid repeating the exact same error. But tactical improvement is not strategic improvement.
Have-Skeleton OPSD (+3.3pp) - the gap that matters
The 4.5pp gap between type-checker PI (-1.2) and structural PI (+3.3) is the measured value of strategy-level guidance for Lean. This is genuinely privileged information.
Failure-Diagnosis PI (-1.2pp) - identical result, different mechanism
LLM self-reflection on errors gives the same result as raw compiler output. The model's ability to diagnose failure does not improve on the compiler's diagnosis. Both are tactical.
Lean Heuristic 3-Iter (-3.6pp) - progressive degradation
Multiple rounds of retry-with-errors compound damage. Each round's "successful" retries are lower quality, and training on them progressively corrupts the model's proof capabilities.
Math Retry (+8.8pp) - why it works there
In math, ANY feedback (even gibberish) activates alternative reasoning paths because there ARE many valid approaches. In Lean, alternative paths barely exist (given type-checker strictness), so no form of retry can help.