Lean Type-Checker PI (Decisive Experiment)

This experiment uses Lean type-checker error messages as privileged information (PI) for retry. It tests the hypothesis that the domain boundary is caused by verifier strictness: if the model gets specific compiler feedback about WHY it failed, can it find alternatives? Result: -1.2pp. Better than bare retry (-2.4pp) but still negative. Compiler feedback is necessary but insufficient for proof-strategy revision.

Hypothesis

The "verifier quality" hypothesis predicts that Lean retry fails because the model only knows it failed, not WHY. If we provide the actual type-checker error messages (e.g., "type mismatch: expected Nat, got Int" or "unknown identifier 'foo'"), the model should be able to diagnose its failure and select a different proof strategy. This would close the gap between bare retry (-2.4pp) and structural PI (+3.3pp).

Prediction (if verifier-quality is correct): +1 to +3pp (approaching skeleton OPSD).

Prediction (if path-narrowness is correct): Still negative; error messages cannot create new proof paths that don't exist.

Actual: -1.2pp. The truth is in between: type-checker feedback halves the damage vs bare retry (from -2.4 to -1.2) but cannot turn negative to positive. Both hypotheses are partially correct.

| Method | Delta | Feedback level |
|---|---|---|
| Have-Skeleton OPSD | +3.3pp | Structural strategy guidance |
| Type-Checker PI | -1.2pp | Knows WHY it failed |
| Bare Retry | -2.4pp | Only knows it failed |

Method

  1. First attempt: Generate one proof per theorem from Kimina-1.5B at T=0.7.
  2. Type-check: Run Lean 4 type-checker on each proof. Collect PASS/FAIL status AND the full error message for failures.
  3. Retry with error: For failed theorems, prompt with the specific type-checker error: "Your proof failed with the following error: [ERROR MESSAGE]. Please try a different approach."
  4. Filter: Keep only retry proofs that successfully type-check.
  5. SFT: Fine-tune on (theorem, correct_retry) pairs.

This is identical to bare retry except the retry prompt includes the actual compiler error message, giving the model specific diagnostic information about why its first attempt failed.
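
As a concrete sketch, the loop below implements steps 1-5. `generate_proof` and `lean_check` are hypothetical stand-ins for the actual sampling and Lean 4 verification harness; their names and signatures are assumptions, not the real API:

```python
# Sketch of the type-checker-PI retry pipeline (steps 1-5 above).
# `generate_proof(prompt, temperature) -> str` and
# `lean_check(theorem, proof) -> (ok, error_msg)` are hypothetical
# stand-ins for the model-sampling and Lean 4 verification harness.

RETRY_TEMPLATE = (
    "Your proof failed with the following error: {error}. "
    "Please try a different approach."
)

def build_sft_pairs(theorems, generate_proof, lean_check):
    sft_pairs = []
    for theorem in theorems:
        # Step 1: one first attempt per theorem at T=0.7.
        proof = generate_proof(theorem, temperature=0.7)

        # Step 2: type-check; keep the full error message on failure.
        ok, error_msg = lean_check(theorem, proof)
        if ok:
            continue  # only failed theorems enter the retry stage

        # Step 3: retry with the specific compiler error as PI.
        retry_prompt = f"{theorem}\n{RETRY_TEMPLATE.format(error=error_msg)}"
        retry_proof = generate_proof(retry_prompt, temperature=0.7)

        # Step 4: keep only retries that now type-check.
        if lean_check(theorem, retry_proof)[0]:
            # Step 5: the SFT pair is (theorem, correct_retry); the error
            # message is privileged and is dropped from the training input.
            sft_pairs.append((theorem, retry_proof))
    return sft_pairs
```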

Configuration

| Setting | Value |
|---|---|
| Model | Kimina-1.5B |
| Dataset | FLC-120k |
| Eval benchmark | MiniF2F (pass@1) |
| Training steps | 500 |
| Learning rate | 2e-5 |
| LoRA rank | 16 |
| Seed | 42 |
| Verifier | Lean 4 type-checker |
| PI type | Type-checker error messages |
| Retry success rate | ~11% (vs ~8% for bare retry) |
| Hardware | 1x H200 (p5en.48xl) |
| Runtime | ~14h |
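
This configuration maps onto a standard LoRA SFT setup. A minimal sketch using Hugging Face `peft` and `transformers` (the framework itself is an assumption; values not in the table, such as `lora_alpha` and batch size, are guesses and marked as such):

```python
from peft import LoraConfig, TaskType
from transformers import TrainingArguments

# Values taken from the configuration table; anything not listed there
# (alpha, dropout, batch size, precision) is an assumption.
lora_config = LoraConfig(
    r=16,                            # LoRA rank (from table)
    lora_alpha=32,                   # assumption: common 2*r convention
    lora_dropout=0.05,               # assumption
    task_type=TaskType.CAUSAL_LM,
)

training_args = TrainingArguments(
    output_dir="kimina15b-typechecker-pi-sft",
    max_steps=500,                   # training steps (from table)
    learning_rate=2e-5,              # learning rate (from table)
    seed=42,                         # seed (from table)
    per_device_train_batch_size=8,   # assumption
    bf16=True,                       # assumption: H200 supports bf16
)
```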

Results

| Method | Baseline | Post-training | Delta | vs Bare Retry |
|---|---|---|---|---|
| OPSD have-skeleton (structural PI) | 39.3% | 42.6% | +3.3pp | +5.7pp better |
| Baseline (no training) | 39.3% | 39.3% | 0.0pp | +2.4pp better |
| Type-checker PI retry | 39.3% | 38.1% | -1.2pp | +1.2pp better |
| Failure-diagnosis PI retry | 39.3% | 38.1% | -1.2pp | +1.2pp better |
| Bare retry (no PI) | 39.3% | 36.9% | -2.4pp | baseline |
| Heuristic 3-iteration | 39.3% | 35.7% | -3.6pp | -1.2pp worse |

The type-checker PI halves the damage but cannot turn negative to positive. The error message helps the model avoid the SAME mistake (retry success rate: 11% vs 8%) but does not enable fundamentally different proof strategies. The model still explores minor variations of its original approach rather than switching to a different proof path. Only structural PI (have-skeleton) provides strategy-level guidance.

Why Type-Checker Errors Are Insufficient

  1. Tactical vs strategic feedback: Type-checker errors are local ("this term has wrong type") while proof failures are global (wrong proof strategy). Knowing that Nat.succ_lt_succ failed does not tell the model to try induction instead of case analysis (see the Lean sketch after this list).
  2. Error messages are symptoms, not diagnoses: A type mismatch at step 5 often means the proof structure was wrong at step 1. The model cannot infer the correct high-level change from a low-level error.
  3. Retry still explores locally: Even with the error, the model makes minimal edits (fix the one failing tactic) rather than restructuring the proof. The 11% success rate (vs 8% bare) shows marginal improvement in local fixes, not strategic revision.
  4. Structural PI provides strategy selection: Have-skeleton tells the model to "first prove X, then Y, then combine." This is strategy-level guidance that the type-checker cannot provide. The 4.5pp gap (+3.3 vs -1.2) measures the value of strategic vs tactical feedback.
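
To make the tactical/strategic distinction concrete, here is a toy Lean 4 sketch (illustrative only, not drawn from the experiment's data):

```lean
-- Tactical feedback is local: the compiler flags one bad term.
-- A failed attempt like
--   exact Nat.succ_lt_succ h    -- type mismatch: the goal is ≤, the term proves <
-- says WHERE the term is wrong, not that the approach should change.

-- Strategic feedback (have-skeleton style) lays out the decomposition:
-- "first prove X, then combine." The model only fills in the leaves.
example (n : Nat) : n ≤ n + 1 := by
  have h1 : n ≤ Nat.succ n := Nat.le_succ n  -- first prove the intermediate goal
  exact h1  -- n + 1 is definitionally Nat.succ n, so h1 closes the goal
```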

Comparison with Failure-Diagnosis PI

An LLM-generated failure diagnosis (asking the model to first explain WHY its proof failed, then retry) gives the same -1.2pp result. This shows that the issue is not the FORMAT of the feedback (raw error vs natural language diagnosis) but its LEVEL: tactical feedback at any level of abstraction cannot substitute for structural strategy guidance.

| PI Type | Source | Level | Result |
|---|---|---|---|
| Type-checker error | Lean compiler | Tactical (local) | -1.2pp |
| Failure diagnosis | LLM self-reflection | Tactical (explained) | -1.2pp |
| Have-skeleton | Teacher model | Strategic (global) | +3.3pp |
| None (bare retry) | - | None | -2.4pp |
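
The three levels in the table correspond to three retry-prompt shapes. Hypothetical templates, for illustration only:

```python
# Hypothetical retry-prompt templates for the three PI levels above.
# Only the have-skeleton template changes the LEVEL of the feedback;
# the first two differ only in format.

TACTICAL_RAW = (        # type-checker PI: -1.2pp
    "Your proof failed with the following error: {lean_error}. "
    "Please try a different approach."
)

TACTICAL_EXPLAINED = (  # failure-diagnosis PI: -1.2pp
    "First explain WHY the proof below failed, then write a new proof.\n"
    "{failed_proof}"
)

STRATEGIC = (           # have-skeleton PI: +3.3pp
    "Complete the proof following this skeleton:\n"
    "  have h1 : {subgoal_1} := ?_\n"
    "  have h2 : {subgoal_2} := ?_\n"
    "  -- combine h1 and h2 to close the goal"
)
```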

What This Resolves

This experiment resolves two open hypotheses from Day 2:

  1. Verifier-quality hypothesis: partially correct. Telling the model WHY it failed halves the regression (-2.4pp to -1.2pp), so feedback quality does matter.
  2. Path-narrowness hypothesis: partially correct. Even exact compiler diagnoses cannot turn the result positive, because error messages cannot create proof paths that do not exist.

Implication for the Unified Theory

The updated theory has a clean hierarchy for Lean:

  1. Structural PI (strategy guidance): +3.3pp - Tell the model WHICH approach to take
  2. Tactical PI (error feedback): -1.2pp - Tell the model WHERE it went wrong (insufficient)
  3. No PI (bare retry): -2.4pp - Tell the model only that it failed
  4. Iterative heuristic: -3.6pp - Repeat the failure (compounds errors)

For math/code (high path diversity), ANY level of PI works because the model can find alternatives. For Lean (low effective diversity given the type-checker's strictness), only strategy-level PI enables the model to succeed.

Connection to Other Experiments

Lean Bare Retry (-2.4pp) - improved upon but still negative
Type-checker PI halves the regression (from -2.4 to -1.2) by helping the model avoid repeating the exact same error. But tactical improvement is not strategic improvement.
Have-Skeleton OPSD (+3.3pp) - the gap that matters
The 4.5pp gap between type-checker PI (-1.2) and structural PI (+3.3) is the measured value of strategy-level guidance for Lean. This is genuinely privileged information.
Failure-Diagnosis PI (-1.2pp) - identical result, different mechanism
LLM self-reflection on errors gives the same result as raw compiler output. The model's ability to diagnose failure does not improve on the compiler's diagnosis. Both are tactical.
Lean Heuristic 3-Iter (-3.6pp) - progressive degradation
Multiple rounds of retry-with-errors compound damage. Each round's "successful" retries are lower quality, and training on them progressively corrupts the model's proof capabilities.
Math Retry (+8.8pp) - why it works there
In math, ANY feedback (even gibberish) activates alternative reasoning paths because there ARE many valid approaches. In Lean, alternative paths barely exist (given type-checker strictness), so no form of retry can help.