PI Distillation: The Story of Discovery

3-day research sprint, May 9-11, 2026

Result categories: Breakthrough · Confirmatory · Null / Negative · Theoretical Revision
Day 1 — May 9, 2026
OPSD 5-Seed Null Confirmed +0.04pp
Baseline OPSD shows no meaningful gain across 5 seeds. The method as originally conceived is dead; something else is doing the work.
STaR Try-Again +8.8pp
Simply retrying after failure yields massive gains. The retry signal, not PI content, is the active ingredient in self-improvement loops.
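
A minimal sketch of the try-again data construction, assuming generic sample_fn / verify_fn interfaces; the names and the retry-prompt format below are illustrative, not the sprint's actual code.

    from typing import Callable, List, Tuple

    def star_try_again(
        problems: List[str],
        sample_fn: Callable[[str], str],        # assumed: one model completion per prompt
        verify_fn: Callable[[str, str], bool],  # assumed: pass/fail check for (problem, answer)
    ) -> List[Tuple[str, str]]:
        """Keep first-try successes; after a failure, retry once and keep only verified retries."""
        pairs: List[Tuple[str, str]] = []
        for prob in problems:
            first = sample_fn(prob)
            if verify_fn(prob, first):
                pairs.append((prob, first))
                continue
            # The retry is conditioned on having failed: the bad attempt goes back into the prompt.
            retry_prompt = f"{prob}\n\nPrevious attempt (incorrect):\n{first}\n\nTry again:"
            second = sample_fn(retry_prompt)
            if verify_fn(prob, second):
                pairs.append((prob, second))  # train on (clean problem, successful retry)
        return pairs
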
Wrong Answers as PI +9.8pp
Feeding wrong answers as "privileged information" works nearly as well as correct ones. PI content is irrelevant; the signal comes from the retry structure itself.
Gibberish as PI +10.8pp
Even random gibberish tokens produce gains. This definitively proves the mechanism is retry-induced distribution shift, not information transfer.
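
A hedged sketch of how the "PI" slot in the retry prompt can be varied across the three ablations; the template and token choices are illustrative, not the exact ones used.

    import random
    import string

    def build_pi_retry_prompt(problem: str, failed_attempt: str, pi: str) -> str:
        # Illustrative template: the retry prompt carries the failed attempt plus a
        # "privileged information" block whose content is swapped per condition.
        return (
            f"Problem:\n{problem}\n\n"
            f"Previous attempt (incorrect):\n{failed_attempt}\n\n"
            f"Hint:\n{pi}\n\n"
            f"Try again:"
        )

    def gibberish_pi(n_tokens: int = 16) -> str:
        # "Gibberish as PI" condition: random lowercase strings stand in for the hint.
        return " ".join("".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(n_tokens))

    # Conditions compared: pi = reference solution (genuine PI), pi = a wrong solution
    # (+9.8pp), pi = gibberish_pi() (+10.8pp) -- all three feed the same retry structure.
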
Code Retry +23.2pp
Code domain shows the largest retry effect. Structured outputs with clear pass/fail signals amplify the retry mechanism dramatically.
Double-Sample vs Retry +1.4pp
Double-sampling (no failure signal) yields only +1.4pp vs retry's +8.8pp. The failure signal is roughly 6x as effective as mere diversity, confirming selective pressure as the mechanism.
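
The control differs from the retry loop above only in that the second draw is never conditioned on a failure; a sketch under the same assumed interfaces.

    def double_sample(problems, sample_fn, verify_fn):
        """Control: same two-draw budget as retry, but neither draw ever sees a failure;
        verified attempts are kept, so any gain here comes from diversity alone."""
        pairs = []
        for prob in problems:
            for attempt in (sample_fn(prob), sample_fn(prob)):
                if verify_fn(prob, attempt):
                    pairs.append((prob, attempt))
                    break
        return pairs
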
Graded vs Binary Verifier +4.2pp vs +12.9pp
Graded verifier (+4.2pp) underperforms binary (+12.9pp). Harsh binary rejection creates stronger selective pressure than nuanced feedback.
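
One way to realize the two verifier regimes is as different filtering/weighting rules on candidate pairs; a sketch assuming a verifier that returns a score in [0, 1].

    from typing import Callable, List, Tuple

    def filter_binary(candidates: List[Tuple[str, str]],
                      score_fn: Callable[[str, str], float]) -> List[Tuple[str, str, float]]:
        """Binary verifier: hard accept/reject at full marks; kept examples all weigh 1.0."""
        return [(p, a, 1.0) for p, a in candidates if score_fn(p, a) >= 1.0]

    def filter_graded(candidates: List[Tuple[str, str]],
                      score_fn: Callable[[str, str], float],
                      min_score: float = 0.2) -> List[Tuple[str, str, float]]:
        """Graded verifier: keep partial credit above a floor and weight the SFT loss by it."""
        kept = []
        for p, a in candidates:
            s = score_fn(p, a)
            if s >= min_score:
                kept.append((p, a, s))
        return kept
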
Lean Have-Skeleton (Genuine PI) +3.3pp
Providing proof skeletons in Lean shows genuine PI transfer. Lean is the one domain where structured hints actually help beyond retry alone.
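
For concreteness, a toy (hypothetical) Lean example of what a have-skeleton looks like as PI: the intermediate statements are supplied, and the proofs are left as sorry for the model to fill in.

    -- Toy skeleton (illustrative, not from the experiment): the `have` statements
    -- fix the proof structure; the model only has to discharge the `sorry`s.
    theorem mul_succ_example (a b : Nat) : a * (b + 1) = a * b + a := by
      have h1 : a * (b + 1) = a * b + a * 1 := by sorry
      have h2 : a * 1 = a := by sorry
      sorry
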
Day 2 — May 10, 2026
Lean Retry -2.4pp
Retry actually hurts in Lean! Unlike code, Lean's type system makes "just try again" counterproductive. The failure mode is conceptual, not stochastic.
8B LR Sweep 0 / +2.8 / +1.0pp
Learning rate sensitivity resolved: lr=1e-6 is optimal for 8B. Too high destabilizes, too low undertrains. Standard hyperparameter finding.
Iterative 2 Rounds +14.5pp
Running the retry loop for 2 iterations compounds gains. Each round filters harder, pushing the model further along the success distribution.
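
The two-round setup is the single-round collection wrapped in an outer loop, retraining between rounds; a sketch with assumed collect/train interfaces.

    def iterate_rounds(model, problems, collect_fn, train_fn, n_rounds: int = 2):
        """collect_fn(model, problems) builds verified (prompt, answer) pairs with the
        current model (e.g. via the retry loop above); train_fn(model, pairs) returns
        a fine-tuned model. n_rounds = 2 was the best setting found in the sprint."""
        for _ in range(n_rounds):
            pairs = collect_fn(model, problems)
            model = train_fn(model, pairs)   # each round filters against the improved model
        return model
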
Frontier-Rejection Mechanism +1.2pp
Quantified the frontier-rejection contribution at +1.2pp. Confirms that rejection sampling at the capability boundary adds a modest but real boost.
Temperature Schedule +1.2 / +2.6pp
Temperature annealing helps, but primarily through increased failure signal (higher temp = more failures = stronger selective pressure), not through diversity.
OPSD Dynamic Curriculum +1.6pp
Dynamic curriculum scheduling gains +1.6pp but underperforms static selection. Adaptive difficulty doesn't help when the mechanism is binary pass/fail filtering.
Path Diversity Surprise (comparable)
Both code and Lean show similar path diversity metrics. The difference in retry effectiveness isn't about solution space structure but about error signal informativeness.
Day 3 — May 11, 2026
Lean Type-Checker Feedback -1.2pp
Adding type-checker errors as feedback still hurts. The verifier signal helps selection, but retry remains negative in Lean regardless of error detail.
Lean Failure-Diagnosis -1.2pp
Explicit failure diagnosis doesn't help either. Lean errors are conceptual dead-ends, not surface mistakes; more error info can't fix wrong proof strategies.
8B Full Fine-Tune (verified) +9.6pp
Full fine-tuning crushes LoRA: +9.6pp verified. LoRA was the bottleneck all along. The retry signal needs full parameter updates to be absorbed properly.
K-Sweep: K=3 Optimal +10.1pp
K=3 retries is the sweet spot (+10.1pp). K=5 regresses, likely due to training on increasingly desperate/low-quality attempts that corrupt the signal.
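
The single retry from Day 1 generalizes to up to K conditioned attempts; a sketch using the same illustrative prompt format as before.

    def retry_up_to_k(prob, sample_fn, verify_fn, k: int = 3):
        """Up to k attempts; each retry sees the most recent failed attempt. Only a
        verified attempt is returned for training; None means the problem is dropped."""
        prompt = prob
        for _ in range(k):
            attempt = sample_fn(prompt)
            if verify_fn(prob, attempt):
                return attempt
            prompt = f"{prob}\n\nPrevious attempt (incorrect):\n{attempt}\n\nTry again:"
        return None
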
Cross-Model Retry +4.2pp
Cross-model retry works (+4.2pp) but self-retry is 2x better. The model's own failure distribution is more informative for its own improvement.
Iterative 3 Rounds: R3 Regresses
Third iteration hurts. Two rounds is the optimum; beyond that, the model overfits to the filtered distribution and loses generalization.

Key Takeaways

20 experiments run
+23.2pp best single gain (Code Retry)
~6x failure signal vs mere diversity
K=3, R=2 optimal config