PI Distillation: The Story of Discovery

3-day research sprint, May 9-11, 2026

Result categories: Breakthrough · Confirmatory · Null / Negative · Theoretical Revision
Day 1 — May 9, 2026
OPSD 5-Seed Null Confirmed +0.04pp
Baseline OPSD shows no meaningful gain across 5 seeds. The method as originally conceived is dead; something else is doing the work.
STaR Try-Again +8.8pp
Simply retrying after failure yields massive gains. The retry signal, not PI content, is the active ingredient in self-improvement loops.
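
A minimal sketch of the try-again data construction, assuming generic sample_fn / verify_fn interfaces; the names and the retry-prompt format below are illustrative, not the sprint's actual code.

    from typing import Callable, List, Tuple

    def star_try_again(
        problems: List[str],
        sample_fn: Callable[[str], str],        # assumed: one model completion per prompt
        verify_fn: Callable[[str, str], bool],  # assumed: pass/fail check for (problem, answer)
    ) -> List[Tuple[str, str]]:
        """Keep first-try successes; after a failure, retry once and keep only verified retries."""
        pairs: List[Tuple[str, str]] = []
        for prob in problems:
            first = sample_fn(prob)
            if verify_fn(prob, first):
                pairs.append((prob, first))
                continue
            # The retry is conditioned on having failed: the bad attempt goes back into the prompt.
            retry_prompt = f"{prob}\n\nPrevious attempt (incorrect):\n{first}\n\nTry again:"
            second = sample_fn(retry_prompt)
            if verify_fn(prob, second):
                pairs.append((prob, second))  # train on (clean problem, successful retry)
        return pairs
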
Wrong Answers as PI +9.8pp
Feeding wrong answers as "privileged information" works nearly as well as correct ones. PI content is irrelevant; the signal comes from the retry structure itself.
Gibberish as PI +10.8pp
Even random gibberish tokens produce gains. This definitively proves the mechanism is retry-induced distribution shift, not information transfer.
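
A hedged sketch of how the "PI" slot in the retry prompt can be varied across the three ablations; the template and token choices are illustrative, not the exact ones used.

    import random
    import string

    def build_pi_retry_prompt(problem: str, failed_attempt: str, pi: str) -> str:
        # Illustrative template: the retry prompt carries the failed attempt plus a
        # "privileged information" block whose content is swapped per condition.
        return (
            f"Problem:\n{problem}\n\n"
            f"Previous attempt (incorrect):\n{failed_attempt}\n\n"
            f"Hint:\n{pi}\n\n"
            f"Try again:"
        )

    def gibberish_pi(n_tokens: int = 16) -> str:
        # "Gibberish as PI" condition: random lowercase strings stand in for the hint.
        return " ".join("".join(random.choices(string.ascii_lowercase, k=5)) for _ in range(n_tokens))

    # Conditions compared: pi = reference solution (genuine PI), pi = a wrong solution
    # (+9.8pp), pi = gibberish_pi() (+10.8pp) -- all three feed the same retry structure.
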
Code Retry +23.2pp
Code domain shows the largest retry effect. Structured outputs with clear pass/fail signals amplify the retry mechanism dramatically.
Double-Sample vs Retry +1.4pp
Double-sampling (no failure signal) yields only +1.4pp vs retry's +8.8pp. The failure signal is roughly 6x as effective as mere diversity, confirming selective pressure as the mechanism.
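
The control differs from the retry loop above only in that the second draw is never conditioned on a failure; a sketch under the same assumed interfaces.

    def double_sample(problems, sample_fn, verify_fn):
        """Control: same two-draw budget as retry, but neither draw ever sees a failure;
        verified attempts are kept, so any gain here comes from diversity alone."""
        pairs = []
        for prob in problems:
            for attempt in (sample_fn(prob), sample_fn(prob)):
                if verify_fn(prob, attempt):
                    pairs.append((prob, attempt))
                    break
        return pairs
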
Graded vs Binary Verifier +4.2pp vs +12.9pp
Graded verifier (+4.2pp) underperforms binary (+12.9pp). Harsh binary rejection creates stronger selective pressure than nuanced feedback.
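
One way to realize the two verifier regimes is as different filtering/weighting rules on candidate pairs; a sketch assuming a verifier that returns a score in [0, 1].

    from typing import Callable, List, Tuple

    def filter_binary(candidates: List[Tuple[str, str]],
                      score_fn: Callable[[str, str], float]) -> List[Tuple[str, str, float]]:
        """Binary verifier: hard accept/reject at full marks; kept examples all weigh 1.0."""
        return [(p, a, 1.0) for p, a in candidates if score_fn(p, a) >= 1.0]

    def filter_graded(candidates: List[Tuple[str, str]],
                      score_fn: Callable[[str, str], float],
                      min_score: float = 0.2) -> List[Tuple[str, str, float]]:
        """Graded verifier: keep partial credit above a floor and weight the SFT loss by it."""
        kept = []
        for p, a in candidates:
            s = score_fn(p, a)
            if s >= min_score:
                kept.append((p, a, s))
        return kept
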
Lean Have-Skeleton (Genuine PI) +3.3pp
Providing proof skeletons in Lean shows genuine PI transfer. Lean is the one domain where structured hints actually help beyond retry alone.
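
For concreteness, a toy (hypothetical) Lean example of what a have-skeleton looks like as PI: the intermediate statements are supplied, and the proofs are left as sorry for the model to fill in.

    -- Toy skeleton (illustrative, not from the experiment): the `have` statements
    -- fix the proof structure; the model only has to discharge the `sorry`s.
    theorem mul_succ_example (a b : Nat) : a * (b + 1) = a * b + a := by
      have h1 : a * (b + 1) = a * b + a * 1 := by sorry
      have h2 : a * 1 = a := by sorry
      sorry
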
Day 2 — May 10, 2026
Lean Retry -2.4pp
Retry actually hurts in Lean! Unlike code, Lean's type system makes "just try again" counterproductive. The failure mode is conceptual, not stochastic.
8B LR Sweep 0 / +2.8 / +1.0pp
Learning rate sensitivity resolved: lr=1e-6 is optimal for 8B. Too high destabilizes, too low undertrains. Standard hyperparameter finding.
Iterative 2 Rounds +14.5pp
Running the retry loop for 2 iterations compounds gains. Each round filters harder, pushing the model further along the success distribution.
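
The two-round setup is the single-round collection wrapped in an outer loop, retraining between rounds; a sketch with assumed collect/train interfaces.

    def iterate_rounds(model, problems, collect_fn, train_fn, n_rounds: int = 2):
        """collect_fn(model, problems) builds verified (prompt, answer) pairs with the
        current model (e.g. via the retry loop above); train_fn(model, pairs) returns
        a fine-tuned model. n_rounds = 2 was the best setting found in the sprint."""
        for _ in range(n_rounds):
            pairs = collect_fn(model, problems)
            model = train_fn(model, pairs)   # each round filters against the improved model
        return model
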
Frontier-Rejection Mechanism +1.2pp
Quantified the frontier-rejection contribution at +1.2pp. Confirms that rejection sampling at the capability boundary adds a modest but real boost.
Temperature Schedule +1.2 / +2.6pp
Temperature annealing helps, but primarily through increased failure signal (higher temp = more failures = stronger selective pressure), not through diversity.
OPSD Dynamic Curriculum +1.6pp
Dynamic curriculum scheduling gains +1.6pp but underperforms static selection. Adaptive difficulty doesn't help when the mechanism is binary pass/fail filtering.
Path Diversity Surprise (comparable)
Both code and Lean show similar path diversity metrics. The difference in retry effectiveness isn't about solution space structure but about error signal informativeness.
Day 3 — May 11, 2026
Lean Type-Checker Feedback -1.2pp
Adding type-checker errors as feedback still hurts. The verifier signal helps selection, but retry remains negative in Lean regardless of error detail.
Lean Failure-Diagnosis -1.2pp
Explicit failure diagnosis doesn't help either. Lean errors are conceptual dead-ends, not surface mistakes; more error info can't fix wrong proof strategies.
8B Full Fine-Tune (verified) +9.6pp
Full fine-tuning crushes LoRA: +9.6pp verified. LoRA was the bottleneck all along. The retry signal needs full parameter updates to be absorbed properly.
K-Sweep: K=3 Optimal +10.1pp
K=3 retries is the sweet spot (+10.1pp). K=5 regresses, likely due to training on increasingly desperate/low-quality attempts that corrupt the signal.
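
The single retry from Day 1 generalizes to up to K conditioned attempts; a sketch using the same illustrative prompt format as before.

    def retry_up_to_k(prob, sample_fn, verify_fn, k: int = 3):
        """Up to k attempts; each retry sees the most recent failed attempt. Only a
        verified attempt is returned for training; None means the problem is dropped."""
        prompt = prob
        for _ in range(k):
            attempt = sample_fn(prompt)
            if verify_fn(prob, attempt):
                return attempt
            prompt = f"{prob}\n\nPrevious attempt (incorrect):\n{attempt}\n\nTry again:"
        return None
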
Cross-Model Retry +4.2pp
Cross-model retry works (+4.2pp) but self-retry is 2x better. The model's own failure distribution is more informative for its own improvement.
Iterative 3 Rounds: R3 Regresses
Third iteration hurts. Two rounds is the optimum; beyond that, the model overfits to the filtered distribution and loses generalization.

Key Takeaways

20 experiments run
+23.2pp best single gain (Code Retry)
~6x failure signal vs mere diversity
K=3, R=2 optimal config