3-day research sprint (May 8-10, 2026) — 60+ experiments, 4 domains, 3 model scales
We set out to discover whether Privileged Information (PI) from a teacher/oracle can accelerate student LLM training. The answer: the content of the information does not matter. What matters is the one-bit failure signal ("you got it wrong, try again") combined with a binary verifier that filters correct retries into SFT data. This simple pipeline yields +8-12pp on math and +23pp on code, and the gains compound across iterations.
Question: Can teacher-provided hints, answers, or structured guidance (Privileged Information) help a student LLM learn faster than it could from its own trial-and-error?
Why it matters: If PI content drives learning, we need expensive oracle systems (strong models, proof assistants, execution environments) to generate high-quality guidance. If it does not, the self-improvement recipe simplifies dramatically: just a verifier and a retry prompt.
Setup: Qwen3-1.7B/8B on MATH-500 (primary), Qwen2.5-Coder-1.5B/7B on HumanEval/MBPP, Kimina-1.5B on miniF2F (Lean). We compare gold-answer STaR, gibberish-target STaR, wrong-answer STaR, and bare "try again" STaR, all with 3-seed confidence intervals.
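To make the recipe concrete, here is a minimal sketch of one retry-and-filter round in Python. Every name here (`model.generate`, `verify`, `prob.prompt`, the retry wording) is a hypothetical stand-in for whatever sampling and checking infrastructure you already have, not the exact experimental harness.

```python
def retry_star_round(model, problems, verify, k_retries=8):
    """One round of retry-STaR: harvest verified retries on failed
    problems as SFT data. All helper names are hypothetical."""
    sft_data = []
    for prob in problems:
        first = model.generate(prob.prompt)  # first attempt
        if verify(prob, first):
            continue  # already solved; nothing to harvest
        # The one-bit failure signal: no hint, no answer, just "wrong".
        retry_prompt = f"{prob.prompt}\n{first}\nYou got it wrong, try again.\n"
        for _ in range(k_retries):
            retry = model.generate(retry_prompt)
            if verify(prob, retry):  # binary verifier as the only filter
                sft_data.append((prob.prompt, retry))
                break
    return sft_data  # SFT on (prompt, verified solution); then iterate
```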
Four PI conditions, all statistically indistinguishable (Qwen3-1.7B, MATH-500, 500 steps, 3 seeds each):
| Condition | PI Given to Student | Gain (pp) | 95% CI | Interpretation |
|---|---|---|---|---|
| SD-Zero (gold answer) | Correct final answer | +12.0 | ± 0.7 | Ceiling (oracle) |
| Gibberish "XYZZY" | Nonsense string | +10.8 | ± 1.7 | Same as gold |
| Wrong answers | Shuffled (incorrect) answers | +9.8 | ± 0.1 | Tightest CI; PI irrelevant |
| Bare "try again" | Nothing (just retry prompt) | +8.8 | ± 1.4 | No information needed |
The CIs of adjacent conditions overlap, and the ordering (gold > gibberish > wrong > bare) is not statistically significant at 3 seeds: the ~3pp gold-to-bare spread is comparable to within-condition variance.
The critical control experiment: instead of retrying failed problems, we generate N=16 independent first-attempt solutions per problem and filter the correct ones into SFT data. This matches compute but removes the failure signal.
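Reusing the hypothetical helpers from the sketch above, the control swaps the failure-conditioned retry loop for independent first attempts at the same sampling budget:

```python
def matched_compute_control(model, problems, verify, n_samples=16):
    """Matched-compute control: N independent first attempts, no
    failure signal, same binary verifier and filtering."""
    sft_data = []
    for prob in problems:
        for _ in range(n_samples):
            attempt = model.generate(prob.prompt)  # no retry framing
            if verify(prob, attempt):
                sft_data.append((prob.prompt, attempt))
    return sft_data  # same verifier, same budget; only +1.2pp
```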
Interpretation: When the model knows it failed, its retry distribution shifts toward novel strategies it would not explore from scratch. This is not just "more samples" or "sample diversity"; it is a qualitatively different generative mode triggered by awareness of failure. The failure signal concentrates probability mass on unexplored solution paths.
| Control | Result | What It Rules Out |
|---|---|---|
| Frontier-rejection N=16 | +1.2pp | Not just "harder problems" (same frontier, no failure signal) |
| Temperature schedule (no failure) | +1.2pp | Not diversity alone (varied temps without failure awareness) |
| Uniform retry T=0.7 x8 | +2.6pp | Not repeated sampling (8 tries without failure framing is still ~5x worse) |
| Graded verifier (partial credit) | +4.2pp | Binary filter is better; near-miss solutions dilute signal |
| Binary filter (standard retry) | +12.9pp | Binary pass/fail is a feature, not a limitation |
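The last two rows compare filtering policies. A sketch of the difference, with `verify` and `score` as hypothetical verifier interfaces and an illustrative threshold:

```python
def binary_filter(candidates, verify):
    # Keep only solutions the verifier accepts outright (+12.9pp above).
    return [c for c in candidates if verify(c)]

def graded_filter(candidates, score, threshold=0.7):
    # Partial credit (+4.2pp above): near-miss solutions clear the
    # threshold and dilute the SFT signal with subtly wrong steps.
    return [c for c in candidates if score(c) >= threshold]
```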
Math: +8.8pp (1.7B); code: +23.2pp (HumanEval). These domains have high path diversity: many distinct correct solutions exist, and the model can explore them after failure.
Lean theorem proving: retry gives -2.4pp (heuristic verifier) or -1.2pp (type-checker). Iterating makes it worse: 3 rounds reach -3.6pp. Even compiler error diagnostics do not help.
What works instead: OPSD with have-skeleton PI (subgoal decomposition) gives +3.3pp. Lean has low path diversity (proofs are structurally constrained) and strict verification. The model cannot explore novel paths by retrying; it needs structural scaffolding to know where to go.
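To make "have-skeleton PI" concrete, here is a hypothetical Lean 4 illustration: the oracle supplies the `have` subgoal structure, and the student's job reduces to filling each `sorry`.

```lean
-- Hypothetical have-skeleton: the PI is the subgoal decomposition,
-- not the proof terms. The student replaces each `sorry`.
theorem demo (m n : Nat) : m + n = n + m := by
  have h0 : m + 0 = 0 + m := by sorry
  have hs : ∀ k, m + k = k + m → m + (k + 1) = (k + 1) + m := by sorry
  sorry
```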
Takeaway: Retry works when (1) path diversity is high and (2) the verifier is reliable. Lean violates both: narrow paths + noisy heuristic verifiers compound errors.
Retry gain depends on baseline competence: a model that is too weak (cannot solve anything even on retry) or too strong (nothing left to solve) sees zero gain. The sweet spot is a 30-55% baseline pass rate.
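Expressed as a rule of thumb (thresholds taken from the sweep above; illustrative, not calibrated):

```python
def retry_regime(baseline_pass_rate: float) -> str:
    """Where retry-SFT is expected to pay off, per the sweep above."""
    if baseline_pass_rate < 0.30:
        return "too weak: retries rarely succeed, so the filter keeps almost nothing"
    if baseline_pass_rate > 0.55:
        return "too strong: few failures remain to convert into training data"
    return "sweet spot: expect large retry gains"
```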
LoRA at 8B is nearly inert (+0.0 to +2.8pp depending on LR). The rank-limited adapter cannot shift the model's distribution sufficiently. Full fine-tuning restores the effect (~+11.8pp provisional), confirming this is a plasticity bottleneck, not a fundamental scaling limit.
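For reference, a sketch of the two regimes, assuming a Hugging Face transformers + peft stack; the model id, rank, and target modules are illustrative, not the exact experimental config.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# LoRA regime: a rank-limited adapter on the attention projections.
# At 8B this was nearly inert in our runs (+0.0 to +2.8pp).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(base, lora_cfg)  # trains <1% of parameters

# Full fine-tuning regime: every weight trainable (the default);
# this restores the retry effect (~+11.8pp provisional).
full_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
```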
The 32B result (+0.6pp) is competence saturation (baseline 68.3%), NOT a scaling ceiling. A 32B model on a harder benchmark where its baseline is 30-55% would likely show large gains.