PI Distillation: Executive Summary

3-day research sprint (May 8-10, 2026) — 60+ experiments, 4 domains, 3 model scales

Failure-aware retry + filter + SFT is the dominant self-improvement mechanism for LLMs with verifiers. PI content is irrelevant.

We set out to discover whether Privileged Information (PI) from a teacher/oracle can accelerate student LLM training. The answer: the information does not matter. What matters is the one-bit failure signal ("you got it wrong, try again") combined with a binary verifier to filter correct retries into SFT data. This simple pipeline yields +8-12pp on math, +23pp on code, and compounds across iterations.

The Setup

Question: Can teacher-provided hints, answers, or structured guidance (Privileged Information) help a student LLM learn faster than it could from its own trial-and-error?

Why it matters: If PI content drives learning, we need expensive oracle systems (strong models, proof assistants, execution environments) to generate high-quality guidance. If it does not, the self-improvement recipe simplifies dramatically: just a verifier and a retry prompt.

Setup: Qwen3-1.7B/8B on MATH-500 (primary), Qwen2.5-Coder-1.5B/7B on HumanEval/MBPP, Kimina-1.5B on miniF2F (Lean). We compare gold-answer STaR, gibberish-target STaR, wrong-answer STaR, and bare "try again" STaR, all with 3-seed confidence intervals.

The Headline Result: PI Content Does Not Matter

Four PI conditions, all statistically indistinguishable (Qwen3-1.7B, MATH-500, 500 steps, 3 seeds each):

| Condition | PI Given to Student | Gain (pp) | 95% CI | Interpretation |
|---|---|---|---|---|
| SD-Zero (gold answer) | Correct final answer | +12.0 | ±0.7 | Ceiling (oracle) |
| Gibberish "XYZZY" | Nonsense string | +10.8 | ±1.7 | Same as gold |
| Wrong answers | Shuffled (incorrect) answers | +9.8 | ±0.1 | Tightest CI; PI irrelevant |
| Bare "try again" | Nothing (just retry prompt) | +8.8 | ±1.4 | No information needed |

All CIs overlap. The slight ordering (gold > gibberish > wrong > bare) is not statistically significant, and the ~3pp spread across conditions is smaller than the within-condition variance for most conditions.
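For concreteness, here are illustrative retry-prompt templates for the four conditions (the wording is our own reconstruction, not verbatim from the experiments):

```python
# Illustrative retry-prompt templates for the four PI conditions.
# The wording is our own; {answer} is filled with the gold answer for
# SD-Zero and with an answer shuffled from another problem for "wrong".
PI_RETRY_PROMPTS = {
    "sd_zero_gold": "That was incorrect. The correct answer is {answer}. Try again.",
    "gibberish":    "That was incorrect. Hint: XYZZY. Try again.",
    "wrong_answer": "That was incorrect. The correct answer is {answer}. Try again.",
    "bare_retry":   "That was incorrect. Try again.",
}
```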

The Mechanism: Why Failure Signal Works

The critical control experiment: instead of retrying failed problems, we generate N=16 independent first-attempt solutions and filter correct ones into SFT. This matches compute but removes the failure signal.

Retry (failure-aware): +10.1pp vs. N=16 first-attempt: +1.4pp, a 7x ratio.
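A minimal sketch of the two conditions, written against assumed `generate(prompt, temperature)` and `verify(completion)` callables (both hypothetical, as is the retry wording):

```python
from typing import Callable, List

def retry_condition(
    prompt: str,
    generate: Callable[[str, float], str],  # (prompt, temperature) -> completion
    verify: Callable[[str], bool],          # completion -> passed verifier?
    n_retries: int = 16,
) -> List[str]:
    """Failure-aware: the model is shown that its first attempt failed."""
    first = generate(prompt, 0.0)
    if verify(first):
        return []  # solved on the first pass; not a failure case
    retry_prompt = f"{prompt}\n{first}\nThat was incorrect. Try again.\n"
    retries = [generate(retry_prompt, 0.7) for _ in range(n_retries)]
    return [r for r in retries if verify(r)]

def first_attempt_condition(
    prompt: str,
    generate: Callable[[str, float], str],
    verify: Callable[[str], bool],
    n_samples: int = 16,
) -> List[str]:
    """Matched-compute control: N independent first attempts, no failure signal."""
    samples = [generate(prompt, 0.7) for _ in range(n_samples)]
    return [s for s in samples if verify(s)]
```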

Interpretation: When the model knows it failed, its retry distribution shifts toward novel strategies it would not explore from scratch. This is not just "more samples" or "sample diversity"; it is a qualitatively different generative mode triggered by awareness of failure. The failure signal concentrates probability mass on unexplored solution paths.

Supporting Evidence

| Control | Result | What It Rules Out |
|---|---|---|
| Frontier-rejection N=16 | +1.2pp | Not just "harder problems" (same frontier, no failure signal) |
| Temperature schedule (no failure) | +1.2pp | Not diversity alone (varied temps without failure awareness) |
| Uniform retry T=0.7 ×8 | +2.6pp | Not repeated sampling (8 tries without failure framing is still ~4x worse) |
| Graded verifier (partial credit) | +4.2pp | Binary filter is better; near-miss solutions dilute the signal |
| Binary filter (standard retry) | +12.9pp | Binary pass/fail is a feature, not a limitation |
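The contrast in the last two rows amounts to a one-line difference in the filtering rule. A sketch (the `score` callable and 0.5 threshold are illustrative):

```python
from typing import Callable, List

def binary_filter(candidates: List[str], verify: Callable[[str], bool]) -> List[str]:
    """Keep only fully verified solutions; discard near-misses outright."""
    return [c for c in candidates if verify(c)]

def graded_filter(candidates: List[str], score: Callable[[str], float],
                  threshold: float = 0.5) -> List[str]:
    """Partial-credit variant: admits near-misses above a score threshold.
    Per the table above, this dilutes the SFT signal (+4.2pp vs +12.9pp)."""
    return [c for c in candidates if score(c) >= threshold]
```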

Domain Boundary: Where Retry Fails

Math and Code: Retry Works Brilliantly

Math: +8.8pp (Qwen3-1.7B, MATH-500); code: +23.2pp (HumanEval). These domains have high path diversity: many distinct correct solutions exist, and the model can explore them after failure.

Lean: Retry Fails; Structural PI Required

Lean theorem proving: retry gives -2.4pp (heuristic verifier) or -1.2pp (type-checker). Iterating makes it worse: 3 rounds reach -3.6pp. Even compiler error diagnostics do not help.

What works instead: OPSD with have-skeleton PI (subgoal decomposition) gives +3.3pp. Lean has low path diversity (proofs are structurally constrained) and strict verification. The model cannot explore novel paths by retrying; it needs structural scaffolding to know where to go.
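To make "have-skeleton" concrete, here is a toy Lean 4 (Mathlib) example of our own construction, not drawn from miniF2F: the teacher supplies the intermediate have statements, and the student's job reduces to discharging each sorry.

```lean
import Mathlib

-- Toy have-skeleton PI (illustrative example, not from miniF2F):
-- the subgoal decomposition is given; the student fills in each `sorry`.
theorem toy_skeleton (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h1 : 0 ≤ a ^ 2 := sorry  -- student subgoal 1
  have h2 : 0 ≤ b ^ 2 := sorry  -- student subgoal 2
  exact add_nonneg h1 h2
```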

Takeaway: Retry works when (1) path diversity is high and (2) the verifier is reliable. Lean violates both: narrow paths + noisy heuristic verifiers compound errors.

Scaling: The Inverted-U

Retry gain depends on baseline competence. Too weak (cannot solve anything on retry) or too strong (nothing left to solve) both yield zero gain. The sweet spot is 30-55% baseline pass rate.

| Gain (pp) | Setting (baseline pass rate) | Regime |
|---|---|---|
| +1.9 | MBPP (5.8%) | Left tail (too weak) |
| +8.8 | MATH (37%) | Sweet spot (30-55%) |
| +23.2 | HumanEval (54%) | Sweet spot (30-55%) |
| +11.8 | 8B full-FT (35%) | Sweet spot (30-55%) |
| -0.4 | 7B MBPP (66%) | Right tail (too competent) |
| +0.6 | 32B MATH (68%) | Right tail (too competent) |

8B Scaling Note

LoRA at 8B is nearly inert (+0.0 to +2.8pp depending on LR). The rank-limited adapter cannot shift the model's distribution sufficiently. Full fine-tuning restores the effect (~+11.8pp provisional), confirming this is a plasticity bottleneck, not a fundamental scaling limit.

The 32B result (+0.6pp) is competence saturation (baseline 68.3%), NOT a scaling ceiling. A 32B model on a harder benchmark where its baseline is 30-55% would likely show large gains.

Practitioner Recipe

  1. Measure baseline pass rate on your target benchmark. If it is below 20% or above 60%, retry will not help much. Pick a benchmark in the 30-55% sweet spot.
  2. Generate attempts, filter failures. Run greedy or low-temperature generation. Identify problems the model gets wrong.
  3. Retry failed problems with the prompt "Try again" (or any text; content is irrelevant). Generate 1-8 retries per failed problem at temperature 0.7-1.0.
  4. Binary filter: Keep only retries that pass your verifier (exact match for math, test suite for code). Discard partial credit or near-misses.
  5. SFT on correct retries. Standard supervised fine-tuning for 200-500 steps. Use full fine-tuning for models larger than ~4B parameters (LoRA is insufficient).
  6. Iterate once. Repeat steps 2-5 from the new checkpoint. Expect ~40-50% of round-1 gain. Do NOT iterate a third time (round 3 regresses).
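A minimal end-to-end sketch of steps 2-4, again against assumed `generate` and `verify` callables (all names and prompt wording are ours):

```python
from typing import Callable, List, Tuple

def build_retry_sft_data(
    problems: List[str],
    generate: Callable[[str, float], str],  # (prompt, temperature) -> completion
    verify: Callable[[str, str], bool],     # (problem, completion) -> passed?
    n_retries: int = 8,
    temperature: float = 0.7,
) -> List[Tuple[str, str]]:
    """Steps 2-4: attempt, detect failures, retry, binary-filter."""
    sft_pairs: List[Tuple[str, str]] = []
    for prob in problems:
        first = generate(prob, 0.0)  # step 2: greedy first attempt
        if verify(prob, first):
            continue                 # solved; not a failure case
        retry_prompt = f"{prob}\n{first}\nTry again.\n"  # step 3: any text works
        for _ in range(n_retries):
            retry = generate(retry_prompt, temperature)
            if verify(prob, retry):  # step 4: binary filter, no partial credit
                sft_pairs.append((prob, retry))  # pair clean problem with correct retry
    return sft_pairs

# Step 5: SFT on sft_pairs for 200-500 steps (full fine-tuning above ~4B).
# Step 6: reload the new checkpoint and run build_retry_sft_data once more.
```

Pairing the clean problem with the correct retry, rather than the failure-framed prompt, is a design choice here: it teaches the fine-tuned model to produce the correct solution on its first attempt.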

What Does Not Work