Frequently Asked Questions

Common questions (and skeptical ones) about PI distillation via retry

Validity

"Doesn't the model just memorize the training problems?"

No. Training and evaluation use completely disjoint datasets.

Training: NuminaMath-CoT-10k (10k problems)
Evaluation: MATH-500 (held-out, zero overlap)

The model never sees MATH-500 problems during training. All reported gains (+10pp on 1.7B) reflect generalization to unseen problems, not memorization of training data.
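The disjointness claim is easy to check mechanically. Below is a minimal sketch (the dataset-loading step is omitted; `normalize` and `find_overlap` are illustrative helpers, not names from the paper) that compares normalized problem texts across the two splits:

```python
def normalize(problem: str) -> str:
    """Collapse whitespace and lowercase so trivial variants are caught."""
    return " ".join(problem.lower().split())

def find_overlap(train_problems, eval_problems):
    """Return problems appearing in both splits (empty set = disjoint)."""
    return {normalize(p) for p in train_problems} & {normalize(p) for p in eval_problems}

# Toy stand-ins for NuminaMath-CoT-10k and MATH-500 problem texts.
train = ["Compute 2 + 2.", "Solve x^2 = 9."]
held_out = ["Find the roots of x^2 - 5x + 6."]
assert find_overlap(train, held_out) == set()
```

Exact string matching is the weakest form of decontamination; n-gram or embedding-based checks catch paraphrased duplicates, but the zero-overlap claim here is about distinct source datasets.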

"How is this different from rejection sampling?"

Rejection sampling (best-of-N on first attempts) only gives +1.4pp. Retry gives +10pp. The crucial difference is the failure signal.

In rejection sampling, you generate N independent attempts and keep the best. The model never learns from failure. In retry-based PI distillation, the model first fails, receives a "try again" signal, and then succeeds. Training on these retry-successes teaches the model self-correction behavior that transfers to single-shot evaluation.

Rejection sampling (best-of-N): +1.4pp
Retry-based PI distillation: +10.0pp
Gap: retry is ~7x more effective
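The structural difference between the two data-collection loops can be sketched as below. `sample_solution` is a hypothetical stub standing in for a model call plus answer verification (the success probabilities are illustrative, not measured); the key contrast is that only the retry loop ever puts a failure and a "try again" signal into the training transcript:

```python
import random

def sample_solution(problem, history=()):
    """Hypothetical stub: one model attempt, returns (solution, is_correct).
    Toy probabilities only; a retry signal in the history raises the odds."""
    p_correct = 0.6 if history else 0.3
    return ("...worked solution...", random.random() < p_correct)

def best_of_n(problem, n=4):
    """Rejection sampling: N independent attempts, keep a correct one.
    No failure signal ever enters the kept data."""
    for _ in range(n):
        sol, ok = sample_solution(problem)
        if ok:
            return sol
    return None

def retry_transcript(problem):
    """Retry-based collection: keep fail -> 'try again' -> success traces."""
    first, ok = sample_solution(problem)
    if ok:
        return None  # first-try successes carry no retry signal
    retry, ok = sample_solution(problem, history=(first, "try again"))
    return (first, "try again", retry) if ok else None
```

Training on `retry_transcript` outputs exposes the model to its own failures plus the correction, which is what transfers to single-shot evaluation.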

Scale

"Does this work at larger scales?"

Yes, with caveats.

At 8B with full fine-tuning: +9.6pp (strong). With LoRA at 8B: only +2.8pp (LoRA is insufficient for internalizing retry behavior at this scale).

At 32B, MATH-500 is already too easy (68% baseline), so the model sits in the right tail of its capability distribution: there are too few problems where it fails and then succeeds to generate meaningful training signal.

1.7B full FT: +10.0pp
8B full FT: +9.6pp
8B LoRA: +2.8pp (insufficient)
32B: ceiling effect (68% baseline)

"Can you iterate (multiple rounds)?"

Yes, up to 2 rounds.

Round 1 gives +10pp. Round 2 gives +14.5pp total. Round 3 regresses. The model's improved baseline after round 1 means fewer fail-then-succeed examples are available, and by round 3 you are fitting noise.

Round 1: +10.0pp
Round 2: +14.5pp (cumulative)
Round 3: regresses (stop here)

Practical advice: stop at 2 rounds.
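The round schedule can be sketched as a bounded loop. This is a hypothetical harness (function names, the dataset-size threshold, and the `collect_fn`/`finetune_fn` interfaces are assumptions, not from the text); it encodes the two stopping conditions implied above: a hard cap of 2 rounds, and an early stop when too few fail-then-succeed examples remain:

```python
MAX_ROUNDS = 2  # practical advice from the text: stop at 2 rounds

def iterate_pi_distillation(collect_fn, finetune_fn, model,
                            max_rounds=MAX_ROUNDS, min_examples=100):
    """Run retry-based PI distillation for a bounded number of rounds.

    collect_fn(model)       -> list of fail-then-succeed transcripts
    finetune_fn(model, data)-> updated model
    min_examples is an illustrative threshold: as the baseline improves,
    fewer retry-successes are collectable, and tiny datasets fit noise.
    """
    for _ in range(max_rounds):
        data = collect_fn(model)
        if len(data) < min_examples:
            break  # not enough signal left to train on
        model = finetune_fn(model, data)
    return model
```

The shrinking-dataset check is why round 3 regresses in practice: the improved model fails less often, so the retry-success pool collapses toward noise.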

Mechanism

"Is K=1 retry sufficient?"

K=3 is optimal.

K=1 gives +8.8pp. K=3 gives +10.1pp. K=5 overfits (too many retries dilute data quality with lucky guesses). The sweet spot is 3 retry attempts per problem.

K=1: +8.8pp
K=3: +10.1pp (optimal)
K=5: overfits
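The K parameter slots into the collection loop as follows. A minimal sketch, assuming an `attempt_fn(problem, history)` stub that returns `(solution, is_correct)` (the name and signature are hypothetical); transcripts are kept only when an initial failure is followed by a success within K retries:

```python
def collect_with_k_retries(problem, attempt_fn, k=3):
    """Allow up to k retries after an initial failure.

    Keep the transcript only if some retry succeeds (fail -> ... -> succeed).
    k=3 is the sweet spot quoted above; larger k admits lucky guesses that
    dilute data quality.
    """
    sol, ok = attempt_fn(problem, history=())
    if ok:
        return None  # first-try success: no retry signal to learn from
    history = [sol]
    for _ in range(k):
        sol, ok = attempt_fn(problem, history=tuple(history))
        history.append(sol)
        if ok:
            return history  # fail-then-succeed transcript for training
    return None  # never succeeded within k retries: discard
```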

"Does the prompt content matter?"

No. The content of the retry prompt is irrelevant. Only the binary pass/fail signal matters.

We tested: gold answer in prompt, gibberish, a wrong answer, and a bare "try again." All produce statistically identical gains (within noise). The model is not extracting information from the retry prompt; it is learning from the structure of the interaction (fail, signal, succeed).

Gold answer in prompt: same gain
Gibberish prompt: same gain
Wrong answer prompt: same gain
Bare "try again": same gain
Conclusion: only binary pass/fail matters
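The ablation is cheap to express in the collection code, because the prompt variant never touches the keep/discard decision. A sketch (the prompt strings are illustrative paraphrases of the four conditions, not the exact prompts used):

```python
# Four retry-prompt variants from the ablation; `{gold}` / `{wrong}` mark
# where an answer string would be interpolated (placeholders, not filled).
RETRY_PROMPTS = {
    "gold_answer":  "That was wrong. The answer is {gold}. Try again.",
    "gibberish":    "xq zvlt prn qwe. Try again.",
    "wrong_answer": "That was wrong. The answer is {wrong}. Try again.",
    "bare":         "Try again.",
}

def keep_transcript(first_ok: bool, retry_ok: bool) -> bool:
    """Keep only fail-then-succeed traces, regardless of prompt variant.
    The binary pass/fail signal is the only input to this decision."""
    return (not first_ok) and retry_ok
```

Since `keep_transcript` is the same function for every variant, identical gains across prompts are exactly what the fail/signal/succeed explanation predicts.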

"What about using a teacher model?"

Self-retry dominates cross-model retry.

Using an 8B model to generate retries for 1.7B gives only +4.2pp, versus self-retry at +8.8pp. The self-correction patterns are model-specific. A larger teacher's correction style does not transfer well to the student's representational space.

Self-retry (1.7B for 1.7B): +8.8pp
Cross-model (8B for 1.7B): +4.2pp
Self-correction is model-specific

Limitations

"Why doesn't it work on Lean theorem proving?"

Two compounding issues:

1. Low solution diversity. Mathematical proofs are structurally constrained. Unlike natural-language math solutions (where rephrasing is easy), Lean proofs for the same theorem tend to follow the same structure. The model has fewer distinct "angles of attack" to find after failure.

2. Heuristic verifier noise. However, even with the Lean type-checker as a noise-free ground-truth verifier, results are still -1.2pp, so the structural constraint in (1), not verification quality, is the binding issue.

Lean with type-checker verification: -1.2pp
Solution: structural PI (skeleton-based) instead of retry-based

The path forward for formal math is skeleton-based PI distillation, where the structural signal replaces the retry signal.