Common questions (and skeptical ones) about PI distillation via retry
Validity
Are the gains just test-set contamination?
No. Training and evaluation use completely disjoint datasets.
Training data: NuminaMath-CoT-10k (10k problems). Evaluation: MATH-500 (held-out, zero overlap).
The model never sees MATH-500 problems during training. All reported gains (+10pp on 1.7B) reflect generalization to unseen problems, not memorization of training data.
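Zero overlap is easy to verify mechanically. A minimal sketch of such a check (the normalization and the list inputs are simplistic assumptions, not the paper's dedup pipeline):

```python
def assert_disjoint(train_problems: list[str], eval_problems: list[str]) -> None:
    """Sanity-check the split: no evaluation problem may appear in training.
    Whitespace/case normalization is a crude stand-in for real deduplication."""
    train = {" ".join(p.lower().split()) for p in train_problems}
    evals = {" ".join(p.lower().split()) for p in eval_problems}
    leaked = train & evals
    assert not leaked, f"{len(leaked)} evaluation problems leaked into training"
```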
Isn't this just rejection sampling?
No. Rejection sampling (best-of-N on first attempts) gives only +1.4pp; retry gives +10pp. The crucial difference is the failure signal.
In rejection sampling, you generate N independent attempts and keep the best. The model never learns from failure. In retry-based PI distillation, the model first fails, receives a "try again" signal, and then succeeds. Training on these retry-successes teaches the model self-correction behavior that transfers to single-shot evaluation.
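A minimal sketch of the two data-collection loops. The helpers generate(model, prompt) -> str and is_correct(answer, gold) -> bool are assumptions supplied by the caller, not names from the paper:

```python
from typing import Callable

def rejection_sampling(generate: Callable, is_correct: Callable,
                       model, problem: str, gold: str, n: int = 8) -> list:
    """Best-of-N on independent first attempts: failed attempts are simply
    discarded, so no failure signal ever reaches training (+1.4pp)."""
    attempts = [generate(model, problem) for _ in range(n)]
    correct = [a for a in attempts if is_correct(a, gold)]
    return [(problem, correct[0])] if correct else []

def retry_collection(generate: Callable, is_correct: Callable,
                     model, problem: str, gold: str,
                     retry_prompt: str = "Try again.") -> list:
    """Keep fail -> signal -> succeed transcripts as training examples (+10pp)."""
    first = generate(model, problem)
    if is_correct(first, gold):
        return []  # no failure, nothing to distill from this problem
    second = generate(model, problem + "\n" + first + "\n" + retry_prompt)
    if is_correct(second, gold):
        return [(problem, first, retry_prompt, second)]
    return []
```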
Scale
Does the effect hold at larger model scales?
Yes, with caveats.
At 8B with full fine-tuning: +9.6pp (strong). With LoRA at 8B: only +2.8pp (LoRA is insufficient for internalizing retry behavior at this scale).
At 32B, MATH-500 becomes too easy (68% baseline), so the model sits in the right tail of its capability distribution: there are not enough problems where it fails and then succeeds to generate a meaningful training signal.
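For concreteness, a minimal sketch of the two fine-tuning regimes using Hugging Face transformers and peft; the checkpoint name, rank, and target modules below are placeholders, not the paper's settings:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

BASE = "placeholder/8b-base-model"  # hypothetical checkpoint name

# Full fine-tuning (+9.6pp): every parameter is trainable.
full_ft_model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA (+2.8pp): only low-rank adapters train; rank/targets are assumptions.
lora_cfg = LoraConfig(r=16, lora_alpha=32,
                      target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")
lora_model = get_peft_model(AutoModelForCausalLM.from_pretrained(BASE), lora_cfg)
lora_model.print_trainable_parameters()  # a small fraction of total params
```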
Can the procedure be applied iteratively?
Yes, up to 2 rounds.
Round 1 gives +10pp. Round 2 gives +14.5pp total. Round 3 regresses. The model's improved baseline after round 1 means fewer fail-then-succeed examples are available, and by round 3 you are fitting noise.
Practical advice: stop at 2 rounds.
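The iteration itself is a short outer loop. A sketch assuming hypothetical collect_retry_data and fine_tune helpers (neither is the paper's code):

```python
def iterate_rounds(collect_retry_data, fine_tune, model, problems,
                   max_rounds: int = 2):
    """Outer loop for multi-round PI distillation. max_rounds=2 reflects
    the ablation above: round 3 regresses."""
    for _ in range(max_rounds):
        data = collect_retry_data(model, problems)  # fail-then-succeed transcripts
        if not data:
            break  # a higher baseline leaves too few fail-then-succeed examples
        model = fine_tune(model, data)
    return model
```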
Mechanism
How many retry attempts (K) should each problem get?
K=3 is optimal.
K=1 gives +8.8pp; K=3 gives +10.1pp; K=5 overfits, because with more retries a growing share of eventual successes are lucky guesses that dilute data quality. The sweet spot is 3 retry attempts per problem.
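In code, K simply bounds the retry loop. A sketch reusing the hypothetical generate / is_correct helpers from the rejection-sampling answer above:

```python
def collect_with_k_retries(generate, is_correct, model, problem, gold,
                           k: int = 3, retry_prompt: str = "Try again."):
    """Allow up to k retries after a failed first attempt, stopping at the
    first success. k=3 balances coverage against lucky-guess noise."""
    context = problem
    first = generate(model, context)
    if is_correct(first, gold):
        return None  # solved on the first try: no failure signal to learn from
    context += "\n" + first
    for _ in range(k):
        context += "\n" + retry_prompt
        attempt = generate(model, context)
        context += "\n" + attempt
        if is_correct(attempt, gold):
            return (problem, context)  # fail-then-succeed transcript
    return None  # never succeeded within k retries: discard
```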
Does the content of the retry prompt matter?
No. The content of the retry prompt is irrelevant. Only the binary pass/fail signal matters.
We tested: gold answer in prompt, gibberish, a wrong answer, and a bare "try again." All produce statistically identical gains (within noise). The model is not extracting information from the retry prompt; it is learning from the structure of the interaction (fail, signal, succeed).
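The ablation amounts to swapping the retry string. A sketch of the four conditions (the exact wordings here are assumptions):

```python
def retry_prompt_variants(gold: str, wrong: str) -> dict[str, str]:
    """The four retry-prompt conditions tested; all four produce statistically
    identical gains, so only the pass/fail structure of the interaction
    matters. Wordings below are illustrative assumptions."""
    return {
        "gold_answer":  f"Incorrect. The correct answer is {gold}. Try again.",
        "gibberish":    "Zxq vlorp trentle. Try again.",
        "wrong_answer": f"Incorrect. The correct answer is {wrong}. Try again.",
        "bare":         "Try again.",
    }
```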
Could a larger model generate the retries instead?
No. Self-retry dominates cross-model retry.
Using an 8B model to generate retries for 1.7B gives only +4.2pp, versus self-retry at +8.8pp. The self-correction patterns are model-specific. A larger teacher's correction style does not transfer well to the student's representational space.
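Cross-model retry swaps only the model that produces the retry attempt. A sketch on top of the same hypothetical helpers as the earlier sketches:

```python
def cross_model_retry(generate, is_correct, student, teacher,
                      problem, gold, retry_prompt="Try again."):
    """Teacher (e.g., 8B) writes the retry for the student's (1.7B) failure;
    the transcript still trains the student. This gives only +4.2pp vs
    +8.8pp for self-retry, since the teacher's correction style is
    off-distribution for the student."""
    first = generate(student, problem)
    if is_correct(first, gold):
        return None
    second = generate(teacher, problem + "\n" + first + "\n" + retry_prompt)
    if is_correct(second, gold):
        return (problem, first, retry_prompt, second)
    return None
```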
Limitations
Why does this fail on formal math in Lean?
Two compounding issues:
1. Low solution diversity. Mathematical proofs are structurally constrained. Unlike natural-language math solutions (where rephrasing is easy), Lean proofs of the same theorem tend to follow the same structure, so the model has fewer distinct "angles of attack" to find after failure (see the Lean sketch after this list).
2. Heuristic verifier noise. The heuristic verifier adds label noise, but even with the Lean type-checker as ground truth, results are -1.2pp. The structural constraint, not verification quality, is the binding issue.
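To make the diversity point concrete, here is a toy Lean 4 illustration (our own example, not from any training set): two superficially different proofs of the same lemma collapse to near-identical one-step skeletons, leaving a failed model little to vary on retry.

```lean
-- Toy lemma: two "different" proofs share essentially the same skeleton,
-- both reducing to a single appeal to commutativity.
example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b

example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]
```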
The path forward for formal math is skeleton-based PI distillation, where the structural signal replaces the retry signal.