Frequently Asked Questions

Common questions (and skeptical ones) about PI distillation via retry

Validity

"Doesn't the model just memorize the training problems?"

No. Training and evaluation use completely disjoint datasets.

Training: NuminaMath-CoT-10k (10k problems)
Evaluation: MATH-500 (held-out, zero overlap)

The model never sees MATH-500 problems during training. All reported gains (+10pp on 1.7B) reflect generalization to unseen problems, not memorization of training data.
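The disjointness claim is easy to check mechanically. Below is a minimal sketch (the dataset-loading step is omitted; `normalize` and `find_overlap` are illustrative helpers, not names from the paper) that compares normalized problem texts across the two splits:

```python
def normalize(problem: str) -> str:
    """Collapse whitespace and lowercase so trivial variants are caught."""
    return " ".join(problem.lower().split())

def find_overlap(train_problems, eval_problems):
    """Return problems appearing in both splits (empty set = disjoint)."""
    return {normalize(p) for p in train_problems} & {normalize(p) for p in eval_problems}

# Toy stand-ins for NuminaMath-CoT-10k and MATH-500 problem texts.
train = ["Compute 2 + 2.", "Solve x^2 = 9."]
held_out = ["Find the roots of x^2 - 5x + 6."]
assert find_overlap(train, held_out) == set()
```

Exact string matching is the weakest form of decontamination; n-gram or embedding-based checks catch paraphrased duplicates, but the zero-overlap claim here is about distinct source datasets.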

"How is this different from rejection sampling?"

Rejection sampling (best-of-N on first attempts) only gives +1.4pp. Retry gives +10pp. The crucial difference is the failure signal.

In rejection sampling, you generate N independent attempts and keep the best. The model never learns from failure. In retry-based PI distillation, the model first fails, receives a "try again" signal, and then succeeds. Training on these retry-successes teaches the model self-correction behavior that transfers to single-shot evaluation.

Rejection sampling (best-of-N): +1.4pp
Retry-based PI distillation: +10.0pp
Gap: retry is ~7x more effective
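The structural difference between the two data-collection loops can be sketched as below. `sample_solution` is a hypothetical stub standing in for a model call plus answer verification (the success probabilities are illustrative, not measured); the key contrast is that only the retry loop ever puts a failure and a "try again" signal into the training transcript:

```python
import random

def sample_solution(problem, history=()):
    """Hypothetical stub: one model attempt, returns (solution, is_correct).
    Toy probabilities only; a retry signal in the history raises the odds."""
    p_correct = 0.6 if history else 0.3
    return ("...worked solution...", random.random() < p_correct)

def best_of_n(problem, n=4):
    """Rejection sampling: N independent attempts, keep a correct one.
    No failure signal ever enters the kept data."""
    for _ in range(n):
        sol, ok = sample_solution(problem)
        if ok:
            return sol
    return None

def retry_transcript(problem):
    """Retry-based collection: keep fail -> 'try again' -> success traces."""
    first, ok = sample_solution(problem)
    if ok:
        return None  # first-try successes carry no retry signal
    retry, ok = sample_solution(problem, history=(first, "try again"))
    return (first, "try again", retry) if ok else None
```

Training on `retry_transcript` outputs exposes the model to its own failures plus the correction, which is what transfers to single-shot evaluation.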

Scale

"Does this work at larger scales?"

Yes, with caveats.

At 8B with full fine-tuning: +9.6pp (strong). With LoRA at 8B: only +2.8pp (LoRA is insufficient for internalizing retry behavior at this scale).

At 32B, MATH-500 is already too easy (68% baseline), so the model sits in the right tail of its capability distribution: there are too few problems where it fails and then succeeds to generate meaningful training signal.

1.7B full FT: +10.0pp
8B full FT: +9.6pp
8B LoRA: +2.8pp (insufficient)
32B: ceiling effect (68% baseline)

"Can you iterate (multiple rounds)?"

Yes, up to 2 rounds.

Round 1 gives +10pp. Round 2 gives +14.5pp total. Round 3 regresses. The model's improved baseline after round 1 means fewer fail-then-succeed examples are available, and by round 3 you are fitting noise.

Round 1: +10.0pp
Round 2: +14.5pp (cumulative)
Round 3: regresses (stop here)

Practical advice: stop at 2 rounds.
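The round schedule can be sketched as a bounded loop. This is a hypothetical harness (function names, the dataset-size threshold, and the `collect_fn`/`finetune_fn` interfaces are assumptions, not from the text); it encodes the two stopping conditions implied above: a hard cap of 2 rounds, and an early stop when too few fail-then-succeed examples remain:

```python
MAX_ROUNDS = 2  # practical advice from the text: stop at 2 rounds

def iterate_pi_distillation(collect_fn, finetune_fn, model,
                            max_rounds=MAX_ROUNDS, min_examples=100):
    """Run retry-based PI distillation for a bounded number of rounds.

    collect_fn(model)       -> list of fail-then-succeed transcripts
    finetune_fn(model, data)-> updated model
    min_examples is an illustrative threshold: as the baseline improves,
    fewer retry-successes are collectable, and tiny datasets fit noise.
    """
    for _ in range(max_rounds):
        data = collect_fn(model)
        if len(data) < min_examples:
            break  # not enough signal left to train on
        model = finetune_fn(model, data)
    return model
```

The shrinking-dataset check is why round 3 regresses in practice: the improved model fails less often, so the retry-success pool collapses toward noise.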

Mechanism

"Is K=1 retry sufficient?"

K=3 is optimal.

K=1 gives +8.8pp. K=3 gives +10.1pp. K=5 overfits (too many retries dilute data quality with lucky guesses). The sweet spot is 3 retry attempts per problem.

K=1: +8.8pp
K=3: +10.1pp (optimal)
K=5: overfits
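The K parameter slots into the collection loop as follows. A minimal sketch, assuming an `attempt_fn(problem, history)` stub that returns `(solution, is_correct)` (the name and signature are hypothetical); transcripts are kept only when an initial failure is followed by a success within K retries:

```python
def collect_with_k_retries(problem, attempt_fn, k=3):
    """Allow up to k retries after an initial failure.

    Keep the transcript only if some retry succeeds (fail -> ... -> succeed).
    k=3 is the sweet spot quoted above; larger k admits lucky guesses that
    dilute data quality.
    """
    sol, ok = attempt_fn(problem, history=())
    if ok:
        return None  # first-try success: no retry signal to learn from
    history = [sol]
    for _ in range(k):
        sol, ok = attempt_fn(problem, history=tuple(history))
        history.append(sol)
        if ok:
            return history  # fail-then-succeed transcript for training
    return None  # never succeeded within k retries: discard
```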

"Does the prompt content matter?"

No. The content of the retry prompt is irrelevant. Only the binary pass/fail signal matters.

We tested: gold answer in prompt, gibberish, a wrong answer, and a bare "try again." All produce statistically identical gains (within noise). The model is not extracting information from the retry prompt; it is learning from the structure of the interaction (fail, signal, succeed).

Gold answer in prompt: same gain
Gibberish prompt: same gain
Wrong answer prompt: same gain
Bare "try again": same gain
Conclusion: only binary pass/fail matters
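The ablation is cheap to express in the collection code, because the prompt variant never touches the keep/discard decision. A sketch (the prompt strings are illustrative paraphrases of the four conditions, not the exact prompts used):

```python
# Four retry-prompt variants from the ablation; `{gold}` / `{wrong}` mark
# where an answer string would be interpolated (placeholders, not filled).
RETRY_PROMPTS = {
    "gold_answer":  "That was wrong. The answer is {gold}. Try again.",
    "gibberish":    "xq zvlt prn qwe. Try again.",
    "wrong_answer": "That was wrong. The answer is {wrong}. Try again.",
    "bare":         "Try again.",
}

def keep_transcript(first_ok: bool, retry_ok: bool) -> bool:
    """Keep only fail-then-succeed traces, regardless of prompt variant.
    The binary pass/fail signal is the only input to this decision."""
    return (not first_ok) and retry_ok
```

Since `keep_transcript` is the same function for every variant, identical gains across prompts are exactly what the fail/signal/succeed explanation predicts.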

"What about using a teacher model?"

Self-retry dominates cross-model retry.

Using an 8B model to generate retries for 1.7B gives only +4.2pp, versus self-retry at +8.8pp. The self-correction patterns are model-specific. A larger teacher's correction style does not transfer well to the student's representational space.

Self-retry (1.7B for 1.7B): +8.8pp
Cross-model (8B for 1.7B): +4.2pp
Self-correction is model-specific

Limitations

"Why doesn't it work on Lean theorem proving?"

Two compounding issues:

1. Low solution diversity. Mathematical proofs are structurally constrained. Unlike natural-language math solutions (where rephrasing is easy), Lean proofs for the same theorem tend to follow the same structure. The model has fewer distinct "angles of attack" to find after failure.

2. Heuristic verifier noise. However, even with the Lean type-checker as a noise-free ground-truth verifier, results are still -1.2pp, so the structural constraint in (1), not verification quality, is the binding issue.

Lean with type-checker verification: -1.2pp
Solution: structural PI (skeleton-based) instead of retry-based

The path forward for formal math is skeleton-based PI distillation, where the structural signal replaces the retry signal.