OPSD 5-Seed Null Result
The experiment that killed OPSD. Five seeds of answer-only OPSD average +0.04pp. The original +5.6pp result was a lucky seed. Standard PI distillation has NO reliable effect on math reasoning.
+0.04 +/- 0.9 pp (5 seeds, NULL)
MATH
NULL RESULT
PAPER-CRITICAL
Hypothesis
OPSD (On-Policy Self-Distillation) distills knowledge from a teacher that has access to privileged information (gold answers). The teacher generates solutions conditioned on PI; the student learns to mimic these solutions without seeing PI. Our initial single-seed result showed +5.6pp, suggesting PI distillation works.
Purpose of this experiment: Verify the +5.6pp result across multiple seeds to confirm reproducibility.
Devastating outcome: Mean of +0.04pp across 5 seeds. The original was a statistical fluke.
Method
OPSD (On-Policy Self-Distillation):
- Teacher generation: The student model generates solutions conditioned on the gold answer ("The answer is [X]. Generate a solution.")
- KL distillation: The student is trained to match the teacher's output distribution (the PI-conditioned generation) via KL divergence loss, WITHOUT seeing the PI at input time.
- On-policy refresh: Every N steps, regenerate teacher outputs from the current student to keep the distillation target fresh.
This is the standard PI distillation framework from our project, applied with the strongest PI form (gold answer).
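The PI-conditioned prompt and the KL objective above can be sketched in plain Python. This is a minimal, hypothetical illustration: `teacher_prompt`, `softmax`, and the toy logit lists are assumptions for exposition, not the actual training code, and the real loss operates over full model vocabularies rather than tiny distributions.

```python
import math

def teacher_prompt(problem: str, gold_answer: str) -> str:
    # Hypothetical answer_only PI form: the teacher sees the gold answer
    # before generating a solution; the student input never contains it.
    return f"The answer is {gold_answer}. Generate a solution.\n\n{problem}"

def softmax(logits):
    # Convert raw logits to a probability distribution (stable via max-shift).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q): how far the student's next-token distribution q is from
    # the teacher's PI-conditioned distribution p.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(teacher_logits, student_logits):
    # Token-level KL summed over sequence positions. The teacher logits come
    # from the PI-conditioned pass; the student logits from the PI-free pass.
    return sum(
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    )
```

The on-policy refresh step simply re-runs the teacher pass with the current student weights every N steps, so the `teacher_logits` target tracks the student's own distribution.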
Configuration
| Parameter | Value |
| --- | --- |
| Model | Qwen3-1.7B |
| Dataset | NuminaMath-inline-10k |
| Eval benchmark | MATH-500 (pass@1) |
| PI form | answer_only |
| Training steps | 200 |
| Learning rate | 2e-5 |
| LoRA rank | 16 |
| Seeds | 43, 44, 45, 46, 47 |
| Objective | KL distillation |
| Hardware | 1x H200 (p5en.48xl) |
| Runtime | ~1.5h per seed |
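For reference, the configuration can be written as a plain dict. The key names here are illustrative, not the actual experiment-runner schema:

```python
# Hypothetical config mirroring the table above (field names are assumptions).
OPSD_CONFIG = {
    "model": "Qwen3-1.7B",
    "dataset": "NuminaMath-inline-10k",
    "eval_benchmark": "MATH-500",  # pass@1
    "pi_form": "answer_only",
    "training_steps": 200,
    "learning_rate": 2e-5,
    "lora_rank": 16,
    "seeds": [43, 44, 45, 46, 47],
    "objective": "kl_distillation",
}
```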
Results
| Seed | Baseline | Post-training | Delta |
| --- | --- | --- | --- |
| 43 | 41.4% | 40.0% | -1.4pp |
| 44 | 41.4% | 41.8% | +0.4pp |
| 45 | 41.4% | 41.6% | +0.2pp |
| 46 | 41.4% | 41.4% | +0.0pp |
| 47 | 41.4% | 42.4% | +1.0pp |
| Mean | 41.4% | 41.4% | +0.04 +/- 0.9pp |
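The headline numbers follow directly from the per-seed deltas in the table. A quick check with the standard library (the one-sample t statistic is an addition here, not from the original report):

```python
from math import sqrt
from statistics import mean, stdev

# Per-seed deltas (pp) from the table above.
deltas = [-1.4, 0.4, 0.2, 0.0, 1.0]

mu = mean(deltas)       # +0.04 pp
sigma = stdev(deltas)   # ~0.89 pp (sample std, n-1), reported as 0.9
# One-sample t statistic against zero effect: ~0.1, nowhere near significance.
t_stat = mu / (sigma / sqrt(len(deltas)))
```

With |t| around 0.1 on 4 degrees of freedom, the data are fully consistent with a true effect of zero.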
OPSD is dead for math. Five seeds span -1.4 to +1.0pp, centering on zero. The original +5.6pp (seed ~42) sits roughly six standard deviations above the 5-seed mean (5.6 vs. +0.04 +/- 0.9pp), a clear outlier. KL-based PI distillation does not reliably transfer privileged information from teacher to student in the math domain. Retry (+8.8pp) is 220x larger than OPSD's measured mean effect.
Why OPSD Fails
- KL distillation is too indirect: The teacher's distribution shift from seeing the gold answer is subtle (it favors correct reasoning paths). The KL loss measures token-level output similarity, which cannot capture this strategic difference.
- On-policy mismatch: As the student improves, the teacher's PI-conditioned outputs become less informative (the student is already close). The signal vanishes.
- High variance by construction: Small changes in initialization or data ordering change which problems benefit from OPSD, leading to high seed variance around zero.
In contrast, retry works because it (a) generates training examples directly, (b) uses binary verification (not soft KL matching), and (c) naturally targets the frontier.
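The three properties of retry listed above can be made concrete with a minimal STaR-style loop. This is a hypothetical sketch assuming simple `generate`/`verify` callables, not the actual training pipeline:

```python
def star_retry_round(problems, generate, verify, attempts=4):
    """One round of a STaR-style 'try again' loop (illustrative sketch):
    sample solutions per problem and keep only verified-correct ones as
    training examples. Contrast with OPSD: examples are (a) generated
    directly, (b) filtered by binary verification rather than soft KL
    matching, and (c) never-solved problems contribute nothing while
    always-solved ones saturate quickly, so data concentrates on the
    frontier."""
    training_examples = []
    for problem, gold in problems:
        for _ in range(attempts):
            solution, answer = generate(problem)
            if verify(answer, gold):  # binary check, no distribution matching
                training_examples.append((problem, solution))
                break
    return training_examples

# Toy usage with a stub "model" that answers arithmetic problems:
probs = [("1+1", "2"), ("2+3", "5")]
gen = lambda p: (f"solution for {p}", str(eval(p)))
ver = lambda a, g: a == g
examples = star_retry_round(probs, gen, ver)  # two verified pairs
```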
Training Curves
Stored at: /data/ughai-sandbox/opsd_experiments/opsd_5seed_variance/. All 5 runs show loss decreasing normally (the model IS learning something), but the eval metric does not improve. The model learns to match the teacher's generation style without learning to solve harder problems.
Interpretation
This is the pivotal experiment that redirected the entire research program:
- Before this result: We believed OPSD worked (+5.6pp) and were optimizing PI forms (hint specificity, answer vs. strategy, etc.).
- After this result: OPSD is a null result. The entire PI distillation framework is unreliable for math. We pivoted to understanding WHY retry methods work instead.
The +5.6pp original was published in earlier internal reports. This 5-seed replication is the correction that led to the "retry is all you need" discovery.
Connection to Other Experiments
Original OPSD +5.6pp (seed ~42) - the lucky seed
The initial result that started the project. Now known to be a statistical outlier, not a reproducible finding.
STaR "Try Again" (+8.8pp) - the replacement
Retry gives 220x more gain than OPSD's true effect. The mechanism matters more than the method.
OPSD PCCG (+3.6pp) - partial rescue
OPSD restricted to frontier problems gives +3.6pp (single seed). The issue may be data dilution, not the distillation framework itself. But still unreliable.
Lean Have-Skeleton OPSD (+3.3pp) - domain exception
OPSD works for Lean because structural PI provides information the model genuinely cannot discover by retry alone. The null is math-specific.
OPSD Random-PI (+1.0pp) - control
Wrong answers as PI in OPSD give +1.0pp. Since correct-PI OPSD averages +0.04pp, even the "distillation mechanics" contribute nearly nothing.