OPSD 5-Seed Null Result

The experiment that killed OPSD. Five seeds of answer-only OPSD average +0.04pp. The original +5.6pp result was a lucky seed. Standard PI distillation has NO reliable effect on math reasoning.

Result: +0.04 +/- 0.9 pp (5 seeds, NULL)
Tags: MATH, NULL RESULT, PAPER-CRITICAL

Hypothesis

OPSD (On-Policy Self-Distillation) distills knowledge from a teacher that has access to privileged information (gold answers). The teacher generates solutions conditioned on PI; the student learns to mimic these solutions without seeing PI. Our initial single-seed result showed +5.6pp, suggesting PI distillation works.

Purpose of this experiment: Verify the +5.6pp result across multiple seeds to confirm reproducibility.

Devastating outcome: Mean of +0.04pp across 5 seeds. The original was a statistical fluke.

Method

OPSD (On-Policy Self-Distillation):

  1. Teacher generation: The student model generates solutions conditioned on the gold answer ("The answer is [X]. Generate a solution.")
  2. KL distillation: The student is trained to match the teacher's output distribution (the PI-conditioned generation) via KL divergence loss, WITHOUT seeing the PI at input time.
  3. On-policy refresh: Every N steps, regenerate teacher outputs from the current student to keep the distillation target fresh.

This is the standard PI distillation framework from our project, applied with the strongest PI form (gold answer).
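The KL objective in step 2 can be illustrated with a toy next-token distribution. This is a minimal sketch: the vocabulary, both distributions, and the single-position loss are hypothetical, and the real objective sums the forward KL over every position of the teacher's generation.

```python
import math

def kl_div(p_teacher, q_student):
    """Forward KL(teacher || student) at one token position."""
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, q_student) if p > 0)

# Hypothetical next-token distributions over a 4-token vocab.
# The teacher saw "The answer is [X]." in its prompt; the student did not.
teacher = [0.70, 0.15, 0.10, 0.05]   # PI-conditioned: sharper on the correct step
student = [0.40, 0.25, 0.20, 0.15]   # unconditioned student at the same position

loss = kl_div(teacher, student)
print(f"KL at this position: {loss:.3f}")  # -> KL at this position: 0.191
```

Note how small the per-position signal is even with a deliberately exaggerated teacher shift; in practice the PI-conditioned shift is subtler still, which is part of the failure analysis below.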

Configuration

| Setting | Value |
|---|---|
| Model | Qwen3-1.7B |
| Dataset | NuminaMath-inline-10k |
| Eval benchmark | MATH-500 (pass@1) |
| PI form | answer_only |
| Training steps | 200 |
| Learning rate | 2e-5 |
| LoRA rank | 16 |
| Seeds | 43, 44, 45, 46, 47 |
| Objective | KL distillation |
| Hardware | 1x H200 (p5en.48xl) |
| Runtime | ~1.5h per seed |

Results

| Seed | Baseline | Post-training | Delta |
|---|---|---|---|
| 43 | 41.4% | 40.0% | -1.4pp |
| 44 | 41.4% | 41.8% | +0.4pp |
| 45 | 41.4% | 41.6% | +0.2pp |
| 46 | 41.4% | 41.4% | +0.0pp |
| 47 | 41.4% | 42.4% | +1.0pp |
| Mean | 41.4% | 41.4% | +0.04 +/- 0.9pp |
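The headline numbers follow directly from the per-seed deltas (stdlib only; the +/- is the sample standard deviation, and the z-score measures how far outside this distribution the original +5.6pp sits):

```python
from statistics import mean, stdev

deltas = [-1.4, 0.4, 0.2, 0.0, 1.0]  # per-seed pass@1 deltas, in pp

mu = mean(deltas)        # +0.04 pp
sd = stdev(deltas)       # sample std, ~0.89 pp
z = (5.6 - mu) / sd      # distance of the original single-seed result

print(f"mean={mu:+.2f}pp  sd={sd:.2f}pp  z={z:.1f}")
```

With n=5 the standard deviation itself is noisy, but no plausible error bar puts +5.6pp inside this distribution.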

OPSD is dead for math. Five seeds span -1.4 to +1.0pp, centering on zero. The original +5.6pp (seed ~42) sits roughly 6 standard deviations above this mean: a fluke, not an effect. KL-based PI distillation does not reliably transfer privileged information from teacher to student in the math domain. Retry (+8.8pp) is 220x more effective than OPSD's +0.04pp true effect.

Why OPSD Fails

  1. KL distillation is too indirect: The teacher's distribution shift from seeing the gold answer is subtle (it favors correct reasoning paths). The KL loss measures output-level similarity, which cannot capture the strategic difference.
  2. On-policy mismatch: As the student improves, the teacher's PI-conditioned outputs become less informative (the student is already close). The signal vanishes.
  3. High variance by construction: Small changes in initialization or data ordering change which problems benefit from OPSD, leading to high seed variance around zero.

In contrast, retry works because it (a) generates training examples directly, (b) uses binary verification (not soft KL matching), and (c) naturally targets the frontier.
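The contrast with retry can be made concrete with a minimal STaR-style round. This is a sketch: `generate` and `verify` are stubs standing in for model sampling and answer checking, and the function name and `k` parameter are illustrative, not the project's actual API.

```python
def retry_round(problems, generate, verify, k=4):
    """One 'try again' round: sample up to k attempts per problem and
    keep only verifier-passing (problem, solution) pairs for SFT."""
    sft_pairs = []
    for prob, gold in problems:
        for _ in range(k):
            sol = generate(prob)
            if verify(sol, gold):          # binary check, not soft KL matching
                sft_pairs.append((prob, sol))
                break                       # one verified solution is enough here
    return sft_pairs

# Toy usage: the stub model only ever answers "4", so only the
# problem whose gold answer is "4" yields a training pair.
problems = [("2+2?", "4"), ("3*3?", "9")]
pairs = retry_round(problems, generate=lambda p: "4", verify=lambda s, g: s == g)
print(pairs)  # -> [('2+2?', '4')]
```

Problems the model always solves contribute on the first attempt and problems it never solves contribute nothing, so the retries concentrate new training data on the frontier, matching point (c) above.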

Training Curves

Stored at: /data/ughai-sandbox/opsd_experiments/opsd_5seed_variance/. All 5 runs show loss decreasing normally (the model IS learning something), but the eval metric does not improve. The model learns to match the teacher's generation style without learning to solve harder problems.

Interpretation

This is the pivotal experiment that redirected the entire research program:

The +5.6pp original was published in earlier internal reports. This 5-seed replication is the correction that led to the "retry is all you need" discovery.

Connection to Other Experiments

Original OPSD +5.6pp (seed ~42) - the lucky seed
The initial result that started the project. Now known to be a statistical outlier, not a reproducible finding.
STaR "Try Again" (+8.8pp) - the replacement
Retry gives 220x more gain than OPSD's true effect. The mechanism matters more than the method.
OPSD PCCG (+3.6pp) - partial rescue
OPSD restricted to frontier problems gives +3.6pp (single seed). The issue may be data dilution, not the distillation framework itself. But still unreliable.
Lean Have-Skeleton OPSD (+3.3pp) - domain exception
OPSD works for Lean because structural PI provides information the model genuinely cannot discover by retry alone. The null is math-specific.
OPSD Random-PI (+1.0pp) - control
Wrong answers as PI in OPSD give +1.0pp. Since correct-PI OPSD averages +0.04pp, even the "distillation mechanics" contribute nearly nothing.