OPSD 5-Seed Null Result

The experiment that killed OPSD. Five seeds of answer-only OPSD average +0.04pp. The original +5.6pp result was a lucky seed. Standard PI distillation has NO reliable effect on math reasoning.

Result: +0.04 +/- 0.9 pp (5 seeds, NULL)
Tags: MATH, NULL RESULT, PAPER-CRITICAL

Hypothesis

OPSD (On-Policy Self-Distillation) distills knowledge from a teacher that has access to privileged information (gold answers). The teacher generates solutions conditioned on PI; the student learns to mimic these solutions without seeing PI. Our initial single-seed result showed +5.6pp, suggesting PI distillation works.

Purpose of this experiment: Verify the +5.6pp result across multiple seeds to confirm reproducibility.

Devastating outcome: Mean of +0.04pp across 5 seeds. The original was a statistical fluke.

Method

OPSD (On-Policy Self-Distillation):

  1. Teacher generation: The student model generates solutions conditioned on the gold answer ("The answer is [X]. Generate a solution.")
  2. KL distillation: The student is trained to match the teacher's output distribution (the PI-conditioned generation) via KL divergence loss, WITHOUT seeing the PI at input time.
  3. On-policy refresh: Every N steps, regenerate teacher outputs from the current student to keep the distillation target fresh.

This is the standard PI distillation framework from our project, applied with the strongest PI form (gold answer).
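The KL objective in step 2 can be illustrated with a toy next-token distribution. This is a minimal sketch: the vocabulary, both distributions, and the single-position loss are hypothetical, and the real objective sums the forward KL over every position of the teacher's generation.

```python
import math

def kl_div(p_teacher, q_student):
    """Forward KL(teacher || student) at one token position."""
    return sum(p * math.log(p / q) for p, q in zip(p_teacher, q_student) if p > 0)

# Hypothetical next-token distributions over a 4-token vocab.
# The teacher saw "The answer is [X]." in its prompt; the student did not.
teacher = [0.70, 0.15, 0.10, 0.05]   # PI-conditioned: sharper on the correct step
student = [0.40, 0.25, 0.20, 0.15]   # unconditioned student at the same position

loss = kl_div(teacher, student)
print(f"KL at this position: {loss:.3f}")  # -> KL at this position: 0.191
```

Note how small the per-position signal is even with a deliberately exaggerated teacher shift; in practice the PI-conditioned shift is subtler still, which is part of the failure analysis below.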

Configuration

| Setting | Value |
|---|---|
| Model | Qwen3-1.7B |
| Dataset | NuminaMath-inline-10k |
| Eval benchmark | MATH-500 (pass@1) |
| PI form | answer_only |
| Training steps | 200 |
| Learning rate | 2e-5 |
| LoRA rank | 16 |
| Seeds | 43, 44, 45, 46, 47 |
| Objective | KL distillation |
| Hardware | 1x H200 (p5en.48xl) |
| Runtime | ~1.5h per seed |

Results

| Seed | Baseline | Post-training | Delta |
|---|---|---|---|
| 43 | 41.4% | 40.0% | -1.4pp |
| 44 | 41.4% | 41.8% | +0.4pp |
| 45 | 41.4% | 41.6% | +0.2pp |
| 46 | 41.4% | 41.4% | +0.0pp |
| 47 | 41.4% | 42.4% | +1.0pp |
| Mean | 41.4% | 41.4% | +0.04 +/- 0.9pp |
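The headline numbers follow directly from the per-seed deltas (stdlib only; the +/- is the sample standard deviation, and the z-score measures how far outside this distribution the original +5.6pp sits):

```python
from statistics import mean, stdev

deltas = [-1.4, 0.4, 0.2, 0.0, 1.0]  # per-seed pass@1 deltas, in pp

mu = mean(deltas)        # +0.04 pp
sd = stdev(deltas)       # sample std, ~0.89 pp
z = (5.6 - mu) / sd      # distance of the original single-seed result

print(f"mean={mu:+.2f}pp  sd={sd:.2f}pp  z={z:.1f}")
```

With n=5 the standard deviation itself is noisy, but no plausible error bar puts +5.6pp inside this distribution.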

OPSD is dead for math. Five seeds span -1.4 to +1.0pp, centering on zero. The original +5.6pp (seed ~42) sits roughly 6 standard deviations above this mean: a fluke, not an effect. KL-based PI distillation does not reliably transfer privileged information from teacher to student in the math domain. Retry (+8.8pp) is 220x more effective than OPSD's +0.04pp true effect.

Why OPSD Fails

  1. KL distillation is too indirect: The teacher's distribution shift from seeing the gold answer is subtle (it favors correct reasoning paths). The KL loss measures output-level similarity, which cannot capture the strategic difference.
  2. On-policy mismatch: As the student improves, the teacher's PI-conditioned outputs become less informative (the student is already close). The signal vanishes.
  3. High variance by construction: Small changes in initialization or data ordering change which problems benefit from OPSD, leading to high seed variance around zero.

In contrast, retry works because it (a) generates training examples directly, (b) uses binary verification (not soft KL matching), and (c) naturally targets the frontier.
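The contrast with retry can be made concrete with a minimal STaR-style round. This is a sketch: `generate` and `verify` are stubs standing in for model sampling and answer checking, and the function name and `k` parameter are illustrative, not the project's actual API.

```python
def retry_round(problems, generate, verify, k=4):
    """One 'try again' round: sample up to k attempts per problem and
    keep only verifier-passing (problem, solution) pairs for SFT."""
    sft_pairs = []
    for prob, gold in problems:
        for _ in range(k):
            sol = generate(prob)
            if verify(sol, gold):          # binary check, not soft KL matching
                sft_pairs.append((prob, sol))
                break                       # one verified solution is enough here
    return sft_pairs

# Toy usage: the stub model only ever answers "4", so only the
# problem whose gold answer is "4" yields a training pair.
problems = [("2+2?", "4"), ("3*3?", "9")]
pairs = retry_round(problems, generate=lambda p: "4", verify=lambda s, g: s == g)
print(pairs)  # -> [('2+2?', '4')]
```

Problems the model always solves contribute on the first attempt and problems it never solves contribute nothing, so the retries concentrate new training data on the frontier, matching point (c) above.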

Training Curves

Stored at: /data/ughai-sandbox/opsd_experiments/opsd_5seed_variance/. All 5 runs show loss decreasing normally (the model IS learning something), but the eval metric does not improve. The model learns to match the teacher's generation style without learning to solve harder problems.

Interpretation

This is the pivotal experiment that redirected the entire research program:

The +5.6pp original was published in earlier internal reports. This 5-seed replication is the correction that led to the "retry is all you need" discovery.

Connection to Other Experiments

Original OPSD +5.6pp (seed ~42) - the lucky seed
The initial result that started the project. Now known to be a statistical outlier, not a reproducible finding.
STaR "Try Again" (+8.8pp) - the replacement
Retry gives 220x more gain than OPSD's true effect. The mechanism matters more than the method.
OPSD PCCG (+3.6pp) - partial rescue
OPSD restricted to frontier problems gives +3.6pp (single seed). The issue may be data dilution, not the distillation framework itself. But still unreliable.
Lean Have-Skeleton OPSD (+3.3pp) - domain exception
OPSD works for Lean because structural PI provides information the model genuinely cannot discover by retry alone. The null is math-specific.
OPSD Random-PI (+1.0pp) - control
Wrong answers as PI in OPSD give +1.0pp. Since correct-PI OPSD averages +0.04pp, even the "distillation mechanics" contribute nearly nothing.