Gold-Answer STaR / SD-Zero (3-seed)

The ceiling condition: the model revises with the gold answer visible, producing "rationalized" solutions. Sets the upper bound for all retry-based methods at +12.0pp, with tight reproducibility across seeds.

+12.0 +/- 0.7 pp (3 seeds)
MATH-500, 3-seed confirmed ceiling

Hypothesis

Providing the gold answer during retry should maximize revision success rate (the model knows exactly what to aim for). This is the original STaR mechanism (Zelikman et al., 2022) and what He et al. (2026) call "SD-Zero." It should be the best possible single-retry method.

Expected: Best outcome for retry methods. Sets the ceiling.

Actual: +12.0pp, confirmed as ceiling. But only ~3pp above bare "try again" (+8.8pp).

Method

  1. First attempt: Generate one solution per problem at T=0.7.
  2. Retry with gold answer: For failures, append "The correct answer is [GOLD]. Please solve the problem showing how to arrive at this answer."
  3. Filter: Keep only solutions that correctly derive the gold answer.
  4. SFT: Fine-tune on (originally correct) + (successful rationalizations).

This is "SD-Zero" from He et al. (2026): self-revision conditioned on gold answer, filtered, SFT.
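A minimal sketch of this attempt-retry-filter loop, assuming hypothetical helpers generate (one sample at T=0.7) and extract_final_answer (parses the model's final answer); neither name comes from the experiment code, and pairing the kept revision with the original question rather than the gold-conditioned prompt is an assumption following standard STaR rationalization.

```python
def gold_answer_star_round(problems, generate, extract_final_answer):
    """One SD-Zero data-collection round: attempt, retry with gold answer, filter.

    `problems` is a list of dicts with "question" and "gold" fields;
    `generate` and `extract_final_answer` are assumed helpers (model call at
    T=0.7 and a final-answer parser), named here for illustration only.
    """
    sft_examples = []
    for prob in problems:
        # 1. First attempt at T=0.7.
        first = generate(prob["question"], temperature=0.7)
        if extract_final_answer(first) == prob["gold"]:
            sft_examples.append({"prompt": prob["question"], "completion": first})
            continue

        # 2. Retry with the gold answer appended to the prompt (rationalization).
        retry_prompt = (
            prob["question"]
            + f"\nThe correct answer is {prob['gold']}. "
            "Please solve the problem showing how to arrive at this answer."
        )
        revision = generate(retry_prompt, temperature=0.7)

        # 3. Filter: keep the revision only if it actually derives the gold answer.
        if extract_final_answer(revision) == prob["gold"]:
            # Assumption: train on the original question, not the gold-conditioned prompt.
            sft_examples.append({"prompt": prob["question"], "completion": revision})

    # 4. The returned examples feed the LoRA SFT stage.
    return sft_examples
```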

Configuration

Model: Qwen3-1.7B
Dataset: NuminaMath-CoT-10k
Eval benchmark: MATH-500 (pass@1)
Training steps: 500
Learning rate: 2e-5
LoRA rank: 16
Seeds: 42, 123, 456
PI content: Gold answer (correct)
Revision success: ~60%
Hardware: 1x H200 (p5en.48xl)
Runtime: ~2.5h per seed
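A rough sketch of how this configuration might map onto a PEFT LoRA fine-tuning setup; the Hugging Face model identifier, target modules, lora_alpha, and batch size are assumptions, not values taken from the experiment scripts.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_name = "Qwen/Qwen3-1.7B"  # assumed Hugging Face identifier for the 1.7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA rank 16 as in the configuration table; alpha and target modules are guesses.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,                        # not reported; common 2*r choice
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# 500 steps at lr 2e-5 on one H200, matching the configuration table.
training_args = TrainingArguments(
    output_dir="sd_zero_gold",
    max_steps=500,
    learning_rate=2e-5,
    per_device_train_batch_size=8,        # not reported; illustrative
    bf16=True,
    logging_steps=50,
)
# training_args and the filtered SFT examples would then feed a standard
# Trainer / SFTTrainer loop (omitted here).
```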

Results

Seed   Baseline   Post-training   Delta
42     40.8%      52.9%           +12.1pp
123    40.8%      52.1%           +11.3pp
456    40.8%      53.5%           +12.7pp
Mean   40.8%      52.8%           +12.0 +/- 0.7pp
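The reported mean and spread can be reproduced from the per-seed deltas, assuming the +/- 0.7 is the sample standard deviation across seeds:

```python
import numpy as np

deltas = np.array([12.1, 11.3, 12.7])   # per-seed deltas in pp
print(round(deltas.mean(), 1))          # 12.0
print(round(deltas.std(ddof=1), 1))     # 0.7 (sample std across the 3 seeds)
```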

Decomposing the +12.0pp

Component                        Contribution   Evidence
Failure signal (retry trigger)   +8.8pp (73%)   Bare "try again" gives +8.8pp
Higher success rate from gold    +3.2pp (27%)   Gap: 12.0 - 8.8 = 3.2pp
Information guiding reasoning    ~0pp           Wrong PI gives +9.8pp (noise band)

Ceiling interpretation: The gold answer raises retry success from ~49% to ~60%, producing ~3,900 correct revisions instead of ~3,200 under bare retry. The extra training data accounts for the ~3pp gap over bare retry. Information content contributes through data volume, not reasoning quality.
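Back-of-the-envelope arithmetic behind the data-volume claim; the ~6,500 failure count is inferred from the reported yields (~3,900 / 0.60 and ~3,200 / 0.49), not stated directly in the write-up.

```python
# Numbers implied by the text: ~6,500 failed first attempts (inferred),
# ~60% revision success with the gold answer vs ~49% for bare "try again".
failures = 6500
gold_success, bare_success = 0.60, 0.49

gold_examples = failures * gold_success    # ~3,900 rationalized solutions
bare_examples = failures * bare_success    # ~3,200 bare-retry solutions
print(gold_examples, bare_examples, gold_examples - bare_examples)  # 3900.0 3185.0 715.0
```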

Training Curves

Stored at: /data/ughai-sandbox/opsd_experiments/sd_zero_gold/. Training converges faster than with bare retry because the dataset contains more examples. Validation loss is slightly better at step 200 but converges to a similar value by step 500.

Interpretation

Gold-answer STaR was previously believed to work via "rationalization." Our ablation battery shows that nearly all of the gain comes from the retry trigger itself, with the gold answer adding only ~3pp on top.

A practitioner who lacks gold answers loses only ~3pp by using bare retry. The gold answer is a convenience for data yield, not a fundamental requirement.

Connection to Other Experiments

STaR "Try Again" (+8.8pp) - the 73% baseline
Nearly all the gain comes from retry itself. Gold answer provides diminishing marginal value.
SFT on Gold Solutions (+0.6pp) - supervision fails
Training on expert solutions gives +0.6pp. Self-generation with retry gives +12.0pp. The self-generation mechanism is 20x more effective than imitation.
8B SD-Zero (+3.0pp) - scale challenges
At 8B, gold-answer STaR gives only +3.0pp (vs +12.0pp at 1.7B). LoRA SFT becomes less effective at the larger scale, likely due to hyperparameter sensitivity.
Binary Filter (+12.9pp, single seed) - filtering is king
Binary STaR at seed 42 gives +12.9pp, slightly exceeding gold-answer mean. Strict binary filtering matters more than PI content.