Gold-Answer STaR / SD-Zero (3-seed)

The ceiling condition: the model revises with the gold answer visible, producing "rationalized" solutions. Sets the upper bound for all retry-based methods at +12.0pp, with tight reproducibility across seeds.

+12.0 +/- 0.7 pp (3 seeds)
MATH-500, 3-seed confirmed ceiling

Hypothesis

Providing the gold answer during retry should maximize revision success rate (the model knows exactly what to aim for). This is the original STaR mechanism (Zelikman et al., 2022) and what He et al. (2026) call "SD-Zero." It should be the best possible single-retry method.

Expected: Best outcome for retry methods. Sets the ceiling.

Actual: +12.0pp, confirmed as ceiling. But only ~3pp above bare "try again" (+8.8pp).

Method

  1. First attempt: Generate one solution per problem at T=0.7.
  2. Retry with gold answer: For failures, append "The correct answer is [GOLD]. Please solve the problem showing how to arrive at this answer."
  3. Filter: Keep only solutions that correctly derive the gold answer.
  4. SFT: Fine-tune on (originally correct) + (successful rationalizations).

This is "SD-Zero" from He et al. (2026): self-revision conditioned on gold answer, filtered, SFT.
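A minimal sketch of this attempt-retry-filter loop, assuming hypothetical helpers generate (one sample at T=0.7) and extract_final_answer (parses the model's final answer); neither name comes from the experiment code, and pairing the kept revision with the original question rather than the gold-conditioned prompt is an assumption following standard STaR rationalization.

```python
def gold_answer_star_round(problems, generate, extract_final_answer):
    """One SD-Zero data-collection round: attempt, retry with gold answer, filter.

    `problems` is a list of dicts with "question" and "gold" fields;
    `generate` and `extract_final_answer` are assumed helpers (model call at
    T=0.7 and a final-answer parser), named here for illustration only.
    """
    sft_examples = []
    for prob in problems:
        # 1. First attempt at T=0.7.
        first = generate(prob["question"], temperature=0.7)
        if extract_final_answer(first) == prob["gold"]:
            sft_examples.append({"prompt": prob["question"], "completion": first})
            continue

        # 2. Retry with the gold answer appended to the prompt (rationalization).
        retry_prompt = (
            prob["question"]
            + f"\nThe correct answer is {prob['gold']}. "
            "Please solve the problem showing how to arrive at this answer."
        )
        revision = generate(retry_prompt, temperature=0.7)

        # 3. Filter: keep the revision only if it actually derives the gold answer.
        if extract_final_answer(revision) == prob["gold"]:
            # Assumption: train on the original question, not the gold-conditioned prompt.
            sft_examples.append({"prompt": prob["question"], "completion": revision})

    # 4. The returned examples feed the LoRA SFT stage.
    return sft_examples
```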

Configuration

Model: Qwen3-1.7B
Dataset: NuminaMath-CoT-10k
Eval benchmark: MATH-500 (pass@1)
Training steps: 500
Learning rate: 2e-5
LoRA rank: 16
Seeds: 42, 123, 456
PI content: Gold answer (correct)
Revision success: ~60%
Hardware: 1x H200 (p5en.48xl)
Runtime: ~2.5h per seed
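A rough sketch of how this configuration might map onto a PEFT LoRA fine-tuning setup; the Hugging Face model identifier, target modules, lora_alpha, and batch size are assumptions, not values taken from the experiment scripts.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_name = "Qwen/Qwen3-1.7B"  # assumed Hugging Face identifier for the 1.7B model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA rank 16 as in the configuration table; alpha and target modules are guesses.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,                        # not reported; common 2*r choice
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# 500 steps at lr 2e-5 on one H200, matching the configuration table.
training_args = TrainingArguments(
    output_dir="sd_zero_gold",
    max_steps=500,
    learning_rate=2e-5,
    per_device_train_batch_size=8,        # not reported; illustrative
    bf16=True,
    logging_steps=50,
)
# training_args and the filtered SFT examples would then feed a standard
# Trainer / SFTTrainer loop (omitted here).
```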

Results

Seed   Baseline   Post-training   Delta
42     40.8%      52.9%           +12.1pp
123    40.8%      52.1%           +11.3pp
456    40.8%      53.5%           +12.7pp
Mean   40.8%      52.8%           +12.0 +/- 0.7pp
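The reported mean and spread can be reproduced from the per-seed deltas, assuming the +/- 0.7 is the sample standard deviation across seeds:

```python
import numpy as np

deltas = np.array([12.1, 11.3, 12.7])   # per-seed deltas in pp
print(round(deltas.mean(), 1))          # 12.0
print(round(deltas.std(ddof=1), 1))     # 0.7 (sample std across the 3 seeds)
```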

Decomposing the +12.0pp

Component                        Contribution   Evidence
Failure signal (retry trigger)   +8.8pp (73%)   Bare "try again" gives +8.8pp
Higher success rate from gold    +3.2pp (27%)   Gap: 12.0 - 8.8 = 3.2pp
Information guiding reasoning    ~0pp           Wrong PI gives +9.8pp (noise band)

Ceiling interpretation: The gold answer raises retry success from ~49% to ~60%, producing ~3,900 correct revisions instead of ~3,200 under bare retry. The extra training data accounts for the ~3pp gap over bare retry. Information content contributes through data volume, not reasoning quality.
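Back-of-the-envelope arithmetic behind the data-volume claim; the ~6,500 failure count is inferred from the reported yields (~3,900 / 0.60 and ~3,200 / 0.49), not stated directly in the write-up.

```python
# Numbers implied by the text: ~6,500 failed first attempts (inferred),
# ~60% revision success with the gold answer vs ~49% for bare "try again".
failures = 6500
gold_success, bare_success = 0.60, 0.49

gold_examples = failures * gold_success    # ~3,900 rationalized solutions
bare_examples = failures * bare_success    # ~3,200 bare-retry solutions
print(gold_examples, bare_examples, gold_examples - bare_examples)  # 3900.0 3185.0 715.0
```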

Training Curves

Stored at: /data/ughai-sandbox/opsd_experiments/sd_zero_gold/. Training converges faster than with bare retry because the dataset contains more examples. Validation loss is slightly better at step 200 but converges to a similar value by step 500.

Interpretation

Gold-answer STaR was previously believed to work via "rationalization." Our ablation battery shows that nearly all of the gain comes from the retry trigger itself, with the gold answer adding only ~3pp on top.

A practitioner who lacks gold answers loses only ~3pp by using bare retry. The gold answer is a convenience for data yield, not a fundamental requirement.

Connection to Other Experiments

STaR "Try Again" (+8.8pp) - the 73% baseline
Nearly all the gain comes from retry itself. Gold answer provides diminishing marginal value.
SFT on Gold Solutions (+0.6pp) - supervision fails
Training on expert solutions gives +0.6pp. Self-generation with retry gives +12.0pp. The self-generation mechanism is 20x more effective than imitation.
8B SD-Zero (+3.0pp) - scale challenges
At 8B, gold-answer STaR gives only +3.0pp (vs +12.0pp at 1.7B). LoRA SFT becomes less effective at the larger scale, likely due to hyperparameter sensitivity.
Binary Filter (+12.9pp, single seed) - filtering is king
Binary STaR at seed 42 gives +12.9pp, slightly exceeding gold-answer mean. Strict binary filtering matters more than PI content.