The ceiling condition. The model revises with the gold answer visible, producing "rationalized" solutions. This sets the upper bound for all retry-based methods: +12.0pp, with tight seed-to-seed reproducibility.
Providing the gold answer during retry should maximize revision success rate (the model knows exactly what to aim for). This is the original STaR mechanism (Zelikman et al., 2022) and what He et al. (2026) call "SD-Zero." It should be the best possible single-retry method.
Expected: Best outcome for retry methods. Sets the ceiling.
Actual: +12.0pp, confirmed as the ceiling, but only ~3.2pp above bare "try again" (+8.8pp).
Pipeline ("SD-Zero," He et al., 2026): self-revision conditioned on the gold answer, rejection-filtered to correct revisions, then SFT.
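The revise-filter-collect step can be sketched as below. `revise_fn` and `check_fn` are hypothetical stand-ins for the model call and the answer checker; they are not names from the experiment code, and the prompt wording is illustrative only.

```python
def collect_sd_zero_data(failures, revise_fn, check_fn):
    """Gold-conditioned retry: for each failed problem, ask the model to
    revise with the gold answer visible, then keep only revisions whose
    final answer actually matches gold (rejection filtering, as in STaR).
    Returns (problem, revision) pairs for SFT."""
    sft_examples = []
    for problem, gold in failures:
        prompt = (f"{problem}\n"
                  f"Your previous answer was wrong. The correct answer is {gold}. "
                  f"Write a corrected solution.")
        revision = revise_fn(prompt)
        if check_fn(revision, gold):  # filter: revision must reach the gold answer
            # Train on the original problem, not the gold-hinted prompt, so the
            # model learns to produce the solution without the hint at test time.
            sft_examples.append((problem, revision))
    return sft_examples
```

The filter is what makes the gold answer a data-yield lever: a higher per-retry success rate means more pairs survive into the SFT set.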
| Seed | Baseline | Post-training | Delta |
|---|---|---|---|
| 42 | 40.8% | 52.9% | +12.1pp |
| 123 | 40.8% | 52.1% | +11.3pp |
| 456 | 40.8% | 53.5% | +12.7pp |
| Mean | 40.8% | 52.8% | +12.0 +/- 0.7pp |
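The mean row can be reproduced from the per-seed deltas; this uses the sample (n-1) standard deviation, which matches the reported +/- 0.7pp:

```python
import statistics

deltas = [12.1, 11.3, 12.7]          # per-seed post-training gains, in pp
mean = statistics.mean(deltas)
std = statistics.stdev(deltas)       # sample standard deviation (n-1)
print(f"+{mean:.1f} +/- {std:.1f}pp")  # -> +12.0 +/- 0.7pp
```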
Decomposition of the +12.0pp gain:

| Component | Contribution | Evidence |
|---|---|---|
| Failure signal (retry trigger) | +8.8pp (73%) | Bare "try again" gives +8.8pp |
| Higher success rate from gold | +3.2pp (27%) | Gap: 12.0 - 8.8 = 3.2pp |
| Information guiding reasoning | ~0pp | Wrong PI gives +9.8pp (noise band) |
Ceiling interpretation: the gold answer raises retry success from ~49% to ~60%, yielding ~3,900 correct revisions instead of ~3,200. That extra training data accounts for the ~3pp gap over bare retry: the gold answer's information content contributes through **data volume**, not reasoning quality.
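A rough consistency check on the yield arithmetic. The number of retried failures (~6,500) is an illustration value inferred from the stated yields and success rates, not a figure reported in this log:

```python
# n_failures is an assumed/inferred value for illustration only.
n_failures = 6500
p_bare, p_gold = 0.49, 0.60          # approximate retry success rates from the text

yield_bare = round(n_failures * p_bare)  # correct revisions without the gold answer
yield_gold = round(n_failures * p_gold)  # correct revisions with the gold answer
extra = yield_gold - yield_bare
print(yield_bare, yield_gold, extra)     # ~3,200 vs ~3,900 correct revisions
```

The ~700 extra filtered examples are the mechanism behind the ~3pp gap, on this reading.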
Stored at: /data/ughai-sandbox/opsd_experiments/sd_zero_gold/. Training converges faster than bare retry due to more examples. Validation loss is slightly better at step 200 but converges similarly by step 500.
Gold-answer STaR was previously believed to work via "rationalization." Our ablation battery suggests a different account: most of the gain comes from the retry trigger itself, and the gold answer mainly raises data yield. A practitioner who lacks gold answers loses only ~3pp by using bare retry; the gold answer is a convenience for data volume, not a fundamental requirement.