STaR Gibberish "XYZZY" (3-seed)

Providing a nonsense string ("XYZZY") as the supposed "correct answer" yields +10.8pp. Gibberish PI is statistically indistinguishable from gold-answer PI (+12.0pp), strong evidence that the content of the hint is genuinely irrelevant.

Result: +10.8 +/- 1.7 pp (3 seeds)
Status: MATH-500, 3-seed confirmed

Hypothesis

If the model's improvement from STaR requires ANY interpretable signal in the retry prompt, then a completely uninterpretable target should produce zero or negative gains. You cannot "rationalize toward XYZZY."

Expected (if rationalization matters): Near zero. You cannot aim at nonsense.

Actual: +10.8pp, overlapping with the gold-answer ceiling within CI.

Method

  1. First attempt: Generate one solution per problem. Grade against real gold answer.
  2. Retry with gibberish: For failures, append "The correct answer is XYZZY. Please solve the problem." The string "XYZZY" is a classic adventure game command with no mathematical meaning.
  3. Filter: Keep only solutions arriving at the REAL correct answer.
  4. SFT: Fine-tune on correct solutions.

The model cannot "aim" at XYZZY. It interprets the prompt as a general retry signal, explores alternative paths, and the filter selects correct outcomes.
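For concreteness, here is a minimal sketch of steps 1-4, assuming hypothetical `generate` and `grade` helpers (names are illustrative; this is not the actual experiment code):

```python
GIBBERISH_HINT = "The correct answer is XYZZY. Please solve the problem."

def collect_star_sft_data(problems, generate, grade):
    """One STaR round with a gibberish retry hint.

    generate(prompt) -> str        (hypothetical sampling helper)
    grade(solution, gold) -> bool  (always checks the REAL gold answer)
    """
    sft_examples = []
    for prob in problems:
        first = generate(prob["question"])
        if grade(first, prob["gold"]):
            sft_examples.append((prob["question"], first))
            continue
        # The retry names a nonsense target, but grading still uses the
        # real gold answer, so only genuinely correct retries survive.
        retry = generate(prob["question"] + "\n" + GIBBERISH_HINT)
        if grade(retry, prob["gold"]):
            # Assumption (standard STaR practice): train on the clean
            # question, dropping the hint from the SFT prompt.
            sft_examples.append((prob["question"], retry))
    return sft_examples
```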

Configuration

Parameter | Value
--- | ---
Model | Qwen3-1.7B
Dataset | NuminaMath-CoT-10k
Eval benchmark | MATH-500 (pass@1)
Training steps | 500
Learning rate | 2e-5
LoRA rank | 16
Seeds | 42, 123, 456
PI content | "XYZZY" (gibberish)
Hardware | 1x H200 (p5en.48xl)
Runtime | ~2.5h per seed
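As a rough illustration only, this configuration might map onto a peft/TRL setup along these lines (dataset preparation and the actual training script are not shown in this writeup; `lora_alpha` is an assumption, since it isn't reported above):

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

lora = LoraConfig(
    r=16,                      # LoRA rank from the table
    lora_alpha=32,             # assumption: not reported above
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="star_gibberish_seed42",
    max_steps=500,             # training steps from the table
    learning_rate=2e-5,        # from the table
    seed=42,                   # repeated for seeds 42, 123, 456
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-1.7B",
    args=args,
    train_dataset=sft_dataset,  # the filtered correct solutions
    peft_config=lora,
)
trainer.train()
```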

Results

Seed | Baseline | Post-training | Delta
--- | --- | --- | ---
42 | 40.8% | 51.5% | +10.7pp
123 | 40.8% | 53.3% | +12.5pp
456 | 40.8% | 49.9% | +9.1pp
Mean | 40.8% | 51.6% | +10.8 +/- 1.7pp
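The headline mean and spread follow directly from the per-seed deltas (sample standard deviation):

```python
import statistics

deltas = [10.7, 12.5, 9.1]             # per-seed deltas from the table
mean = statistics.mean(deltas)         # 10.77 -> reported as +10.8pp
sd = statistics.stdev(deltas)          # 1.70 -> reported as +/- 1.7pp
print(f"{mean:+.1f} +/- {sd:.1f} pp")  # +10.8 +/- 1.7 pp
```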

The absurdity is the point: a model told "the answer is XYZZY" improves by 10.8 percentage points, which is hard to reconcile with rationalization toward a target as the mechanism. The simplest remaining explanation is that the model treats ANY retry prompt as a signal to explore alternatives.

Why Gibberish Works

The model cannot parse "XYZZY" as a mathematical target. Instead:

  1. It reads the retry prompt as a generic "try again" cue.
  2. It samples alternative solution paths from a shifted distribution.
  3. The correctness filter keeps only solutions reaching the REAL gold answer, so the nonsense never contaminates the training data.

Higher variance (+/-1.7pp vs +/-0.1pp for wrong answers) likely reflects that gibberish adds more randomness to generation than structured wrong answers.
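For reference, a minimal sketch of what the correctness filter might look like, assuming MATH-style solutions that end in a \boxed{...} answer (the grader actually used is not shown in this writeup):

```python
import re

def extract_boxed(solution: str) -> str | None:
    # Take the last \boxed{...} in the solution, allowing one
    # level of nested braces (e.g. \boxed{\frac{1}{2}}).
    matches = re.findall(r"\\boxed\{((?:[^{}]|\{[^{}]*\})*)\}", solution)
    return matches[-1].strip() if matches else None

def is_correct(solution: str, gold: str) -> bool:
    # Grading always compares to the REAL gold answer; the "XYZZY"
    # string in the retry prompt plays no role here.
    pred = extract_boxed(solution)
    return pred is not None and pred == gold.strip()
```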

Training Curves

Logs at: /data/ughai-sandbox/opsd_experiments/star_gibberish/. Retry success rate: ~45% (between bare retry at 49% and wrong answers at 42%). The nonsense token slightly perturbs generation but not enough to materially reduce correct-solution yield.

Interpretation

Together with "try again" and "wrong answers," this completes the ablation battery:

What the model is told | Gain | What the model actually does
--- | --- | ---
Real answer | +12.0pp | Retry with slight directional guidance
Gibberish "XYZZY" | +10.8pp | Retry, ignoring the nonsense
Wrong answer | +9.8pp | Retry; the filter removes misled solutions
Nothing | +8.8pp | Retry on a bare failure signal

The spread (8.8 to 12.0) is within noise. All four conditions are doing the same thing: triggering a second attempt from a shifted distribution.
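As a quick check on "within noise": a one-sample t-test of the gibberish per-seed deltas against the gold-answer mean (+12.0pp treated as a fixed reference, since gold's per-seed values are not listed here):

```python
from scipy import stats

gibberish_deltas = [10.7, 12.5, 9.1]  # from the results table
t, p = stats.ttest_1samp(gibberish_deltas, popmean=12.0)
print(f"t = {t:.2f}, p = {p:.2f}")    # ~ t = -1.26, p = 0.34: not significant
```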

Connection to Other Experiments

  - STaR Wrong Answers (+9.8pp): same mechanism, tighter CI. Both provide non-informative content; wrong answers have tighter variance because structured noise averages more cleanly than random tokens.
  - Gold-Answer STaR (+12.0pp): the 1.2pp gap is noise. Gold is +12.0pp, gibberish is +10.8pp; the gap is well within gibberish's std of 1.7pp, so real information confers no significant advantage.
  - OPSD Random-PI (+1.0pp): why OPSD differs. In OPSD, random PI gives only +1.0pp (vs +5.6pp for correct PI on its best seed). OPSD distills from a TEACHER conditioned on PI, while STaR uses SELF-generation with a binary correctness filter, which makes it robust to noise.