Privileged Information for LLM Post-Training: Mechanism Discovery
Privileged-information (PI) content is irrelevant for LLM self-improvement. Gold answers, gibberish, wrong answers, and bare "try again" all produce statistically indistinguishable gains (+8.8 to +12.0pp). The mechanism is failure-aware distribution shift, not information content. One bit (pass/fail) is all you need. Exception: Lean theorem proving, where structural PI (have-skeleton) gives +3.3pp but retry HURTS (-2.4pp).
Emerging hypothesis (Day 2): Verifier quality, not PI quality, is the binding constraint. Iterative retry (+14.5pp over 2 rounds) shows the pipeline compounds when the verifier is perfect (binary match). Scaling to 8B requires LR tuning (lr=1e-6 gives +2.8pp; the default 5e-7 gives +0.0pp). First-step hints replicate at +3.6 +/- 1.3pp (3-seed). Path diversity is surprisingly similar across domains (math 0.992, Lean 0.969).
Day 3 update: 8B full fine-tune matches 1.7B LoRA (+9.6pp verified, 36.8% -> 46.4%). 32B hits inverted-U right tail (+0.6pp, baseline 68.3% too high). Iterative retry peaks at 2 rounds; round 3 regresses -1.6pp. Lean type-checker and failure-diagnosis PI both give -1.2pp (better than heuristic but still negative). Lean heuristic 3-iter degrades progressively (-3.6pp). Verifier quality hypothesis PARTIALLY RESOLVED: both path narrowness AND verifier strictness matter.
Multi-seed confirmations (Qwen3-1.7B, MATH-500):

| Condition | Seeds | Mean +/- Std | Individual |
|---|---|---|---|
| Gold answer STaR | 3 | +12.0 +/- 0.7pp | +12.1, +11.3, +12.7 |
| Gibberish "XYZZY" | 3 | +10.8 +/- 1.7pp | +10.7, +12.5, +9.1 |
| Wrong answers (shuffled) | 3 | +9.8 +/- 0.1pp | +9.7, +9.9, +9.9 |
| "Try again" (no target) | 3 | +8.8 +/- 1.4pp | +10.1, +7.3, +9.1 |
| OPSD answer-only (5-seed) | 5 | +0.04 +/- 0.9pp | -1.4, +0.4, +0.0, +1.0, +0.2 |
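For reference, a minimal sketch of how the Mean +/- Std column falls out of the per-seed gains; using the sample standard deviation (n-1 denominator) reproduces the reported values:

```python
# Reproduce the Mean +/- Std column from the per-seed gains above.
import statistics

per_seed = {
    "gold":      [12.1, 11.3, 12.7],
    "gibberish": [10.7, 12.5, 9.1],
    "wrong":     [9.7, 9.9, 9.9],
    "try_again": [10.1, 7.3, 9.1],
}

for cond, gains in per_seed.items():
    mean = statistics.mean(gains)
    std = statistics.stdev(gains)  # sample std (n-1), matches the table
    print(f"{cond:>9s}: {mean:+.1f} +/- {std:.1f}pp")
```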
Cross-domain summary:

| Domain | Model | Method | Gain | Note |
|---|---|---|---|---|
| Math (MATH-500) | Qwen3-1.7B | Retry+filter | +8.8pp | 3-seed confirmed |
| Code (HumanEval) | Qwen2.5-Coder-1.5B | Retry+filter | +23.2pp | 1 seed |
| Lean (MiniF2F) | Kimina-1.5B | Have-skeleton OPSD | +3.3pp | Genuine PI value |
| Lean (MiniF2F) | Kimina-1.5B | Retry (bare) | -2.4pp | Retry hurts Lean |
| Math (MATH-500) | Qwen3-8B | Retry+filter | +0.0pp | Scale-dependent / LR issue |
| Code (MBPP) | Qwen2.5-Coder-1.5B | Retry+filter | +1.9pp | Baseline too low (5.8%) |
Full results leaderboard:

| Method | Model | Domain | Gain | Seeds | Note |
|---|---|---|---|---|---|
| Code retry STaR | Qwen2.5-Coder-1.5B | HumanEval | +23.2pp | 1 | Retry dominates code |
| Iterative retry (2 rounds) | Qwen3-1.7B | MATH-500 | +14.5pp | 1 | Round 1 +10.1, round 2 +4.4 |
| Iterative retry (3 rounds) | Qwen3-1.7B | MATH-500 | R3: -1.6pp | 1 | Round 3 regresses; optimal is 2 |
| Binary STaR (retry) | Qwen3-1.7B | MATH-500 | +12.9pp | 1 | Binary filter optimal |
| 8B full fine-tune (verified) | Qwen3-8B | MATH-500 | +9.6pp | 1 | Full FT matches 1.7B LoRA (36.8% -> 46.4%) |
| Gold-answer STaR | Qwen3-1.7B | MATH-500 | +12.0pp | 3 | Ceiling condition |
| Gibberish STaR | Qwen3-1.7B | MATH-500 | +10.8pp | 3 | Nonsense = gold |
| Wrong-answer STaR | Qwen3-1.7B | MATH-500 | +9.8pp | 3 | Wrong = correct PI |
| "Try again" STaR | Qwen3-1.7B | MATH-500 | +8.8pp | 3 | No info needed |
| Retry K=3 | Qwen3-1.7B | MATH-500 | +10.1pp | 1 | 3 retries per problem; optimal K |
| Retry K=2 | Qwen3-1.7B | MATH-500 | +9.1pp | 1 | 2 retries per problem; +0.3pp over K=1 |
| Retry K=5 | Qwen3-1.7B | MATH-500 | +8.5pp | 1 | 5 retries per problem; regression from K=3 (overfitting) |
| Test-suite PI (shuffled) | Qwen2.5-Coder-1.5B | HumanEval | +9.1pp | 1 | Wrong tests = real tests |
| Cross-model hints (8B to 1.7B) | Qwen3-1.7B | MATH-500 | +5.8pp | 1 | No gold needed |
| Cross-model retry (8B for 1.7B) | Qwen3-1.7B | MATH-500 | +4.2pp | 1 | 8B solutions for 1.7B student; modest gain despite low retry rate |
| OPSD PCCG (frontier) | Qwen3-1.7B | MATH-500 | +3.6pp | 1 | Best OPSD variant |
| First-step hint (3-seed) | Qwen3-1.7B | MATH-500 | +3.6pp | 3 | +3.6 +/- 1.3pp reproducible |
| Have-skeleton OPSD | Kimina-1.5B | MiniF2F | +3.3pp | 1 | Genuine PI for Lean |
| 8B retry (lr=1e-6) | Qwen3-8B | MATH-500 | +2.8pp | 1 | Optimal LR for 8B scale |
| Graded verifier | Qwen3-1.7B | MATH-500 | +4.2pp | 1 | 3x worse than binary |
| Temp-uniform N=8 | Qwen3-1.7B | MATH-500 | +2.6pp | 1 | No failure signal |
| Code OPSD (test results) | Qwen2.5-Coder-1.5B | HumanEval | +2.4pp | 1 | Weak vs retry +23pp |
| MBPP retry (1.5B) | Qwen2.5-Coder-1.5B | MBPP | +1.9pp | 1 | Baseline too low (5.8%) |
| Double-sample N=16 | Qwen3-1.7B | MATH-500 | +1.4pp | 1 | 7x worse than retry |
| IPRS (PI reward shaping) | Qwen3-1.7B | MATH-500 | +1.4pp | 1 | RL marginal vs retry+SFT |
| Frontier-rejection N=16 | Qwen3-1.7B | MATH-500 | +1.2pp | 1 | Frontier targeting = nothing |
| 8B retry (lr=2e-6) | Qwen3-8B | MATH-500 | +1.0pp | 1 | Overshoot LR for 8B |
| OPSD (5-seed mean) | Qwen3-1.7B | MATH-500 | +0.04pp | 5 | NULL confirmed |
| 32B retry (LoRA) | Qwen3-32B | MATH-500 | +0.6pp | 1 | Right tail inverted-U (baseline 68.3%) |
| 8B retry (lr=5e-7) | Qwen3-8B | MATH-500 | +0.0pp | 1 | LR too low for 8B SFT |
| MBPP retry (7B) | Qwen2.5-Coder-7B | MBPP | -0.4pp | 1 | Baseline 66.5%, no gain |
| Compute-matched (2x) | Qwen3-1.7B | MATH-500 | -0.6pp | 1 | More compute hurts |
| Rejection sampling | Qwen3-1.7B | MATH-500 | -2.0pp | 1 | Self-SFT hurts |
| Lean type-checker PI | Kimina-1.5B | MiniF2F | -1.2pp | 1 | Compiler errors as PI; better than bare retry but still negative |
| Lean failure-diagnosis PI | Kimina-1.5B | MiniF2F | -1.2pp | 1 | LLM-generated diagnosis; error messages don't help |
| Lean retry (bare) | Kimina-1.5B | MiniF2F | -2.4pp | 1 | Retry hurts Lean |
| Lean skeleton+retry | Kimina-1.5B | MiniF2F | -2.0pp | 1 | Skeleton PI + retry still negative; structural PI does not rescue retry |
| Lean heuristic 3-iter | Kimina-1.5B | MiniF2F | -3.6pp | 1 | Progressive degradation; iterative retry is poison for Lean |
Path diversity (diagnostic): math 0.992, Lean 0.969. Surprisingly similar; path narrowness alone does not explain Lean failure.
The double-sample control (+1.4pp) vs retry (+10.1pp) proves that the failure signal genuinely shifts what the model generates. It is not more sampling; it is qualitatively different reasoning.
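A minimal sketch of the two conditions being compared, with `generate` and `is_correct` as hypothetical stand-ins for the actual sampler and verifier (the real interfaces are not recorded in these notes):

```python
def double_sample(generate, is_correct, prompt, n=16):
    """Control: 2x sampling with no failure signal. Every attempt sees
    the same prompt, so the model keeps drawing from its dominant mode."""
    sols = [generate(prompt) for _ in range(n)]
    return [s for s in sols if is_correct(s)]

def retry_sample(generate, is_correct, prompt, n=8,
                 retry_suffix="\nThat attempt was incorrect. Try again.\n"):
    """Retry: failures are re-prompted with the one-bit failure signal
    in context, which suppresses the dominant mode and surfaces
    alternative reasoning paths."""
    kept = []
    for _ in range(n):
        first = generate(prompt)
        if is_correct(first):
            kept.append(first)
            continue
        second = generate(prompt + first + retry_suffix)
        if is_correct(second):
            kept.append(second)
    return kept
```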
| What teacher sees | Gain | Interpretation |
|---|---|---|
| Gold answer (correct) | +12.0pp | Ceiling (slightly helps target correct approach) |
| Gibberish "XYZZY" | +10.8pp | Nonsense = same as gold |
| Wrong answer (shuffled) | +9.8pp | Wrong = same as correct |
| Nothing ("try again") | +8.8pp | No info needed at all |
All conditions overlap at the 95% CI. The CONTENT of the hint is irrelevant.
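The exact prompt wording is not recorded in these notes; a hypothetical reconstruction of the four teacher conditions makes the design explicit. Only the hint payload varies; the implicit failure notification is shared:

```python
# Hypothetical reconstructions of the four teacher prompts. The shared
# element is the failure notification; only the hint payload differs.
HINT_TEMPLATES = {
    "gold":      "Your previous attempt was wrong. Hint: the answer is {gold_answer}.",
    "gibberish": "Your previous attempt was wrong. Hint: the answer is XYZZY.",
    "wrong":     "Your previous attempt was wrong. Hint: the answer is {shuffled_answer}.",
    "try_again": "Your previous attempt was wrong. Try again.",
}
```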
| Factor | Evidence | Effect |
|---|---|---|
| Failure notification | Double-sample +1.4 vs retry +10.1 | 7x multiplier |
| Binary correctness filter | Binary +12.9 vs graded +4.2 | 3x multiplier |
| Frontier data selection | PCCG +3.6 vs full-data +0.04 | 90x multiplier |
| Model competence (baseline) | MBPP 5.8% +1.9pp vs HumanEval 54% +23pp | Threshold effect |
| Solution path diversity | Math retry +8.8pp vs Lean retry -2.4pp | Domain boundary |
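A sketch of the binary-vs-graded filtering factor, assuming a hypothetical partial-credit scorer `grade` in [0, 1] (the actual graded verifier is not specified in these notes):

```python
def binary_filter(samples, gold_answer):
    # +12.9pp condition: exact final-answer match, one bit per sample.
    return [s for s in samples if s["final_answer"] == gold_answer]

def graded_filter(samples, gold_answer, grade, threshold=0.5):
    # +4.2pp condition: partial credit admits plausible-but-wrong
    # reasoning into the SFT set, diluting the signal roughly 3x.
    return [s for s in samples if grade(s, gold_answer) >= threshold]
```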
OPSD gains are concentrated at Level 3-4 (the frontier band, 36-46% baseline solve rate). Levels that are too easy or too hard show minimal/no improvement. This is the inverted-U operating AT THE PROBLEM LEVEL.
| Level | Base | OPSD | Delta | Interpretation |
|---|---|---|---|---|
| 1 (easy) | 69.8% | 72.1% | +2.3pp | Already solved; ceiling effect |
| 2 | 60.0% | 60.0% | +0.0pp | High baseline, nothing to learn |
| 3 | 46.1% | 52.0% | +5.9pp | Frontier band (peak gain) |
| 4 | 36.2% | 40.9% | +4.7pp | Frontier band (strong gain) |
| 5 (hard) | 18.7% | 17.9% | -0.8pp | Too hard; retries also fail |
Key insight: Gains concentrate at the frontier (30-50% solve rate per level). The inverted-U is not just a model-level phenomenon; it operates per-problem within a single model.
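A minimal sketch of the frontier-band selection this table implies, assuming per-problem baseline solve rates estimated from N samples (`solve_rate` and the field names are illustrative):

```python
def frontier_subset(problems, solve_rate, lo=0.30, hi=0.50):
    """Keep problems whose estimated baseline solve rate falls in the
    frontier band where gains concentrate (30-50% per the table)."""
    return [p for p in problems if lo <= solve_rate[p["id"]] <= hi]
```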
Lean theorem proving is the domain exception. The complete hierarchy shows that ONLY structural PI (have-skeleton OPSD) produces positive gains. All retry-based methods are negative, and combining skeleton with retry interferes rather than compounds.
Key insight: Skeleton+retry (-2.0pp) is WORSE than skeleton alone (+3.3pp). The retry mechanism actively destroys the benefit of structural PI. In constrained domains, retry forces the model to abandon its skeleton-guided strategy, exploring alternatives that violate the type-checker. The mechanisms are antagonistic, not additive.
Worst case: Lean heuristic 3-iter reaches -3.6pp (progressive degradation). Iterative retry is poison for formal domains.
Retry IS PI distillation where the privileged information is one bit: "you failed." The failure bit suppresses the model's dominant (incorrect) solution mode and activates alternative reasoning paths; SFT on successful retries then teaches the model to use failure-aware reasoning as its default first-attempt behavior. At test time, the model produces "second-attempt quality" solutions without needing the failure prompt.
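A minimal end-to-end sketch of this loop, with `generate`, `is_correct`, and `finetune` as hypothetical stand-ins for the sampler, verifier, and SFT step:

```python
def retry_distill(generate, is_correct, finetune, problems,
                  retry_suffix="\nThat attempt was incorrect. Try again.\n"):
    sft_pairs = []
    for p in problems:
        first = generate(p["prompt"])
        if is_correct(p, first):
            continue  # dominant mode already works; nothing to distill
        # The only privileged information injected is the failure bit.
        second = generate(p["prompt"] + first + retry_suffix)
        if is_correct(p, second):
            # Train on (clean prompt -> successful retry): at test time
            # the model gives second-attempt quality without the prompt.
            sft_pairs.append((p["prompt"], second))
    return finetune(sft_pairs)
```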
Expected gain: +8-12pp at 1.7B scale on math, +23pp on code (if baseline > 40%). Exception: formal verification (Lean) needs structural PI.
Retry gains are maximized when the model has moderate baseline competence (roughly 30-55%). Too easy (nothing to retry) or too hard (retries also fail) both yield zero gain.
| Model | Domain | Baseline | Retry Gain | Position on Curve |
|---|---|---|---|---|
| Qwen2.5-Coder-1.5B | HumanEval | 54.2% | +23.2pp | Sweet spot |
| Qwen3-1.7B | MATH-500 | 39.7% | +8.8pp | Sweet spot |
| Qwen3-8B (LoRA) | MATH-500 | 35.5% | +2.8pp | Needs LR tuning |
| Qwen3-8B (full FT) | MATH-500 | 36.8% | +9.6pp | Full FT recovers 1.7B gains (verified: 46.4%) |
| Qwen3-32B | MATH-500 | 68.3% | +0.6pp | Right tail (baseline too high) |
| Qwen2.5-Coder-1.5B | MBPP | 5.8% | +1.9pp | Left tail (too hard) |
| Qwen2.5-Coder-7B | MBPP | 66.5% | -0.4pp | Right tail (too easy) |
| Kimina-1.5B | MiniF2F | 39.3% | -2.4pp | Domain exception (Lean) |
The sweet spot appears to be 30-55% baseline accuracy for math/code domains. Below ~10%, the model cannot produce correct retries. Above ~65%, there is nothing left to learn from retry.
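The decision rule this table implies, written as a hedged heuristic (thresholds are read off these runs, not derived):

```python
def retry_is_indicated(baseline_acc, domain):
    if domain == "lean":
        return False                       # domain exception: retry is net negative
    return 0.30 <= baseline_acc <= 0.55    # empirical sweet spot for math/code
```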
The 8B null result was a hyperparameter artifact. The optimal LR shifts upward with model scale, likely because LoRA updates need larger steps to overcome the model's stronger priors.
| Scale | Optimal LR | Best Gain | Retry Rate | Note |
|---|---|---|---|---|
| 1.7B | 5e-7 (default) | +8.8pp | ~35% | LoRA r=32 is sufficient |
| 8B (LoRA) | 1e-6 | +2.8pp | ~28% | LoRA r=64, 2x LR needed |
| 8B (full FT) | 1e-6 | +9.6pp | ~28% | Full FT recovers full gains (verified: 36.8% -> 46.4%) |
| 32B (LoRA) | 2e-6 | +0.6pp | ~12% | Baseline 68.3% too high; inverted-U right tail |
The 32B result (+0.6pp) confirms the inverted-U hypothesis but for a different reason than expected. The issue is NOT LoRA capacity or LR tuning; it is that 32B's baseline (68.3%) is already past the sweet spot. There is nothing left to learn from retry at this performance level.
Implication: Retry is a method for STRUGGLING models (30-55% baseline). At 32B, the model is too good for retry to help on MATH-500. It might still help on harder benchmarks (AIME, Putnam-level) where the 32B baseline is lower.
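The scale/LR findings condensed into a lookup, with values taken from the runs above (field names are illustrative, not the actual training configs):

```python
# Scale-dependent retry-SFT recipes observed so far (illustrative names).
RETRY_SFT_RECIPES = {
    "1.7B": {"method": "lora", "rank": 32, "lr": 5e-7},  # +8.8pp
    "8B":   {"method": "full_ft", "lr": 1e-6},           # +9.6pp; LoRA r=64 at 1e-6 only +2.8pp
    "32B":  {"method": None, "lr": None},                # baseline past sweet spot; skip retry
}
```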
Increasing retries per problem shows diminishing and then negative returns. K=3 is optimal; K=5 regresses, likely due to overfitting on low-quality late retries.
| Retries (K) | Gain | Delta vs previous tested K | Interpretation |
|---|---|---|---|
| K=1 | +8.8pp | -- | Baseline retry (single attempt after failure) |
| K=2 | +9.1pp | +0.3pp | Slight improvement from extra attempt |
| K=3 (optimal) | +10.1pp | +1.0pp | Peak gain; best cost/benefit tradeoff |
| K=5 | +8.5pp | -1.6pp | Regression; late retries are low-quality and dilute training signal |
Key insight: Retry count has its own inverted-U. More retries initially help (more chances to find correct solutions) but past K=3, late-retry solutions are low quality (the model has exhausted its good alternative strategies) and training on them dilutes the signal.
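A sketch of the K-retry loop with the empirically optimal cap K=3; each failed attempt stays in context before the next retry (`generate` and `is_correct` are hypothetical stand-ins):

```python
def solve_with_retries(generate, is_correct, prompt, k=3):
    context = prompt
    for _ in range(1 + k):  # first attempt plus up to k retries
        sol = generate(context)
        if is_correct(sol):
            return sol
        context += f"\n{sol}\nThat attempt was incorrect. Try again.\n"
    return None  # exhausted: past K=3, late retries dilute the SFT signal
```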
In-flight jobs:

| Job | Category | Status | Expected | Notes |
|---|---|---|---|---|
| ughai-opsd-skeleton-s123 | Lean multi-seed | running | ~4h | Have-skeleton 3-seed confirmation (seed 123) |
| ughai-opsd-skeleton-s456 | Lean multi-seed | running | ~4h | Have-skeleton 3-seed confirmation (seed 456) |
| ughai-sd-zero-opsd-hybrid | Mechanism | ~44h in | ~48h | SD-Zero + OPSD combination (finishing) |
Cluster: CMH (us-east-2), p5en-queue. Completed Day 3: 8B full FT (+9.6pp verified), 32B retry (+0.6pp), iterative R3 (-1.6pp), Lean type-checker PI (-1.2pp), Lean failure-diagnosis (-1.2pp), Lean heuristic 3-iter (-3.6pp), HAR math, skeleton-guided retry, skeleton+retry (-2.0pp).