PI Distillation Research Dashboard

Privileged Information for LLM Post-Training: Mechanism Discovery

Last updated: 2026-05-11 (auto-generated)

Headline Finding

PI content is irrelevant for LLM self-improvement. Gold answers, gibberish, wrong answers, and bare "try again" all produce statistically indistinguishable gains (+8.8 to +12.0pp). The mechanism is failure-aware distribution shift, not information content. One bit (pass/fail) is all you need. Exception: Lean theorem proving, where structural PI (have-skeleton) gives +3.3pp but retry HURTS (-2.4pp).

Emerging hypothesis (Day 2): Verifier quality, not PI quality, is the binding constraint. Iterative retry (+14.5pp over 2 rounds) shows the pipeline compounds when the verifier is perfect (binary match). Scaling to 8B requires LR tuning (lr=1e-6 gives +2.8pp; default 5e-7 gives 0). First-step hints replicate at +3.6 +/- 1.3pp (3-seed). Path diversity is surprisingly similar across domains (math 0.992, Lean 0.969).

Day 3 update: 8B full fine-tune matches 1.7B LoRA (+9.6pp verified, 36.8% -> 46.4%). 32B hits inverted-U right tail (+0.6pp, baseline 68.3% too high). Iterative retry peaks at 2 rounds; round 3 regresses -1.6pp. Lean type-checker and failure-diagnosis PI both give -1.2pp (better than heuristic but still negative). Lean heuristic 3-iter degrades progressively (-3.6pp). Verifier quality hypothesis PARTIALLY RESOLVED: both path narrowness AND verifier strictness matter.

Experiments
STaR "Try Again" (3-seed) +8.8pp
Bare "try again" with no PI gives +8.8pp mean. The headline null result that proves PI content is irrelevant.
Math 3 seeds
STaR Wrong Answers (3-seed) +9.8pp
Shuffled (wrong) answers as PI give a gain identical to correct answers. Tightest CI of all conditions (+/-0.1pp).
Math 3 seeds
STaR Gibberish "XYZZY" (3-seed) +10.8pp
Nonsense string as PI gives statistically indistinguishable gain from gold answer. Content is genuinely irrelevant.
Math 3 seeds
Gold-Answer STaR (3-seed) +12.0pp
The ceiling condition. Full gold answer as PI. Sets the upper bound for all retry-based methods. Reproducible across seeds.
Math 3 seeds
Double-Sample N=16 (no retry) +1.4pp
16 first-attempt solutions, filter correct, SFT. 7x worse than retry. Proves failure signal IS causal, not just more sampling.
Math 1 seed
OPSD 5-Seed Null Result +0.04pp
The experiment that killed OPSD. 5 seeds average +0.04pp. The original +5.6pp was a lucky seed. Continuous (OPSD-style) PI distillation has no reliable effect.
Math 5 seeds
Graded Verifier vs Binary Filter +4.2 vs +12.9
Partial credit (graded) is 3x worse than binary pass/fail filtering. Near-miss solutions actively dilute the training signal.
Math 1 seed
Code Retry (HumanEval) +23.2pp
Massive gain from retry on code. Cross-domain replication confirms the mechanism is not math-specific.
Code 1 seed
Lean Have-Skeleton OPSD +3.3pp
The one domain where PI genuinely matters. Structural proof decomposition helps the model select correct proof strategies.
Lean 1 seed
Lean Retry (no PI) -2.4pp
Retry HURTS Lean. Unlike math/code, bare retry degrades performance. Structural PI is genuinely needed for formal domains.
Lean 1 seed
Iterative Retry (2 rounds) +14.5pp
Two rounds of retry+filter+SFT compound: round 1 +10.1pp, round 2 +4.4pp. Exceeds gold-answer ceiling. Verifier quality enables iteration.
Math 1 seed
First-Step Hint (3-seed) +3.6pp
Providing the first proof step as PI gives +3.6 +/- 1.3pp across 3 seeds. Modest but reproducible; partial PI has partial value.
Math 3 seeds
8B LR Sweep (resolved) +2.8pp
8B null was a hyperparameter issue: lr=5e-7 +0.0, lr=1e-6 +2.8pp, lr=2e-6 +1.0pp. Optimal LR shifts with scale. Retry works at 8B.
Math 3 LRs
IPRS (PI Reward Shaping) +1.4pp
GRPO with PI-informed progress reward. RL approaches remain marginal compared to retry+filter+SFT (+8-12pp).
Math 1 seed
Frontier-Rejection N=16 +1.2pp
Selecting frontier-difficulty problems for rejection sampling adds negligible value. Problem targeting is not the bottleneck.
Math 1 seed
MBPP 7B Retry -0.4pp
7B model on MBPP (baseline 66.5%) shows no gain from retry. Consistent with the inverted-U right tail: at a 66.5% baseline, too few failures remain to learn from.
Code 1 seed
Path Diversity Analysis diagnostic
Surprise: math (0.992) and Lean (0.969) have similar path diversity. The retry-fails-for-Lean puzzle is NOT explained by solution-path narrowness alone.
Math Lean
8B Full Fine-Tune (verified) +9.6pp
Full fine-tune at 8B matches 1.7B LoRA gains (36.8% -> 46.4%). Confirms LoRA capacity was the constraint at scale, not the method.
Math 1 seed
32B Retry (inverted-U right tail) +0.6pp
32B baseline is 68.3%, too high for retry gains. Confirms inverted-U: above ~65% baseline, nothing left to learn from retry.
Math 1 seed
Iterative Retry (3 rounds) R3: -1.6pp
Round 3 regresses. Optimal is 2 rounds (+14.5pp). Third round overfits on diminishing-quality retry data. Confirms diminishing returns.
Math 1 seed
Lean Type-Checker PI -1.2pp
Using type-checker error messages as PI for retry. Better than heuristic (-2.4pp) but still negative. Compiler feedback is insufficient to guide proof strategy.
Lean 1 seed
Lean Failure-Diagnosis PI -1.2pp
LLM-generated failure diagnosis as retry PI. Error messages don't help; the model cannot diagnose its own proof failures usefully.
Lean 1 seed
Lean Heuristic 3-Iter -3.6pp
Progressive degradation over 3 iterations of heuristic-guided retry on Lean. Each round compounds errors. Iterative retry is poison for Lean.
Lean 1 seed
Lean Skeleton+Retry -2.0pp
Combining skeleton PI with retry still yields negative results. Structural PI alone helps (+3.3pp) but adding retry on top degrades it. The two mechanisms interfere rather than compound.
Lean 1 seed

4-Level Mechanism Hierarchy

LEVEL 1 Failure-aware retry + binary filter + SFT +8-12pp
LEVEL 2 Frontier data selection (PCCG, more samples, no failure signal) +1-4pp
LEVEL 3 OPSD / continuous PI distillation (full data) +0.04pp
LEVEL 4 Graded verifier / partial credit (harmful) -8.7pp vs binary (+4.2 vs +12.9)

Multi-Seed Confidence Intervals

Condition | Seeds | Mean +/- Std | Individual
Gold answer STaR | 3 | +12.0 +/- 0.7pp | +12.1, +11.3, +12.7
Gibberish "XYZZY" | 3 | +10.8 +/- 1.7pp | +10.7, +12.5, +9.1
Wrong answers (shuffled) | 3 | +9.8 +/- 0.1pp | +9.7, +9.9, +9.9
"Try again" (no target) | 3 | +8.8 +/- 1.4pp | +10.1, +7.3, +9.1
OPSD answer-only (5-seed) | 5 | +0.04 +/- 0.9pp | -1.4, +0.4, +0.0, +1.0, +0.2
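
As a sanity check on the table above, the mean, standard deviation, and 95% confidence intervals can be recomputed from the per-seed gains. A minimal sketch using only the stdlib (the two-sided t critical value 4.303 for df = 2 is hard-coded; seed values are taken from the table):

```python
# Recompute mean +/- std from per-seed gains and check that the 95% CIs
# of the gold-answer and "try again" conditions overlap.
from math import sqrt
from statistics import mean, stdev

gold = [12.1, 11.3, 12.7]      # gold-answer STaR seeds (pp)
try_again = [10.1, 7.3, 9.1]   # "try again" seeds (pp)

def ci95(xs, t_crit=4.303):
    """95% CI for the mean; t_crit is the two-sided t value for df = n - 1 = 2."""
    half = t_crit * stdev(xs) / sqrt(len(xs))
    return (mean(xs) - half, mean(xs) + half)

lo_g, hi_g = ci95(gold)
lo_t, hi_t = ci95(try_again)
overlap = lo_g <= hi_t and lo_t <= hi_g  # intervals intersect
```

The gold-answer and "try again" intervals overlap, consistent with the claim that the retry conditions are statistically indistinguishable.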

Cross-Domain Summary

Domain | Model | Method | Gain | Note
Math (MATH-500) | Qwen3-1.7B | Retry+filter | +8.8pp | 3-seed confirmed
Code (HumanEval) | Qwen2.5-Coder-1.5B | Retry+filter | +23.2pp | 1 seed
Lean (MiniF2F) | Kimina-1.5B | Have-skeleton OPSD | +3.3pp | Genuine PI value
Lean (MiniF2F) | Kimina-1.5B | Retry (bare) | -2.4pp | Retry hurts Lean
Math (MATH-500) | Qwen3-8B | Retry+filter | +0.0pp | Default LR; resolved to +2.8pp at lr=1e-6
Code (MBPP) | Qwen2.5-Coder-1.5B | Retry+filter | +1.9pp | Baseline too low (5.8%)

Research Timeline: How Understanding Evolved

Early May 2026
OPSD answer-only: +5.6pp (single seed)
Initial excitement. PI distillation seems to work. Teacher conditioned on gold answer generates better solutions that transfer to student.
May 5-6
Lean have-skeleton OPSD: +3.3pp
First positive Lean result. Structural PI (proof decomposition) helps theorem proving. Validates multi-domain generality.
May 7
SD-Zero (gold-answer STaR): +12.0pp (3-seed)
Breakthrough. Self-revision with gold answer, filter correct, SFT. 2x better than OPSD. Reproducible across seeds (+12.1, +11.3, +12.7).
May 8
OPSD 5-seed null: +0.04pp mean
The experiment that changed everything. Five seeds of OPSD average near zero. The original +5.6pp was a lucky seed. PI distillation has no reliable effect.
May 8
STaR "Try Again": +8.8pp (3-seed)
The headline finding. Bare "try again" with NO information gives +8.8pp. PI content is irrelevant; the retry mechanism is all that matters.
May 8
Wrong answers (+9.8pp) and Gibberish (+10.8pp)
Confirmation battery. Wrong PI and nonsense PI give the same gains as correct PI. All conditions overlap at the 95% CI. The field has been wrong about why STaR works.
May 9
Double-Sample N=16: +1.4pp (7x gap)
The causal proof. 16 first-attempt solutions give 7x less gain than one retry. This is NOT more sampling; the failure signal shifts what the model generates.
May 9
Graded verifier: +4.2pp vs binary +12.9pp
Binary is 3x better. Partial credit dilutes signal. The optimal pipeline uses the harshest possible filter.
May 9
Code retry (HumanEval): +23.2pp
Cross-domain replication. Retry dominates code too. The mechanism is domain-general (where baseline competence is sufficient).
May 10
Lean retry: -2.4pp (retry HURTS)
The exception that proves the rule. In formal domains with narrow solution paths, retry cannot find alternative strategies. Structural PI genuinely matters for Lean.
May 10
Iterative retry: +14.5pp (2 rounds)
The pipeline compounds: round 1 +10.1pp, round 2 +4.4pp (total +14.5pp). Exceeds gold-answer ceiling (+12.0pp). Verifier quality enables iteration since each round trains on freshly-correct solutions.
May 10
8B LR sweep resolves scaling: lr=1e-6 gives +2.8pp
The 8B null was a hyperparameter artifact. The default lr=5e-7 is too low for 8B SFT; lr=1e-6 recovers signal. Retry works at scale with proper tuning.
May 10
Path diversity surprise: math 0.992, Lean 0.969
Contrary to hypothesis, Lean solution paths are almost as diverse as math. The domain boundary is NOT explained by path narrowness alone. New hypothesis: verifier strictness (type-checker vs answer match) forces different failure modes.
May 10-11
8B Full FT: +9.6pp (verified)
Full fine-tuning at 8B (36.8% -> 46.4%) matches 1.7B LoRA gains. The LoRA capacity constraint at scale is confirmed. Method works with sufficient parameters.
May 10-11
32B: +0.6pp (inverted-U right tail)
32B baseline (68.3%) is past the sweet spot. Too few failures to retry. Confirms inverted-U: retry is for struggling models (30-55% baseline), not already-strong ones.
May 10-11
Iterative round 3: -1.6pp regression
Optimal is 2 rounds (+14.5pp total). Third round overfits on diminishing-quality retry data as the model improves and has fewer failures to learn from.
May 10-11
Lean PI ablation: type-checker and diagnosis both -1.2pp
The decisive Lean experiments. Type-checker error messages and LLM-generated failure diagnosis as PI both improve on bare retry (-2.4pp) but remain negative. Lean heuristic 3-iter reaches -3.6pp. The domain boundary is real: no form of retry-based PI rescues theorem proving.

All Experiment Results (Sorted by Gain)

Method | Model | Domain | Gain | Seeds | Note
Code retry STaR | Qwen2.5-Coder-1.5B | HumanEval | +23.2pp | 1 | Retry dominates code
Iterative retry (2 rounds) | Qwen3-1.7B | MATH-500 | +14.5pp | 1 | Round 1 +10.1, round 2 +4.4
Iterative retry (3 rounds) | Qwen3-1.7B | MATH-500 | R3: -1.6pp | 1 | Round 3 regresses; optimal is 2
Binary STaR (retry) | Qwen3-1.7B | MATH-500 | +12.9pp | 1 | Binary filter optimal
Gold-answer STaR | Qwen3-1.7B | MATH-500 | +12.0pp | 3 | Ceiling condition
Gibberish STaR | Qwen3-1.7B | MATH-500 | +10.8pp | 3 | Nonsense = gold
Retry K=3 | Qwen3-1.7B | MATH-500 | +10.1pp | 1 | 3 retries per problem; optimal K
Wrong-answer STaR | Qwen3-1.7B | MATH-500 | +9.8pp | 3 | Wrong = correct PI
8B full fine-tune (verified) | Qwen3-8B | MATH-500 | +9.6pp | 1 | Full FT matches 1.7B LoRA (36.8% -> 46.4%)
Retry K=2 | Qwen3-1.7B | MATH-500 | +9.1pp | 1 | 2 retries per problem; +0.3pp over K=1
Test-suite PI (shuffled) | Qwen2.5-Coder-1.5B | HumanEval | +9.1pp | 1 | Wrong tests = real tests
"Try again" STaR | Qwen3-1.7B | MATH-500 | +8.8pp | 3 | No info needed
Retry K=5 | Qwen3-1.7B | MATH-500 | +8.5pp | 1 | 5 retries per problem; regression from K=3 (overfitting)
Cross-model hints (8B to 1.7B) | Qwen3-1.7B | MATH-500 | +5.8pp | 1 | No gold needed
Cross-model retry (8B for 1.7B) | Qwen3-1.7B | MATH-500 | +4.2pp | 1 | 8B solutions for 1.7B student; modest gain despite low retry rate
Graded verifier | Qwen3-1.7B | MATH-500 | +4.2pp | 1 | 3x worse than binary
OPSD PCCG (frontier) | Qwen3-1.7B | MATH-500 | +3.6pp | 1 | Best OPSD variant
First-step hint (3-seed) | Qwen3-1.7B | MATH-500 | +3.6pp | 3 | +3.6 +/- 1.3pp reproducible
Have-skeleton OPSD | Kimina-1.5B | MiniF2F | +3.3pp | 1 | Genuine PI for Lean
8B retry (lr=1e-6) | Qwen3-8B | MATH-500 | +2.8pp | 1 | Optimal LR for 8B scale
Temp-uniform N=8 | Qwen3-1.7B | MATH-500 | +2.6pp | 1 | No failure signal
Code OPSD (test results) | Qwen2.5-Coder-1.5B | HumanEval | +2.4pp | 1 | Weak vs retry +23pp
MBPP retry (1.5B) | Qwen2.5-Coder-1.5B | MBPP | +1.9pp | 1 | Baseline too low (5.8%)
Double-sample N=16 | Qwen3-1.7B | MATH-500 | +1.4pp | 1 | 7x worse than retry
IPRS (PI reward shaping) | Qwen3-1.7B | MATH-500 | +1.4pp | 1 | RL marginal vs retry+SFT
Frontier-rejection N=16 | Qwen3-1.7B | MATH-500 | +1.2pp | 1 | Frontier targeting = nothing
8B retry (lr=2e-6) | Qwen3-8B | MATH-500 | +1.0pp | 1 | Overshoot LR for 8B
32B retry (LoRA) | Qwen3-32B | MATH-500 | +0.6pp | 1 | Right tail of inverted-U (baseline 68.3%)
OPSD (5-seed mean) | Qwen3-1.7B | MATH-500 | +0.04pp | 5 | NULL confirmed
8B retry (lr=5e-7) | Qwen3-8B | MATH-500 | +0.0pp | 1 | LR too low for 8B SFT
MBPP retry (7B) | Qwen2.5-Coder-7B | MBPP | -0.4pp | 1 | Baseline 66.5%, no gain
Compute-matched (2x) | Qwen3-1.7B | MATH-500 | -0.6pp | 1 | More compute hurts
Lean type-checker PI | Kimina-1.5B | MiniF2F | -1.2pp | 1 | Compiler errors as PI; better than bare retry but still negative
Lean failure-diagnosis PI | Kimina-1.5B | MiniF2F | -1.2pp | 1 | LLM-generated diagnosis; error messages don't help
Lean skeleton+retry | Kimina-1.5B | MiniF2F | -2.0pp | 1 | Skeleton PI + retry still negative; structural PI does not rescue retry
Rejection sampling | Qwen3-1.7B | MATH-500 | -2.0pp | 1 | Self-SFT hurts
Lean retry (bare) | Kimina-1.5B | MiniF2F | -2.4pp | 1 | Retry hurts Lean
Lean heuristic 3-iter | Kimina-1.5B | MiniF2F | -3.6pp | 1 | Progressive degradation; iterative retry is poison for Lean

Path diversity (diagnostic): math 0.992, Lean 0.969. Surprisingly similar; path narrowness alone does not explain the Lean failure.

The Mechanism: Failure-Aware Distribution Shift

The double-sample control (+1.4pp) vs retry (+10.1pp) proves that the failure signal genuinely shifts what the model generates. It is not more sampling; it is qualitatively different reasoning.

WITH failure signal: +8.8 to +12.0pp (model activates alternative reasoning paths)
WITHOUT failure signal: +1.2 to +2.6pp (just more samples from the same distribution)

What Does NOT Matter (PI Content Ablations)

What teacher sees | Gain | Interpretation
Gold answer (correct) | +12.0pp | Ceiling (slightly helps target correct approach)
Gibberish "XYZZY" | +10.8pp | Nonsense = same as gold
Wrong answer (shuffled) | +9.8pp | Wrong = same as correct
Nothing ("try again") | +8.8pp | No info needed at all
All conditions overlap at 95% CI. The CONTENT of the hint is irrelevant.

What DOES Matter

Factor | Evidence | Effect
Failure notification | Double-sample +1.4 vs retry +10.1 | 7x multiplier
Binary correctness filter | Binary +12.9 vs graded +4.2 | 3x multiplier
Frontier data selection | PCCG +3.6 vs full-data +0.04 | 90x multiplier
Model competence (baseline) | MBPP 5.8% +1.9pp vs HumanEval 54% +23pp | Threshold effect
Solution path diversity | Math retry +8.8pp vs Lean retry -2.4pp | Domain boundary
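
The binary-filter factor is simple to state in code. A minimal sketch (not the project's implementation) contrasting binary pass/fail with a hypothetical graded partial-credit score, and the binary filtering rule the experiments found optimal:

```python
# Binary vs graded verification. The experiments found that only the binary
# rule (exact pass/fail) produces a clean training signal; the graded score
# admits near-miss solutions that dilute it.

def binary_verifier(predicted: str, gold: str) -> float:
    """Pass/fail: 1.0 on exact final-answer match, else 0.0."""
    return 1.0 if predicted.strip() == gold.strip() else 0.0

def graded_verifier(steps_correct: int, steps_total: int) -> float:
    """Hypothetical partial credit: fraction of solution steps judged correct."""
    return steps_correct / steps_total

# Binary filtering keeps only fully correct solutions for SFT:
solutions = [("x = 4", "4"), ("4", "4"), ("5", "4")]
kept = [s for s, gold in solutions if binary_verifier(s, gold) == 1.0]
```

Note that the binary rule discards "x = 4" even though it is arguably a near miss; per the graded-verifier result, that harshness is the point.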

Per-Level Analysis: Gains Concentrate at the Frontier

OPSD gains are concentrated at Level 3-4 (the frontier band, 36-46% baseline solve rate). Levels that are too easy or too hard show minimal/no improvement. This is the inverted-U operating AT THE PROBLEM LEVEL.

Level | Base | OPSD | Delta | Interpretation
1 (easy) | 69.8% | 72.1% | +2.3pp | Already solved; ceiling effect
2 | 60.0% | 60.0% | +0.0pp | High baseline, nothing to learn
3 | 46.1% | 52.0% | +5.9pp | Frontier band (peak gain)
4 | 36.2% | 40.9% | +4.7pp | Frontier band (strong gain)
5 (hard) | 18.7% | 17.9% | -0.7pp | Too hard; retries also fail

Key insight: Gains concentrate at the frontier (30-50% solve rate per level). The inverted-U is not just a model-level phenomenon; it operates per-problem within a single model.

Three-Step Mechanistic Explanation

  1. Suppresses the dominant mode. The model's first attempt reflects its highest-probability solution path. "You failed" tells it that path is wrong, forcing exploration of lower-probability alternatives.
  2. Activates self-correction heuristics. Pre-training on human text includes patterns of "I made an error, let me reconsider..." that retry prompts activate. These encode approach-switching and deeper checking.
  3. Contracts the search space. Rather than re-exploring the full solution space (as N=16 i.i.d. does), the model eliminates its default approach, concentrating on alternatives. This is more efficient than blind resampling.
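
All three steps hinge on how the retry prompt exposes the failure. An illustrative prompt builder (the wording and names are hypothetical; per the ablations, the PI slot's content is irrelevant and can be left empty):

```python
# Failure-aware retry prompt. The PI slot can hold the gold answer, a wrong
# answer, gibberish, or nothing; only the failure notification matters.

def build_retry_prompt(problem, failed_attempt, pi_hint=None):
    parts = [
        f"Problem: {problem}",
        f"Your previous attempt was incorrect:\n{failed_attempt}",
    ]
    if pi_hint is not None:  # content of the hint is irrelevant per the ablations
        parts.append(f"Hint: {pi_hint}")
    parts.append("Try again carefully.")
    return "\n\n".join(parts)

# Gibberish PI condition, as in the "XYZZY" ablation:
p = build_retry_prompt("Compute 2+2.", "5", pi_hint="XYZZY")
```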

Lean: Complete 5-Method Hierarchy (MiniF2F, Kimina-1.5B)

Lean theorem proving is the domain exception. The complete hierarchy shows that ONLY structural PI (have-skeleton OPSD) produces positive gains. All retry-based methods are negative, and combining skeleton with retry interferes rather than compounds.

#1 Have-skeleton OPSD (structural PI, no retry) +3.3pp
#2 Type-checker PI (compiler errors as retry hint) -1.2pp
#3 Failure-diagnosis PI (LLM-generated error analysis) -1.2pp
#4 Skeleton+Retry (structural PI combined with retry) -2.0pp
#5 Bare retry ("try again", no PI) -2.4pp

Key insight: Skeleton+retry (-2.0pp) is WORSE than skeleton alone (+3.3pp). The retry mechanism actively destroys the benefit of structural PI. In constrained domains, retry forces the model to abandon its skeleton-guided strategy, exploring alternatives that violate the type-checker. The mechanisms are antagonistic, not additive.

Worst case: Lean heuristic 3-iter reaches -3.6pp (progressive degradation). Iterative retry is poison for formal domains.

Unified Theory: Failure as Privileged Information

Retry IS PI distillation where the privileged information is one bit: "you failed." This suppresses the model's dominant (incorrect) solution mode, activates alternative reasoning paths, and SFT on successful retries teaches the model to use failure-aware reasoning as its default first-attempt behavior. At test time, the model produces "second-attempt quality" solutions without needing the failure prompt.

Predictions and Status

PI content irrelevant — Wrong answers = correct answers = gibberish (confirmed 3-seed)
Failure signal is causal — Double-sample (no failure) gives 7x less gain
Binary filter optimal — Graded verifier 3x worse than binary
Simple beats complex — Static > dynamic, bare retry > scheduled temps
Lean: PI genuinely needed — Retry hurts (-2.4pp), skeleton helps (+3.3pp). Structural PI is real for constrained domains.
Cross-domain generality — Math +8.8pp, Code +23.2pp. Retry works wherever baseline competence exists.
8B scaling (RESOLVED) — Confirmed hyperparameter issue. lr=5e-7 +0.0pp, lr=1e-6 +2.8pp, lr=2e-6 +1.0pp. Optimal LR shifts with scale; retry works at 8B with proper tuning.
Verifier quality as binding constraint (PARTIALLY RESOLVED) — Verifier strictness and solution-path narrowness both matter; neither alone explains the domain boundary. Iterative retry compounds to 2 rounds (+14.5pp) but round 3 regresses (-1.6pp). Lean type-checker PI (-1.2pp) and failure-diagnosis PI (-1.2pp) are better than bare retry (-2.4pp) but still negative.
Path diversity not the full story (RESOLVED) — Confirmed: path diversity alone does not explain domain boundary. Lean type-checker (-1.2pp) and failure-diagnosis (-1.2pp) show that even GOOD verifier feedback does not rescue retry for Lean. The issue is compound: narrow paths + high verification bar + inability to explore alternative proof strategies. Heuristic 3-iter (-3.6pp) shows progressive degradation, confirming the mechanism corrupts rather than improves in constrained domains.

Practitioner Recipe

  1. Generate first-attempt solutions (N=1 per problem)
  2. Verify against oracle (answer match, type-checker, test suite)
  3. Retry failures with "try again carefully" (any prompt works)
  4. Filter to ONLY correct retries (binary, no partial credit)
  5. SFT on (problem, correct_retry) pairs (LoRA, 500 steps)
  6. Iterate if desired (diminishing returns expected)

Expected gain: +8-12pp on math at 1.7B scale and up to +23pp on code, provided the baseline sits in the roughly 30-55% sweet spot. Exception: formal verification (Lean) needs structural PI.
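
The recipe can be sketched end to end as follows. This is a minimal illustration, not the authors' pipeline: `generate` and `verify` are hypothetical stand-ins for the model call and the oracle (answer match, type-checker, or test suite), and the SFT step itself is omitted.

```python
# Steps 1-5 of the recipe: first attempt, oracle check, failure-aware retry,
# binary filter, and collection of (problem, correct_retry) pairs for SFT.

def retry_filter_sft_data(problems, generate, verify):
    """Collect (problem, correct_retry) pairs for SFT."""
    sft_pairs = []
    for problem in problems:
        first = generate(problem)                  # step 1: one first attempt
        if verify(first, problem):                 # step 2: verify against oracle
            continue                               # already correct: nothing to retry
        retry_prompt = problem + "\nYour previous attempt failed. Try again carefully."
        retry = generate(retry_prompt)             # step 3: failure-aware retry
        if verify(retry, problem):                 # step 4: binary filter, no partial credit
            sft_pairs.append((problem, retry))     # step 5: train only on correct retries
    return sft_pairs

# Toy usage with stubs: the first attempt is wrong, the retry is correct.
answers = {"2+2": "4"}
attempts = iter(["5", "4"])
pairs = retry_filter_sft_data(
    ["2+2"],
    generate=lambda prompt: next(attempts),
    verify=lambda solution, problem: solution == answers[problem],
)
```

Step 6 (iteration) would rerun this loop on the fine-tuned model, with the diminishing returns noted above.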

The Inverted-U: Baseline Competence vs Gain

Retry gains are maximized when the model has moderate baseline competence (roughly 30-55%). Too easy (nothing to retry) or too hard (retries also fail) both yield zero gain.

Model | Domain | Baseline | Retry Gain | Position on Curve
Qwen2.5-Coder-1.5B | HumanEval | 54.2% | +23.2pp | Sweet spot
Qwen3-1.7B | MATH-500 | 39.7% | +8.8pp | Sweet spot
Qwen3-8B (LoRA) | MATH-500 | 35.5% | +2.8pp | Needs LR tuning
Qwen3-8B (full FT) | MATH-500 | 36.8% | +9.6pp | Full FT recovers 1.7B gains (verified: 46.4%)
Qwen3-32B | MATH-500 | 68.3% | +0.6pp | Right tail (baseline too high)
Qwen2.5-Coder-1.5B | MBPP | 5.8% | +1.9pp | Left tail (too hard)
Qwen2.5-Coder-7B | MBPP | 66.5% | -0.4pp | Right tail (too easy)
Kimina-1.5B | MiniF2F | 39.3% | -2.4pp | Domain exception (Lean)

The sweet spot appears to be 30-55% baseline accuracy for math/code domains. Below ~10%, the model cannot produce correct retries. Above ~65%, there is nothing left to learn from retry.
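
These thresholds can be written down directly. A toy classifier using the bands quoted in this section (~10% floor, 30-55% sweet spot, ~65% ceiling); the intermediate "marginal band" label is an assumption of this sketch, since the source only names the tails and the sweet spot:

```python
# Position a model on the inverted-U from its first-attempt accuracy,
# using the thresholds quoted in the surrounding text.

def curve_position(baseline):
    """baseline: first-attempt accuracy in [0, 1]."""
    if baseline < 0.10:
        return "left tail (too hard)"
    if baseline > 0.65:
        return "right tail (too easy)"
    if 0.30 <= baseline <= 0.55:
        return "sweet spot"
    return "marginal band"  # assumed label for the in-between regions

positions = {m: curve_position(b) for m, b in [
    ("Qwen3-1.7B / MATH-500", 0.397),
    ("Qwen2.5-Coder-1.5B / MBPP", 0.058),
    ("Qwen3-32B / MATH-500", 0.683),
]}
```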

8B LR Sweep Results

The 8B null result was a hyperparameter artifact. The optimal LR shifts upward with model scale, likely because LoRA updates need larger steps to overcome the model's stronger priors.

lr = 5e-7 | +0.0pp | Too low (LoRA frozen)
lr = 1e-6 | +2.8pp | Optimal for 8B
lr = 2e-6 | +1.0pp | Overshoot
Scale | Optimal LR | Best Gain | Retry Rate | Note
1.7B | 5e-7 (default) | +8.8pp | ~35% | LoRA r=32 is sufficient
8B (LoRA) | 1e-6 | +2.8pp | ~28% | LoRA r=64, 2x LR needed
8B (full FT) | 1e-6 | +9.6pp | ~28% | Full FT recovers full gains (verified: 36.8% -> 46.4%)
32B (LoRA) | 2e-6 | +0.6pp | ~12% | Baseline 68.3% too high; inverted-U right tail

32B Result: Inverted-U Confirmed

The 32B result (+0.6pp) confirms the inverted-U hypothesis but for a different reason than expected. The issue is NOT LoRA capacity or LR tuning; it is that 32B's baseline (68.3%) is already past the sweet spot. There is nothing left to learn from retry at this performance level.

8B Full FT (CONFIRMED)
Full fine-tuning at 8B recovers 1.7B-level gains. LoRA capacity WAS the constraint. Method works at scale when given enough parameters.
+9.6pp (verified)
32B (inverted-U right tail)
68.3% baseline means most problems are already solved on first attempt. Few failures to retry, few opportunities to learn. The method hits ceiling from above.
+0.6pp (right tail)

Implication: Retry is a method for STRUGGLING models (30-55% baseline). At 32B, the model is too good for retry to help on MATH-500. It might still help on harder benchmarks (AIME, Putnam-level) where the 32B baseline is lower.

Retry-Count Inverted-U (K sweep)

Increasing retries per problem shows diminishing and then negative returns. K=3 is optimal; K=5 regresses, likely due to overfitting on low-quality late retries.

Retries (K) | Gain | Delta vs K-1 | Interpretation
K=1 | +8.8pp | -- | Baseline retry (single attempt after failure)
K=2 | +9.1pp | +0.3pp | Slight improvement from extra attempt
K=3 (optimal) | +10.1pp | +1.0pp | Peak gain; best cost/benefit tradeoff
K=5 | +8.5pp | -1.6pp | Regression; late retries are low-quality and dilute training signal

Key insight: Retry count has its own inverted-U. More retries initially help (more chances to find correct solutions) but past K=3, late-retry solutions are low quality (the model has exhausted its good alternative strategies) and training on them dilutes the signal.
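
The sweep suggests a retry loop that stops at the first verified solution rather than exhausting all K attempts. A minimal sketch with hypothetical `generate`/`verify` stand-ins (not real APIs):

```python
# K-retry loop with early stopping. Per the sweep, K=3 was the observed
# optimum; later retries tend to be low quality and dilute the SFT signal.

def retry_up_to_k(problem, generate, verify, k=3):
    """Return (first verified retry, attempt index), or (None, k) on failure."""
    prompt = problem
    for attempt in range(1, k + 1):
        prompt += "\nThat was incorrect. Try again carefully."
        candidate = generate(prompt)
        if verify(candidate, problem):
            return candidate, attempt  # stop early on first verified solution
    return None, k

# Stub that fails once, then succeeds on the second attempt:
outputs = iter(["wrong", "right"])
result, attempts = retry_up_to_k(
    "prove it",
    generate=lambda p: next(outputs),
    verify=lambda c, p: c == "right",
    k=3,
)
```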

Currently Running (3 jobs, Day 3 steady-state 2026-05-11)

Job | Category | Status | Expected | Notes
ughai-opsd-skeleton-s123 | Lean multi-seed | running | ~4h | Have-skeleton 3-seed confirmation (seed 123)
ughai-opsd-skeleton-s456 | Lean multi-seed | running | ~4h | Have-skeleton 3-seed confirmation (seed 456)
ughai-sd-zero-opsd-hybrid | Mechanism | ~44h in | ~48h | SD-Zero + OPSD combination (finishing)

Cluster: CMH (us-east-2), p5en-queue. Completed Day 3: 8B full FT (+9.6pp verified), 32B retry (+0.6pp), iterative R3 (-1.6pp), Lean type-checker PI (-1.2pp), Lean failure-diagnosis (-1.2pp), Lean heuristic 3-iter (-3.6pp), HAR math, skeleton-guided retry, skeleton+retry (-2.0pp).