3-day research sprint (May 8-10, 2026) — 60+ experiments, 4 domains, 3 model scales
We set out to discover whether Privileged Information (PI) from a teacher/oracle can accelerate student LLM training. The answer: the content of the information does not matter. What matters is the one-bit failure signal ("you got it wrong, try again") combined with a binary verifier that filters correct retries into SFT data. This simple pipeline yields +8-12pp on math and +23pp on code, and the gains compound across iterations.
Question: Can teacher-provided hints, answers, or structured guidance (Privileged Information) help a student LLM learn faster than it could from its own trial-and-error?
Why it matters: If PI content drives learning, we need expensive oracle systems (strong models, proof assistants, execution environments) to generate high-quality guidance. If it does not, the self-improvement recipe simplifies dramatically: just a verifier and a retry prompt.
Setup: Qwen3-1.7B/8B on MATH-500 (primary), Qwen2.5-Coder-1.5B/7B on HumanEval/MBPP, Kimina-1.5B on miniF2F (Lean). We compare gold-answer STaR, gibberish-target STaR, wrong-answer STaR, and bare "try again" STaR, all with 3-seed confidence intervals.
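To make the recipe concrete, here is a minimal sketch of one retry-and-filter round in Python. Every name here (`model.generate`, `verify`, `prob.prompt`, the retry wording) is a hypothetical stand-in for whatever sampling and checking infrastructure you already have, not the exact experimental harness.

```python
def retry_star_round(model, problems, verify, k_retries=8):
    """One round of retry-STaR: harvest verified retries on failed
    problems as SFT data. All helper names are hypothetical."""
    sft_data = []
    for prob in problems:
        first = model.generate(prob.prompt)  # first attempt
        if verify(prob, first):
            continue  # already solved; nothing to harvest
        # The one-bit failure signal: no hint, no answer, just "wrong".
        retry_prompt = f"{prob.prompt}\n{first}\nYou got it wrong, try again.\n"
        for _ in range(k_retries):
            retry = model.generate(retry_prompt)
            if verify(prob, retry):  # binary verifier as the only filter
                sft_data.append((prob.prompt, retry))
                break
    return sft_data  # SFT on (prompt, verified solution); then iterate
```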
Four PI conditions, all statistically indistinguishable (Qwen3-1.7B, MATH-500, 500 steps, 3 seeds each):
| Condition | PI Given to Student | Gain (pp) | 95% CI | Interpretation |
|---|---|---|---|---|
| SD-Zero (gold answer) | Correct final answer | +12.0 | ± 0.7 | Ceiling (oracle) |
| Gibberish "XYZZY" | Nonsense string | +10.8 | ± 1.7 | Same as gold |
| Wrong answers | Shuffled (incorrect) answers | +9.8 | ± 0.1 | Tightest CI; PI irrelevant |
| Bare "try again" | Nothing (just retry prompt) | +8.8 | ± 1.4 | No information needed |
The CIs of adjacent conditions overlap, and the ordering (gold > gibberish > wrong > bare) is not statistically significant at 3 seeds: the ~3pp gold-to-bare spread is comparable to within-condition variance.
The critical control experiment: instead of retrying failed problems, we generate N=16 independent first-attempt solutions per problem and filter the correct ones into SFT data. This matches compute but removes the failure signal.
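Reusing the hypothetical helpers from the sketch above, the control swaps the failure-conditioned retry loop for independent first attempts at the same sampling budget:

```python
def matched_compute_control(model, problems, verify, n_samples=16):
    """Matched-compute control: N independent first attempts, no
    failure signal, same binary verifier and filtering."""
    sft_data = []
    for prob in problems:
        for _ in range(n_samples):
            attempt = model.generate(prob.prompt)  # no retry framing
            if verify(prob, attempt):
                sft_data.append((prob.prompt, attempt))
    return sft_data  # same verifier, same budget; only +1.2pp
```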
Interpretation: When the model knows it failed, its retry distribution shifts toward novel strategies it would not explore from scratch. This is not just "more samples" or "sample diversity"; it is a qualitatively different generative mode triggered by awareness of failure. The failure signal concentrates probability mass on unexplored solution paths.
| Control | Result | What It Rules Out |
|---|---|---|
| Frontier-rejection N=16 | +1.2pp | Not just "harder problems" (same frontier, no failure signal) |
| Temperature schedule (no failure) | +1.2pp | Not diversity alone (varied temps without failure awareness) |
| Uniform retry T=0.7 x8 | +2.6pp | Not repeated sampling (8 tries without failure framing is still ~5x worse) |
| Graded verifier (partial credit) | +4.2pp | Binary filter is better; near-miss solutions dilute signal |
| Binary filter (standard retry) | +12.9pp | Binary pass/fail is a feature, not a limitation |
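The last two rows compare filtering policies. A sketch of the difference, with `verify` and `score` as hypothetical verifier interfaces and an illustrative threshold:

```python
def binary_filter(candidates, verify):
    # Keep only solutions the verifier accepts outright (+12.9pp above).
    return [c for c in candidates if verify(c)]

def graded_filter(candidates, score, threshold=0.7):
    # Partial credit (+4.2pp above): near-miss solutions clear the
    # threshold and dilute the SFT signal with subtly wrong steps.
    return [c for c in candidates if score(c) >= threshold]
```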
Math: +8.8pp (1.7B); code: +23.2pp (HumanEval). These domains have high path diversity: many distinct correct solutions exist, and the model can explore them after failure.
Lean theorem proving: retry gives -2.4pp (heuristic verifier) or -1.2pp (type-checker). Iterating makes it worse: 3 rounds reach -3.6pp. Even compiler error diagnostics do not help.
What works instead: OPSD with have-skeleton PI (subgoal decomposition) gives +3.3pp. Lean has low path diversity (proofs are structurally constrained) and strict verification. The model cannot explore novel paths by retrying; it needs structural scaffolding to know where to go.
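To make "have-skeleton PI" concrete, here is a hypothetical Lean 4 illustration: the oracle supplies the `have` subgoal structure, and the student's job reduces to filling each `sorry`.

```lean
-- Hypothetical have-skeleton: the PI is the subgoal decomposition,
-- not the proof terms. The student replaces each `sorry`.
theorem demo (m n : Nat) : m + n = n + m := by
  have h0 : m + 0 = 0 + m := by sorry
  have hs : ∀ k, m + k = k + m → m + (k + 1) = (k + 1) + m := by sorry
  sorry
```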
Takeaway: Retry works when (1) path diversity is high and (2) the verifier is reliable. Lean violates both: narrow paths + noisy heuristic verifiers compound errors.
Retry gain depends on baseline competence: a model that is too weak (cannot solve anything even on retry) or too strong (nothing left to solve) sees zero gain. The sweet spot is a 30-55% baseline pass rate.
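Expressed as a rule of thumb (thresholds taken from the sweep above; illustrative, not calibrated):

```python
def retry_regime(baseline_pass_rate: float) -> str:
    """Where retry-SFT is expected to pay off, per the sweep above."""
    if baseline_pass_rate < 0.30:
        return "too weak: retries rarely succeed, so the filter keeps almost nothing"
    if baseline_pass_rate > 0.55:
        return "too strong: few failures remain to convert into training data"
    return "sweet spot: expect large retry gains"
```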
LoRA at 8B is nearly inert (+0.0 to +2.8pp depending on LR). The rank-limited adapter cannot shift the model's distribution sufficiently. Full fine-tuning restores the effect (~+11.8pp provisional), confirming this is a plasticity bottleneck, not a fundamental scaling limit.
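For reference, a sketch of the two regimes, assuming a Hugging Face transformers + peft stack; the model id, rank, and target modules are illustrative, not the exact experimental config.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# LoRA regime: a rank-limited adapter on the attention projections.
# At 8B this was nearly inert in our runs (+0.0 to +2.8pp).
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(base, lora_cfg)  # trains <1% of parameters

# Full fine-tuning regime: every weight trainable (the default);
# this restores the retry effect (~+11.8pp provisional).
full_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
```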
The 32B result (+0.6pp) is competence saturation (baseline 68.3%), NOT a scaling ceiling. A 32B model on a harder benchmark where its baseline is 30-55% would likely show large gains.