Graded Verifier vs Binary Filter

Partial credit (a graded verifier with weighted SFT) yields +4.2pp, while strict binary filtering yields +12.9pp: roughly a 3x gap in favor of the binary filter. Near-miss solutions actively dilute the training signal.

+4.2pp (graded) vs +12.9pp (binary) = 3x gap
MATH PAPER-CRITICAL

Hypothesis

Intuition suggests that "near-miss" solutions (those that get most steps right but fail at the final answer) should be valuable training data. A graded verifier that assigns partial credit should outperform a harsh binary filter that discards everything except perfectly correct solutions.

Expected: Graded >= Binary (more data should help, partial credit captures useful reasoning).

Actual: Binary is 3x better. Partial credit is noise, not signal.

| Condition | Delta | Training data kept |
|---|---|---|
| Binary filter (pass/fail only) | +12.9pp | Only verified-correct solutions |
| Graded verifier (partial credit) | +4.2pp | Near-misses included with weights |

Method

Binary Filter (standard STaR)

  1. Generate retry solutions for failed problems
  2. Check final answer against gold: PASS or FAIL (no middle ground)
  3. Keep only PASS solutions
  4. SFT with uniform weight on all kept solutions
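
The binary path above can be sketched as a small filtering function. This is illustrative only: `extract_final_answer` and the record format are hypothetical stand-ins, not the experiment's actual code.

```python
# Sketch of STaR-style binary filtering. The answer-extraction
# heuristic and record fields here are illustrative assumptions.

def extract_final_answer(solution: str) -> str:
    # Hypothetical heuristic: take the text after the last '='.
    return solution.rsplit("=", 1)[-1].strip()

def binary_filter(retries, gold_answers):
    """Keep only retries whose final answer exactly matches gold.

    retries: list of (problem_id, solution_text)
    gold_answers: dict mapping problem_id -> gold final answer
    Returns uniform-weight SFT examples (weight is always 1.0).
    """
    kept = []
    for pid, sol in retries:
        if extract_final_answer(sol) == gold_answers[pid]:  # PASS
            kept.append({"problem_id": pid, "solution": sol, "weight": 1.0})
        # FAIL solutions are discarded entirely: no middle ground.
    return kept
```

Because every kept example has weight 1.0, the downstream SFT step needs no per-example weighting at all.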

Graded Verifier (partial credit)

  1. Generate retry solutions for failed problems
  2. Score each solution on a 0-1 scale:
     (a) final answer correct = 1.0
     (b) intermediate steps correct but final wrong = 0.5-0.8
     (c) approach correct but execution wrong = 0.3-0.5
     (d) completely wrong = 0.0
  3. Keep solutions scoring above 0.3 (includes near-misses)
  4. SFT with loss weighted by the score (higher-scoring solutions get more weight)
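
The graded path can be sketched the same way: score each retry, keep scores above 0.3, and carry the score forward as a per-example loss weight. The concrete scores below (0.65, 0.4) are band midpoints chosen for illustration; the experiment used its own heuristic partial-credit scorer.

```python
# Sketch of the graded-verifier path. Scores and the weighted-loss
# helper are illustrative assumptions, not the experiment's code.

def grade_solution(final_correct: bool, steps_correct: bool,
                   approach_correct: bool) -> float:
    """Map the rubric to a single score in [0, 1]."""
    if final_correct:
        return 1.0
    if steps_correct:
        return 0.65   # midpoint of the 0.5-0.8 near-miss band
    if approach_correct:
        return 0.4    # midpoint of the 0.3-0.5 band
    return 0.0

def graded_filter(scored_retries, threshold=0.3):
    """scored_retries: list of (solution_text, score). Keep score > threshold."""
    return [{"solution": sol, "weight": score}
            for sol, score in scored_retries if score > threshold]

def weighted_sft_loss(per_example_nll, weights):
    """Weighted mean of per-example losses; binary filtering is the
    special case where every kept weight is 1.0."""
    total_w = sum(weights)
    return sum(l * w for l, w in zip(per_example_nll, weights)) / total_w
```

Note that a near-miss with score 0.65 still contributes about two thirds as much gradient as a fully correct solution, which is exactly how its subtle errors enter the model.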

Configuration

| Setting | Value |
|---|---|
| Model | Qwen3-1.7B |
| Dataset | NuminaMath-CoT-10k |
| Eval benchmark | MATH-500 (pass@1) |
| Training steps | 500 |
| Learning rate | 2e-5 |
| LoRA rank | 16 |
| Seed | 42 |
| Scoring | Heuristic partial credit |
| Hardware | 1x H200 (p5en.48xl) |
| Runtime | ~3h each condition |

Results

| Condition | Training examples | Baseline | Post-training | Delta |
|---|---|---|---|---|
| Binary filter | ~3,200 (only correct) | 40.1% | 53.0% | +12.9pp |
| Graded verifier | ~5,500 (includes near-miss) | 40.1% | 44.3% | +4.2pp |

Gap: the graded verifier is 3x worse.
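
A quick arithmetic check of the headline numbers, as a sketch:

```python
# Sanity-check the deltas, the "3x" gap, and the "72% more data" claim
# directly from the table's numbers.
binary_delta = 53.0 - 40.1   # post-training minus baseline
graded_delta = 44.3 - 40.1
gap = binary_delta / graded_delta          # ~3.1, i.e. the "3x" gap
extra_data = 5500 / 3200 - 1               # ~0.72, i.e. 72% more examples
print(round(binary_delta, 1), round(graded_delta, 1),
      round(gap, 1), round(extra_data, 2))
```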

Why Partial Credit Hurts

More data can hurt. The graded verifier produces 72% more training examples but 3x worse results. Near-miss solutions contain subtle errors that propagate during SFT: the model learns incorrect reasoning patterns from "almost right" solutions. Binary filtering is a FEATURE: it ensures every training example is verified correct, preventing noise injection.

Training Curves

Binary: loss drops cleanly, eval improves steadily. Graded: loss drops faster (more data) but eval improvement stalls after step 200. The model fits the mixed-quality data but does not generalize as well.

Logs at: /data/ughai-sandbox/opsd_experiments/graded_vs_binary/

Interpretation

This result has a clear practical implication for any self-improvement pipeline: always use the harshest possible filter, keeping only verified-correct solutions rather than assigning partial credit to near-misses.

This is Level 4 of the mechanism hierarchy: graded verification is actively harmful compared to binary filtering.

Connection to Other Experiments

Gold-Answer STaR (+12.0pp, 3-seed mean) - confirms binary is optimal
Gold STaR uses a binary filter (correct/incorrect verification). Its strong results confirm: the filter, not the PI, drives quality.

Rejection Sampling (-2.0pp) - extreme case of dilution
Training on ALL correct solutions (even easy ones) hurts. The graded verifier is a milder version of the same problem: including low-quality examples dilutes the training signal.

4-Level Mechanism Hierarchy - Level 4 confirmed
Graded verification is the bottom level: it introduces noise that binary filtering avoids. The hierarchy is now fully validated: Level 1 (retry+binary) >> Level 2 (more samples) >> Level 3 (OPSD) >> Level 4 (graded).