Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Wenjie Ma1, Andrei Cojocaru1, Neel Kolhe1, Bradley Louie1, Robin Said Sharif1, Haihan Zhang3, Vincent Zhuang2, Matei Zaharia1, Sewon Min1,4
1UC Berkeley  ·  2Google DeepMind  ·  3Peking University  ·  4Allen Institute for AI

Abstract

Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0–7 scale to model-generated math proofs. We introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Our analysis yields ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores. Finally, we demonstrate its practical utility in a best-of-n selection task: at n=16, ProofGrader achieves an average score of 4.14/7, closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62).
0.93 Mean Absolute Error · 78% Gap Closed · 145 Problems · 435 Annotations

Key Results

Our systematic exploration of the evaluator design space reveals clear trends and produces a high-performing evaluator. Below, we highlight key findings from our experiments and demonstrate ProofGrader's practical utility.

Impact of Design Choices

The left chart shows how different context configurations affect evaluator performance. Using both marking scheme (MS) and reference solutions (Ref) achieves the lowest MAE of 0.964, while providing no context results in poor performance (MAE of 1.680). The right chart demonstrates ProofGrader's practical value: as we increase the number of candidate solutions from 1 to 16, ProofGrader consistently selects high-quality proofs, closely tracking the human oracle and far outperforming binary and baseline methods.

Impact of Context on Evaluator Quality

Lower MAE is better. Each bar represents a different context configuration: MS+Ref (marking scheme + reference solutions, MAE=0.964) performs best; MS only (marking scheme only, MAE=1.069) is second; Ref only (reference only, MAE=1.330) is third; None (no context, MAE=1.680) performs worst. Adding context dramatically improves evaluator quality.
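The MAE reported in these comparisons can be computed directly from paired score lists. The sketch below is a minimal illustration; the scores and configuration names are made up for the example, not taken from the dataset.

```python
# Minimal sketch: MAE between evaluator predictions and expert scores
# on the 0-7 scale. All numbers below are toy values for illustration.
from statistics import mean

def mae(predicted, expert):
    """Mean absolute error between two equal-length score lists."""
    assert len(predicted) == len(expert)
    return mean(abs(p - e) for p, e in zip(predicted, expert))

expert_scores = [7, 3, 0, 5, 2]
preds_ms_ref  = [7, 4, 1, 5, 2]   # evaluator given marking scheme + references
preds_none    = [5, 6, 2, 7, 0]   # evaluator given no context

print(mae(preds_ms_ref, expert_scores))  # lower is better
print(mae(preds_none, expert_scores))
```

In this toy example the context-rich configuration lands much closer to the expert scores, mirroring the trend in the chart above.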

Best-of-N Selection Performance
[Line chart: average expert score of the selected proof vs. n (1 to 16 candidates), comparing Human Oracle, ProofGrader, Fine-grained (Ref+MS), Fine-grained (Ref), Fine-grained (None), Binary (Ref), Binary (None), and a Baseline.]

Closes 78% of gap to human oracle at n=16

Dataset Insights

The charts below show how different state-of-the-art models perform on ProofBench and the structure of our dataset. Even the best model (OpenAI o3) achieves a mean score of only 2.87/7, highlighting substantial room for improvement in proof generation. Our dataset provides a balanced mix of problems across competitions and comprehensive coverage through multiple model generators.

Generator Performance on ProofBench
[Bar chart of mean scores on the 0–7 scale: o3 2.87, Gemini 2.53, R1 1.83]

OpenAI o3 leads with a mean score of 2.87/7, followed by Gemini-2.5-Pro (2.53) and DeepSeek-R1-0528 (1.83)

ProofBench Dataset
435 Solutions · 145 Problems · 6 Contests · 3 Models · 0–7 Scale

Expert-annotated fine-grained ratings

76.5% within ±1: ProofGrader predictions align closely with expert scores.

0.008 bias: nearly unbiased evaluation across all test cases.

Outperforms pairwise: more efficient than tournament-style selection strategies.

Downstream Validation: Best-of-N Selection

Beyond offline metrics, we validate ProofGrader's practical utility through a best-of-n selection task—a standard proxy for evaluating reward model quality. Given n candidate proofs for each problem, the evaluator must identify the highest-quality solution. This task directly mirrors real-world applications like rejection sampling for data generation and reward modeling for reinforcement learning.

At n=16, ProofGrader achieves an average expert score of 4.14/7 for selected proofs, compared to 2.48 for a naive binary evaluator and 4.62 for the human oracle. This closes 78% of the performance gap, demonstrating that ProofGrader can nearly match human performance in identifying high-quality proofs—a critical capability for training future proof generators.
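The selection step itself is simple once scores are available. The sketch below, with hypothetical scores, also illustrates why fine-grained scoring helps: a binary verifier collapses all imperfect candidates, while a 0–7 score can still rank them.

```python
# Sketch of best-of-n selection. Scores are hypothetical stand-ins for
# evaluator output; the real scores come from an LLM grading call.
import random

def select(candidates, scores):
    """Pick the candidate with the highest score; ties are broken at random."""
    best = max(scores)
    return random.choice([c for c, s in zip(candidates, scores) if s == best])

proofs = ["p1", "p2", "p3", "p4"]
fine_scores = [3, 5, 2, 6]     # a fine-grained 0-7 evaluator separates them
binary_scores = [0, 0, 0, 0]   # a strict correct/incorrect verifier does not

print(select(proofs, fine_scores))  # the strongest partial proof wins
# select(proofs, binary_scores) would be an arbitrary pick among all four
```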

→ For additional experiments including comparison with tournament-style pairwise selection methods and analysis across different generator models, see Section 4 of the paper.

Why Is This Important?

While LLMs have made remarkable progress in mathematical reasoning, evaluating natural language proofs remains a critical unsolved challenge.

The Core Problem: Many proof problems either don't have a single verifiable final answer, or require assessing the reasoning process itself—not just the conclusion. For example, problems that ask to "prove" a statement without specifying a unique answer, or problems where an incorrect proof might accidentally arrive at the right answer.

Current Approaches Fall Short

Existing methods for proof evaluation each have significant limitations.

Why Reliable Evaluation Matters

A high-quality proof evaluator is essential for advancing the field: it enables applications such as best-of-n selection, rejection sampling for data generation, and reward modeling for reinforcement learning.

→ For a detailed discussion of the limitations of existing approaches and our motivation, see Section 1 of the paper.

ProofBench Dataset

To enable our systematic study, we introduce ProofBench—the first expert-annotated dataset of fine-grained proof ratings spanning multiple prestigious mathematics competitions and recent years.

145 Problems · 435 Solutions · 6 Competitions · 3 SOTA Models

We collected problems from six major mathematics competitions—USAMO, IMO, Putnam, USA TST, APMO, and EGMO—spanning 2022 to 2025. For each problem, we generated solutions using three state-of-the-art reasoning models: OpenAI o3, Gemini-2.5-Pro, and DeepSeek-R1-0528. This selection ensures diversity across both proprietary and open-source model families.

Competitions Covered

USAMO · IMO · Putnam · USA TST · APMO · EGMO

Years 2022–2025

Expert Annotation Process

Creating reliable ground-truth labels required a rigorous two-stage annotation pipeline:

Stage 1: Marking Scheme Generation. For each problem, we use an LLM to automatically generate a problem-specific marking scheme that identifies key proof steps, assigns point values, and specifies deduction rules. Our expert annotators validated that approximately 85% of these generated schemes were of high quality and suitable for grading.

Stage 2: Expert Grading. Five experts with Putnam-level or national Math Olympiad experience grade each proof on a 0–7 scale. Crucially, experts treat the marking scheme as a detailed reference rather than a rigid checklist, allowing them to fairly credit valid alternative solution approaches. To ensure consistency, 40% of solutions were assigned to multiple annotators, achieving over 90% agreement.

Why 0–7? This scale aligns with the grading standards of premier mathematics competitions (our problem sources), enabling nuanced assessment of partial credit and proof quality—far beyond binary "correct/incorrect" judgments. Our results demonstrate that this fine-grained scoring is essential for effective downstream applications.

Download ProofBench on HuggingFace

→ For complete dataset statistics, model performance breakdowns by competition, and details on the annotation pipeline, see Section 2 of the paper.

Methodology

Rather than training new models, we systematically explore the design space of evaluators to find the best configuration. Our approach is entirely training-free, making it practical and reproducible.

We conduct controlled experiments across four key design axes, testing different combinations to understand what makes an evaluator reliable. Here's a summary of our findings:

Key Takeaway: Our analysis shows that design factors have dramatically different impact magnitudes: backbone model > context > instruction style. Based on this, we recommend: (1) use the strongest available reasoning model, (2) always include a problem-specific marking scheme, and (3) match instruction rigidity to the model's capability level.

ProofGrader: Our Best Evaluator

Combining insights from our systematic study, ProofGrader uses OpenAI o3 as the backbone, provides both reference solutions and marking schemes as context, employs flexible instructions that allow for alternative solution paths, and ensembles five independent evaluation runs using median aggregation.
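The median-ensembling step can be sketched in a few lines. `score_once` is a hypothetical stand-in for a single o3 grading call; the canned run scores are invented for the example.

```python
# Sketch of ProofGrader's ensembling: score the proof k times independently
# and aggregate with the median, which is robust to an outlier run.
from statistics import median

def ensemble_score(proof, score_once, k=5):
    """Median of k independent 0-7 scores for one proof."""
    return median(score_once(proof) for _ in range(k))

# Toy example: simulate five noisy runs with a canned list of scores.
runs = iter([5, 4, 6, 5, 2])
score = ensemble_score("candidate proof", score_once=lambda p: next(runs))
print(score)
```

Median aggregation discards the stray low score (2) that a mean would average in, which is one reason to prefer it for noisy LLM judges.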

Performance: ProofGrader achieves an MAE of 0.926 against expert scores, with 76.5% of predictions within ±1 point and nearly zero bias (0.008). This significantly outperforms naive baselines and demonstrates strong alignment with human judgment.

→ For detailed ablation studies, statistical significance tests, and analysis of per-generator performance, see Section 3 of the paper.

Summary

This work addresses a critical bottleneck in mathematical proof generation by developing the first reliable, fine-grained evaluation methodology. Our contributions span dataset creation, systematic analysis, and practical validation.

This website provides a high-level overview of our work.
For complete experimental details, additional ablations, related work discussion, and future directions, please refer to the full paper.

Citation

If you use ProofBench or ProofGrader in your research, please cite our paper:

@misc{ma2025reliablefinegrainedevaluationnatural,
      title={Reliable Fine-Grained Evaluation of Natural Language Math Proofs}, 
      author={Wenjie Ma and Andrei Cojocaru and Neel Kolhe and Bradley Louie and Robin Said Sharif and Haihan Zhang and Vincent Zhuang and Matei Zaharia and Sewon Min},
      year={2025},
      eprint={2510.13888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.13888}, 
}