Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Wenjie Ma1, Andrei Cojocaru1, Neel Kolhe1, Bradley Louie1, Robin Said Sharif1, Haihan Zhang3, Vincent Zhuang2, Matei Zaharia1, Sewon Min1,4
1UC Berkeley  ·  2Google DeepMind  ·  3Peking University  ·  4Allen Institute for AI

Abstract

Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0–7 scale to model-generated math proofs. We introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Our analysis yields ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores. Finally, we demonstrate its practical utility in a best-of-n selection task: at n=16, ProofGrader achieves an average score of 4.14/7, closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62).
0.93 Mean Absolute Error · 78% Gap Closed · 145 Problems · 435 Annotations

Key Results

Our systematic exploration of the evaluator design space reveals clear trends and produces a high-performing evaluator. Below, we highlight key findings from our experiments and demonstrate ProofGrader's practical utility.

Impact of Design Choices

The left chart shows how different context configurations affect evaluator performance. Using both marking scheme (MS) and reference solutions (Ref) achieves the lowest MAE of 0.964, while providing no context results in poor performance (MAE of 1.680). The right chart demonstrates ProofGrader's practical value: as we increase the number of candidate solutions from 1 to 16, ProofGrader consistently selects high-quality proofs, closely tracking the human oracle and far outperforming binary and baseline methods.

Impact of Context on Evaluator Quality

Lower MAE is better. Each bar represents a different context configuration: MS+Ref (marking scheme + reference solutions, MAE=0.964) performs best; MS only (marking scheme only, MAE=1.069) is second; Ref only (reference only, MAE=1.330) is third; None (no context, MAE=1.680) performs worst. Adding context dramatically improves evaluator quality.
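The MAE reported in these comparisons can be computed directly from paired score lists. The sketch below is a minimal illustration; the scores and configuration names are made up for the example, not taken from the dataset.

```python
# Minimal sketch: MAE between evaluator predictions and expert scores
# on the 0-7 scale. All numbers below are toy values for illustration.
from statistics import mean

def mae(predicted, expert):
    """Mean absolute error between two equal-length score lists."""
    assert len(predicted) == len(expert)
    return mean(abs(p - e) for p, e in zip(predicted, expert))

expert_scores = [7, 3, 0, 5, 2]
preds_ms_ref  = [7, 4, 1, 5, 2]   # evaluator given marking scheme + references
preds_none    = [5, 6, 2, 7, 0]   # evaluator given no context

print(mae(preds_ms_ref, expert_scores))  # lower is better
print(mae(preds_none, expert_scores))
```

In this toy example the context-rich configuration lands much closer to the expert scores, mirroring the trend in the chart above.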

Best-of-N Selection Performance
[Line chart: average expert score of the selected proof vs. n (1 to 16 candidates), comparing Human Oracle, ProofGrader, Fine-grained (Ref+MS), Fine-grained (Ref), Fine-grained (None), Binary (Ref), Binary (None), and a Baseline.]

Closes 78% of gap to human oracle at n=16

Dataset Insights

The charts below show how different state-of-the-art models perform on ProofBench and the structure of our dataset. Even the best model (OpenAI o3) achieves a mean score of only 2.87/7, highlighting substantial room for improvement in proof generation. Our dataset provides a balanced mix of problems across competitions and comprehensive coverage through multiple model generators.

Generator Performance on ProofBench
[Bar chart of mean scores on the 0–7 scale: o3 2.87, Gemini 2.53, R1 1.83]

OpenAI o3 leads with a mean score of 2.87/7, followed by Gemini-2.5-Pro (2.53) and DeepSeek-R1-0528 (1.83)

ProofBench Dataset
435 Solutions · 145 Problems · 6 Contests · 3 Models · 0–7 Scale

Expert-annotated fine-grained ratings

76.5% within ±1: ProofGrader predictions align closely with expert scores.

0.008 bias: nearly unbiased evaluation across all test cases.

Outperforms pairwise: more efficient than tournament-style selection strategies.

Downstream Validation: Best-of-N Selection

Beyond offline metrics, we validate ProofGrader's practical utility through a best-of-n selection task—a standard proxy for evaluating reward model quality. Given n candidate proofs for each problem, the evaluator must identify the highest-quality solution. This task directly mirrors real-world applications like rejection sampling for data generation and reward modeling for reinforcement learning.

At n=16, ProofGrader achieves an average expert score of 4.14/7 for selected proofs, compared to 2.48 for a naive binary evaluator and 4.62 for the human oracle. This closes 78% of the performance gap, demonstrating that ProofGrader can nearly match human performance in identifying high-quality proofs—a critical capability for training future proof generators.
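The selection step itself is simple once scores are available. The sketch below, with hypothetical scores, also illustrates why fine-grained scoring helps: a binary verifier collapses all imperfect candidates, while a 0–7 score can still rank them.

```python
# Sketch of best-of-n selection. Scores are hypothetical stand-ins for
# evaluator output; the real scores come from an LLM grading call.
import random

def select(candidates, scores):
    """Pick the candidate with the highest score; ties are broken at random."""
    best = max(scores)
    return random.choice([c for c, s in zip(candidates, scores) if s == best])

proofs = ["p1", "p2", "p3", "p4"]
fine_scores = [3, 5, 2, 6]     # a fine-grained 0-7 evaluator separates them
binary_scores = [0, 0, 0, 0]   # a strict correct/incorrect verifier does not

print(select(proofs, fine_scores))  # the strongest partial proof wins
# select(proofs, binary_scores) would be an arbitrary pick among all four
```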

→ For additional experiments including comparison with tournament-style pairwise selection methods and analysis across different generator models, see Section 4 of the paper.

Why Is This Important?

While LLMs have made remarkable progress in mathematical reasoning, evaluating natural language proofs remains a critical unsolved challenge.

The Core Problem: Many proof problems either don't have a single verifiable final answer, or require assessing the reasoning process itself—not just the conclusion. For example, problems that ask to "prove" a statement without specifying a unique answer, or problems where an incorrect proof might accidentally arrive at the right answer.

Current Approaches Fall Short

Existing methods for proof evaluation each have significant limitations.

Why Reliable Evaluation Matters

A high-quality proof evaluator is essential for advancing the field: it enables applications such as best-of-n selection, rejection sampling for data generation, and reward modeling for reinforcement learning.

→ For a detailed discussion of the limitations of existing approaches and our motivation, see Section 1 of the paper.

ProofBench Dataset

To enable our systematic study, we introduce ProofBench—the first expert-annotated dataset of fine-grained proof ratings spanning multiple prestigious mathematics competitions and recent years.

145 Problems · 435 Solutions · 6 Competitions · 3 SOTA Models

We collected problems from six major mathematics competitions—USAMO, IMO, Putnam, USA TST, APMO, and EGMO—spanning 2022 to 2025. For each problem, we generated solutions using three state-of-the-art reasoning models: OpenAI o3, Gemini-2.5-Pro, and DeepSeek-R1-0528. This selection ensures diversity across both proprietary and open-source model families.

Competitions Covered

USAMO · IMO · Putnam · USA TST · APMO · EGMO

Years 2022–2025

Expert Annotation Process

Creating reliable ground-truth labels required a rigorous two-stage annotation pipeline:

Stage 1: Marking Scheme Generation. For each problem, we use an LLM to automatically generate a problem-specific marking scheme that identifies key proof steps, assigns point values, and specifies deduction rules. Our expert annotators validated that approximately 85% of these generated schemes were of high quality and suitable for grading.

Stage 2: Expert Grading. Five experts with Putnam-level or national Math Olympiad experience grade each proof on a 0–7 scale. Crucially, experts treat the marking scheme as a detailed reference rather than a rigid checklist, allowing them to fairly credit valid alternative solution approaches. To ensure consistency, 40% of solutions were assigned to multiple annotators, achieving over 90% agreement.

Why 0–7? This scale aligns with the grading standards of premier mathematics competitions (our problem sources), enabling nuanced assessment of partial credit and proof quality—far beyond binary "correct/incorrect" judgments. Our results demonstrate that this fine-grained scoring is essential for effective downstream applications.

Download ProofBench on HuggingFace

→ For complete dataset statistics, model performance breakdowns by competition, and details on the annotation pipeline, see Section 2 of the paper.

Methodology

Rather than training new models, we systematically explore the design space of evaluators to find the best configuration. Our approach is entirely training-free, making it practical and reproducible.

We conduct controlled experiments across four key design axes, testing different combinations to understand what makes an evaluator reliable. Here's a summary of our findings:

Key Takeaway: Our analysis shows that design factors have dramatically different impact magnitudes: backbone model > context > instruction style. Based on this, we recommend: (1) use the strongest available reasoning model, (2) always include a problem-specific marking scheme, and (3) match instruction rigidity to the model's capability level.

ProofGrader: Our Best Evaluator

Combining insights from our systematic study, ProofGrader uses OpenAI o3 as the backbone, provides both reference solutions and marking schemes as context, employs flexible instructions that allow for alternative solution paths, and ensembles five independent evaluation runs using median aggregation.
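The median-ensembling step can be sketched in a few lines. `score_once` is a hypothetical stand-in for a single o3 grading call; the canned run scores are invented for the example.

```python
# Sketch of ProofGrader's ensembling: score the proof k times independently
# and aggregate with the median, which is robust to an outlier run.
from statistics import median

def ensemble_score(proof, score_once, k=5):
    """Median of k independent 0-7 scores for one proof."""
    return median(score_once(proof) for _ in range(k))

# Toy example: simulate five noisy runs with a canned list of scores.
runs = iter([5, 4, 6, 5, 2])
score = ensemble_score("candidate proof", score_once=lambda p: next(runs))
print(score)
```

Median aggregation discards the stray low score (2) that a mean would average in, which is one reason to prefer it for noisy LLM judges.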

Performance: ProofGrader achieves an MAE of 0.926 against expert scores, with 76.5% of predictions within ±1 point and nearly zero bias (0.008). This significantly outperforms naive baselines and demonstrates strong alignment with human judgment.

→ For detailed ablation studies, statistical significance tests, and analysis of per-generator performance, see Section 3 of the paper.

Summary

This work addresses a critical bottleneck in mathematical proof generation by developing the first reliable, fine-grained evaluation methodology. Our contributions span dataset creation, systematic analysis, and practical validation.

This website provides a high-level overview of our work.
For complete experimental details, additional ablations, related work discussion, and future directions, please refer to the full paper.

Citation

If you use ProofBench or ProofGrader in your research, please cite our paper:

@misc{ma2025reliablefinegrainedevaluationnatural,
      title={Reliable Fine-Grained Evaluation of Natural Language Math Proofs}, 
      author={Wenjie Ma and Andrei Cojocaru and Neel Kolhe and Bradley Louie and Robin Said Sharif and Haihan Zhang and Vincent Zhuang and Matei Zaharia and Sewon Min},
      year={2025},
      eprint={2510.13888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.13888}, 
}