Statistical Reasoning Benchmark

StatEval: A Comprehensive Benchmark for Large Language Models in Statistics

StatEval evaluates large language models on textbook-level statistical knowledge and on frontier research-level proof tasks. It includes TRACE, a pipeline for extracting self-contained theorem reasoning data from statistical literature.

Comprehensive Evaluation Leaderboard

Latest results from StatEval-mini, covering foundational statistical knowledge and research-level proof reasoning.

1,000 Foundational evaluation items
900 Research proof variants
9 Models in latest comparison
2024-2025 Research test articles
Leaderboard columns: Rank, Model, Grad Prob., Grad Stat., Grad ML, Grad Mean, Undergrad Prob., Undergrad Stat., Undergrad ML, Undergrad Mean, Overall

Scores are percentages. Foundational tasks include multiple-choice, short-answer, calculation, fill-in-the-blank, and proof-based problems.

Benchmark Structure

StatEval is organized along two axes: difficulty level and statistical discipline.

Foundational Knowledge Dataset

  • 22,262 problems from 76 textbooks, exams, course materials, and online resources.
  • 9,382 undergraduate and 12,880 graduate-level instances.
  • Three domains (Probability, Statistics, and Machine Learning), with finer course-level subdomains.

Statistical Research Dataset

  • 84,179 proof-based tasks from 6,953 articles published between 2000 and 2025.
  • Articles drawn from six top-tier statistics and machine learning journals.
  • Easy, Medium, and Hard variants generated from theorem dependency structures.
Figure: Overall benchmark overview. Two complementary branches cover curriculum-level statistical knowledge and frontier research-level derivations.
Figure: Foundational composition. Distribution across educational levels, statistical domains, and problem formats.
Figure: Research composition. Coverage across research subfields and theoretical property categories.

Problem Examples

The examples make the benchmark format concrete, in particular the difference between routine statistical exercises and research-level theorem reasoning.

Foundational tasks

Textbook-style questions test statistical definitions, calculations, modeling judgment, and proof-based reasoning across undergraduate and graduate curricula.

Research Easy

Prerequisite theorems or lemmas are provided as facts, so the model focuses on proving the target result.

Research Medium and Hard

Medium requires proving prerequisites before the main result; Hard removes prerequisite hints and requires building the proof chain independently.
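The three research settings can be sketched as prompt templates that differ only in how the prerequisites are presented. This is a minimal illustration, not the benchmark's actual prompt format: the `TheoremTask` class, the `build_prompt` function, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TheoremTask:
    """Hypothetical container for one extracted theorem."""
    target: str                                   # statement to prove
    prerequisites: list[str] = field(default_factory=list)


def build_prompt(task: TheoremTask, difficulty: str) -> str:
    """Build a prompt for one difficulty setting (wording is assumed)."""
    listed = "\n".join(f"- {p}" for p in task.prerequisites)
    if difficulty == "easy":
        # Prerequisites are given as facts the model may cite.
        return (f"You may use these facts without proof:\n{listed}\n"
                f"Prove: {task.target}")
    if difficulty == "medium":
        # Prerequisites are named but must be proved first.
        return (f"First prove each of the following:\n{listed}\n"
                f"Then prove: {task.target}")
    if difficulty == "hard":
        # No prerequisite hints; the model builds the chain itself.
        return f"Prove: {task.target}"
    raise ValueError(f"unknown difficulty: {difficulty!r}")
```

Calling `build_prompt(task, "hard")` on any task returns only the target statement, which mirrors the description of Hard as the setting without prerequisite hints.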

Figure: Foundational example. Representative curriculum-level statistical reasoning item.
Figure: Research-level variants. A theorem instantiated under Easy, Medium, and Hard settings.

TRACE Data Processing Pipeline

TRACE combines deterministic structural extraction with LLM-based semantic validation to convert unstructured statistical papers into self-contained theorem-level reasoning tasks.

1. Convert and segment

PDFs are normalized into structured Markdown, then theorems, assumptions, equations, and proofs are isolated with context-aware markers.

2. Harmonize notation

Document-level notation and surrounding assumptions are consolidated so each extracted problem remains self-contained.

3. Build dependencies

Prerequisite theorems, lemmas, definitions, and equations are organized into a topological dependency graph.
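A topological ordering of such a graph lists every prerequisite before the result that depends on it. The sketch below uses Python's standard `graphlib` module; the dependency map is a made-up example, and `proof_order` is a hypothetical helper rather than part of TRACE.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key lists the items it depends on.
dependencies = {
    "Theorem 1": {"Lemma 1", "Lemma 2"},
    "Lemma 2": {"Lemma 1", "Definition 1"},
    "Lemma 1": {"Definition 1"},
    "Definition 1": set(),
}


def proof_order(deps: dict[str, set[str]]) -> list[str]:
    """Return the items so that each one follows all of its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = proof_order(dependencies)
print(order)
# ['Definition 1', 'Lemma 1', 'Lemma 2', 'Theorem 1']
```

`TopologicalSorter` raises `graphlib.CycleError` if the map contains a circular dependency, which gives a cheap sanity check on extracted dependency data.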

4. Generate difficulties

Easy, Medium, and Hard variants are synthesized by changing how much dependency information is provided.

5. Patch carefully

Context-Aware Patching resolves local OCR artifacts or missing intermediate steps under a Zero-Deletion Policy.

6. Validate quality

Completeness, sufficiency, and consistency checks are followed by human expert review on sampled cases.

Figure: TRACE pipeline overview. Six modules cover conversion, segmentation, base problem generation, dependency parsing, multi-difficulty synthesis, and validation.
Figure: Distributed proof extraction. TRACE decouples upper-bound and lower-bound proof fragments while preserving logical structure.
Figure: Notation harmonization. Global notation prevents symbols from being misinterpreted when theorem blocks are extracted.
Figure: Context-Aware Patching. Local expansions repair typographical artifacts and logical jumps without rewriting the original proof strategy.

Adaptive Evaluation Protocol

Multiple-choice questions are graded by exact matching, while open-ended statistical derivations are scored through an adaptive process-based pipeline.
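Exact-match grading for multiple-choice items can be sketched in a few lines. The normalization step (trimming whitespace, parentheses, and periods, and ignoring case) is an assumption made for this illustration; the source does not say how answers are normalized before matching.

```python
def normalize_choice(answer: str) -> str:
    """Reduce an answer such as ' (b) ' or 'B.' to a bare label 'B' (assumed rule)."""
    return answer.strip().strip("().").upper()


def grade_multiple_choice(predicted: str, reference: str) -> bool:
    """Return True only when the normalized labels are identical."""
    return normalize_choice(predicted) == normalize_choice(reference)


print(grade_multiple_choice(" (b) ", "B"))  # True
print(grade_multiple_choice("A", "B"))      # False
```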

Logic alignment

A selector model routes each response according to whether it follows the reference proof strategy or uses a divergent but potentially valid path.
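The routing step can be pictured as a simple dispatcher. In this sketch the selector and both scorers are passed in as plain callables; in the protocol described here the selector is a model, so `is_aligned`, `score_aligned`, and `score_divergent` are hypothetical stand-ins, as are the returned route labels.

```python
from typing import Callable


def route_and_score(
    response: str,
    reference: str,
    is_aligned: Callable[[str, str], bool],       # stand-in for the selector model
    score_aligned: Callable[[str, str], float],   # reference-based scorer
    score_divergent: Callable[[str], float],      # independent verifier
) -> tuple[str, float]:
    """Send a response down one of the two scoring paths and return (route, score)."""
    if is_aligned(response, reference):
        return "reference-based", score_aligned(response, reference)
    return "independent", score_divergent(response)
```

Keeping the selector separate from the scorers means a valid proof that departs from the reference strategy is never compared step by step against a reference it does not follow.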

Reference-based scoring

Aligned solutions are mapped onto atomic proof steps and receive normalized partial credit for completed logical components.
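One simple reading of normalized partial credit is the fraction of atomic reference steps that the response completes. The sketch below assumes every step carries equal weight, which the source does not state; `partial_credit` is a hypothetical helper.

```python
def partial_credit(step_completed: list[bool]) -> float:
    """Fraction of atomic proof steps completed, assuming equal step weights."""
    if not step_completed:
        raise ValueError("the reference proof must have at least one step")
    return sum(step_completed) / len(step_completed)


# A response that completes three of four reference steps.
print(partial_credit([True, True, False, True]))  # 0.75
```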

Independent verification

Divergent solutions are judged on logical coherence, technical precision, and, when applicable, terminal accuracy (the correctness of the final result).

Figure: Scoring pipeline. Responses are routed to reference-based step verification or independent path verification.
Figure: Evaluation case. A compact example showing how aligned and divergent proof attempts are scored.