StatEval

Comprehensive Evaluation Leaderboard

Latest results from StatEval-mini, covering foundational statistical knowledge and research-level proof reasoning.

1,000 Foundational evaluation items

900 Research proof variants

9 Models in latest comparison

2024-2025 Research test articles

Rank	Model	Grad Prob.	Grad Stat.	Grad ML	Grad Mean	Undergrad Prob.	Undergrad Stat.	Undergrad ML	Undergrad Mean	Overall

Scores are percentages. Foundational tasks include multiple-choice, short-answer, calculation, fill-in-the-blank, and proof-based problems.

Rank	Model	Easy	Medium	Hard	Mean

Research tasks are proof-based theorem reasoning problems sampled from recent statistical papers.

Benchmark Structure

StatEval is organized along two axes: difficulty level and statistical discipline.

Foundational Knowledge Dataset

22,262 problems from 76 textbooks, exams, course materials, and online resources.
9,382 undergraduate and 12,880 graduate-level instances.
Three domains (Probability, Statistics, and Machine Learning), each divided into course-level subdomains.

Statistical Research Dataset

84,179 proof-based tasks from 6,953 articles published between 2000 and 2025.
Six top-tier statistical and machine learning journals.
Easy, Medium, and Hard variants generated from theorem dependency structures.

Overall benchmark overview: Two complementary branches cover curriculum-level statistical knowledge and frontier research-level derivations.

Foundational composition: Distribution across educational levels, statistical domains, and problem formats.

Research composition: Coverage across research subfields and theoretical property categories.

Problem Examples

Examples are included to make the benchmark format concrete, especially the difference between routine statistical exercises and research-level theorem reasoning.

Foundational tasks

Textbook-style questions test statistical definitions, calculations, modeling judgment, and proof-based reasoning across undergraduate and graduate curricula.

Research Easy

Prerequisite theorems or lemmas are provided as facts, so the model focuses on proving the target result.

Research Medium and Hard

Medium requires proving prerequisites before the main result; Hard removes prerequisite hints and requires building the proof chain independently.
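The three settings differ only in how prerequisites are exposed to the model. A minimal sketch of that idea follows, assuming a hypothetical `TheoremTask` record and illustrative prompt wording; neither is taken from the StatEval release.

```python
from dataclasses import dataclass


@dataclass
class TheoremTask:
    # Hypothetical record: the target result plus the statements it relies on.
    statement: str
    prerequisites: list[str]


def build_variant(task: TheoremTask, level: str) -> str:
    """Render one task as an Easy, Medium, or Hard prompt (illustrative wording)."""
    if level == "easy":
        # Prerequisites are handed over as facts that need no proof.
        facts = "\n".join(f"Fact: {p}" for p in task.prerequisites)
        return f"{facts}\nUsing the facts above, prove: {task.statement}"
    if level == "medium":
        # Prerequisites are named, but must be proved before the main result.
        steps = "\n".join(f"First prove: {p}" for p in task.prerequisites)
        return f"{steps}\nThen prove: {task.statement}"
    if level == "hard":
        # No prerequisite hints: the proof chain must be built independently.
        return f"Prove: {task.statement}"
    raise ValueError(f"unknown difficulty level: {level!r}")


task = TheoremTask(
    statement="The estimator is consistent.",
    prerequisites=["The estimator is unbiased.", "Its variance tends to zero."],
)
for level in ("easy", "medium", "hard"):
    print(f"--- {level} ---")
    print(build_variant(task, level))
```

The same underlying theorem yields all three prompts, so any difference in scores across levels reflects how much dependency information was withheld.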

Foundational example: Representative curriculum-level statistical reasoning item.

Research-level variants: A theorem instantiated under Easy, Medium, and Hard settings.

TRACE Data Processing Pipeline

TRACE combines deterministic structural extraction with LLM-based semantic validation to convert unstructured statistical papers into self-contained theorem-level reasoning tasks.

1. Convert and segment

PDFs are normalized into structured Markdown, then theorems, assumptions, equations, and proofs are isolated with context-aware markers.

2. Harmonize notation

Document-level notation and surrounding assumptions are consolidated so each extracted problem remains self-contained.

3. Build dependencies

Prerequisite theorems, lemmas, definitions, and equations are organized into a topological dependency graph.
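As an illustration of the dependency step, the sketch below orders a small, made-up set of results so that every prerequisite appears before the result that uses it. It relies on Python's standard `graphlib` module; the item names and the `proof_order` helper are hypothetical, and TRACE's actual graph construction may differ.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the items in its set.
dependencies = {
    "Theorem 1": {"Lemma 1", "Lemma 2"},
    "Lemma 2": {"Lemma 1", "Definition 1"},
    "Lemma 1": {"Definition 1"},
    "Definition 1": set(),
}


def proof_order(deps: dict[str, set[str]]) -> list[str]:
    """Return items so that every prerequisite precedes what depends on it."""
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(deps).static_order())


order = proof_order(dependencies)
print(order)
```

A circular dependency raises `graphlib.CycleError`, which is one way such a pipeline could flag extraction errors before generating tasks.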

4. Generate difficulties

Easy, Medium, and Hard variants are synthesized by changing how much dependency information is provided.

5. Patch carefully

Context-Aware Patching resolves local OCR artifacts or missing intermediate steps under a Zero-Deletion Policy.
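One plausible reading of a Zero-Deletion Policy is that a patch may add text but must keep every original token in its original order. The check below sketches that reading only; the whitespace tokenization and the `preserves_original` helper are assumptions, not the pipeline's documented rule.

```python
def preserves_original(original: str, patched: str) -> bool:
    """True if every token of `original` appears in `patched`, in the same order.

    Assumption: tokens are whitespace-separated, and insertions are allowed
    anywhere, but deletions and reorderings are not.
    """
    remaining = iter(patched.split())
    # `token in remaining` advances the iterator, so order is enforced.
    return all(token in remaining for token in original.split())


original = "X is bounded hence Y converges"
patched_ok = "X is bounded by Assumption 1 hence Y converges"
patched_bad = "X is bounded by Assumption 1 so Y converges"  # drops "hence"

print(preserves_original(original, patched_ok))
print(preserves_original(original, patched_bad))
```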

6. Validate quality

Completeness, sufficiency, and consistency checks are followed by human expert review on sampled cases.

TRACE pipeline overview: Six modules cover conversion, segmentation, base problem generation, dependency parsing, multi-difficulty synthesis, and validation.

Distributed proof extraction: TRACE decouples upper-bound and lower-bound proof fragments while preserving logical structure.

Notation harmonization: Global notation prevents symbols from being misinterpreted when theorem blocks are extracted.

Context-Aware Patching: Local expansions repair typographical artifacts and logical jumps without rewriting the original proof strategy.

Adaptive Evaluation Protocol

Multiple-choice questions are graded by exact matching, while open-ended statistical derivations are scored through an adaptive process-based pipeline.
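For the multiple-choice case, exact matching can be sketched in a few lines. The normalization shown here (trimming whitespace, parentheses, and periods, and ignoring case) is an assumption made for illustration; the page does not say how answers are normalized before comparison.

```python
def normalize_choice(answer: str) -> str:
    """Reduce an answer such as ' (b) ' or 'B.' to a bare upper-case label."""
    return answer.strip().strip("().").upper()


def exact_match(predicted: str, reference: str) -> bool:
    """Grade a multiple-choice answer by exact match after normalization."""
    return normalize_choice(predicted) == normalize_choice(reference)


print(exact_match(" (b) ", "B"))
print(exact_match("C", "B"))
```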

Logic alignment

A selector model routes each response according to whether it follows the reference proof strategy or uses a divergent but potentially valid path.
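The routing step can be pictured as a dispatch on the selector's label. In the sketch below the selector and both scorers are plain callables standing in for model calls; the label names `"aligned"` and `"divergent"`, the function names, and the toy selector are all hypothetical.

```python
from typing import Callable


def route_response(
    response: str,
    selector: Callable[[str], str],
    score_aligned: Callable[[str], float],
    score_divergent: Callable[[str], float],
) -> tuple[str, float]:
    """Send a response to the scorer that matches the selector's label."""
    label = selector(response)
    if label == "aligned":
        return label, score_aligned(response)
    if label == "divergent":
        return label, score_divergent(response)
    raise ValueError(f"unexpected selector label: {label!r}")


def toy_selector(response: str) -> str:
    # Toy stand-in for the selector model: a keyword test, for illustration only.
    return "aligned" if "by the reference lemma" in response.lower() else "divergent"


print(route_response("By the reference lemma, the bound follows.",
                     toy_selector, lambda r: 0.8, lambda r: 0.6))
print(route_response("We instead use a coupling argument.",
                     toy_selector, lambda r: 0.8, lambda r: 0.6))
```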

Reference-based scoring

Aligned solutions are mapped onto atomic proof steps and receive normalized partial credit for completed logical components.
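Normalized partial credit can be read as the share of reference steps a solution completes. The sketch below assumes equal step weights unless weights are supplied; the weighting scheme and the `partial_credit` name are illustrative assumptions, since the page does not specify how steps are weighted.

```python
from typing import Optional, Sequence


def partial_credit(
    step_completed: Sequence[bool],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Return earned weight divided by total weight, a score in [0, 1]."""
    if not step_completed:
        raise ValueError("at least one reference step is required")
    if weights is None:
        # Assumption: every atomic proof step counts equally.
        weights = [1.0] * len(step_completed)
    if len(weights) != len(step_completed):
        raise ValueError("weights and step_completed must have the same length")
    total = sum(weights)
    earned = sum(w for w, done in zip(weights, step_completed) if done)
    return earned / total


# Three of four reference steps completed.
print(partial_credit([True, True, False, True]))
```

Dividing by the total weight keeps scores comparable across proofs with different numbers of steps.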

Independent verification

Divergent solutions are judged by logical coherence, technical precision, and terminal accuracy when applicable.
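One simple way to combine the three criteria is to average whichever of them apply. The sketch below does that, skipping terminal accuracy when a problem has no final answer to check; the unweighted mean, the [0, 1] score range, and the `independent_score` name are assumptions rather than the benchmark's documented rule.

```python
from typing import Optional


def independent_score(
    coherence: float,
    precision: float,
    terminal_accuracy: Optional[float] = None,
) -> float:
    """Average the applicable criteria, each assumed to lie in [0, 1]."""
    parts = [coherence, precision]
    if terminal_accuracy is not None:
        # Only counted when the problem has a checkable final result.
        parts.append(terminal_accuracy)
    return sum(parts) / len(parts)


print(independent_score(1.0, 0.5))        # no terminal answer to check
print(independent_score(1.0, 0.5, 0.0))   # wrong final answer lowers the score
```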

Scoring pipeline: Responses are routed to reference-based step verification or independent path verification.

Evaluation case: A compact example showing how aligned and divergent proof attempts are scored.