StatEval — Benchmarking Statistical Reasoning in Large Language Models

February 15, 2026 April 21, 2026

We are excited to announce StatEval, a benchmark for evaluating the statistical reasoning of large language models, developed by the team of Professor Fan Zhou at Shanghai University of Finance and Economics. It is the first such benchmark to be systematically organized along both difficulty and disciplinary axes.

StatEval includes two carefully curated datasets:

Foundational Knowledge Dataset — over 13,000 problems sourced from 50+ textbooks, covering the full spectrum of foundational statistical knowledge.
Statistical Research Dataset — over 2,000 proof-based questions collected from 18 top-tier journals in statistics, probability, econometrics, and machine learning.

Both test sets are publicly available on Hugging Face. The datasets and the project website are linked below:

StatEval-Foundational-Knowledge
StatEval-Statistical-Research
StatEval Website
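
Because the benchmark is organized along both difficulty and disciplinary axes, results can be broken down per cell rather than reported as a single score. The following is a minimal sketch of that kind of breakdown, not StatEval's official evaluation code: the field names (difficulty, discipline, correct), the label values, and the sample records are all hypothetical placeholders rather than the datasets' actual schema.

```python
from collections import defaultdict

# Hypothetical graded results. Field names and label values are illustrative
# assumptions, not the actual StatEval schema.
results = [
    {"difficulty": "undergraduate", "discipline": "probability", "correct": True},
    {"difficulty": "undergraduate", "discipline": "probability", "correct": False},
    {"difficulty": "graduate", "discipline": "statistics", "correct": True},
    {"difficulty": "graduate", "discipline": "machine learning", "correct": False},
]


def accuracy_by_axes(records):
    """Return {(difficulty, discipline): accuracy} for a list of graded records."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        key = (record["difficulty"], record["discipline"])
        totals[key] += 1
        hits[key] += int(record["correct"])
    return {key: hits[key] / totals[key] for key in totals}


# Print one line per (difficulty, discipline) cell.
for (difficulty, discipline), acc in sorted(accuracy_by_axes(results).items()):
    print(f"{difficulty:>14} | {discipline:<16} | {acc:.2f}")
```

Keying on the (difficulty, discipline) pair keeps each cell separate, so a model that is strong in one discipline but weak at higher difficulty levels is not hidden behind one aggregate number.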