StatEval — Benchmarking Statistical Reasoning in Large Language Models
StatEval, developed by the team of Professor Fan Zhou, is the first benchmark systematically organized along both difficulty and disciplinary axes to evaluate large language models’ statistical reasoning.
It includes a Foundational Knowledge Dataset comprising 22,262 problems (9,382 undergraduate and 12,880 graduate) curated from 76 classical textbooks and extensive exam collections.
It also features a Statistical Research Dataset consisting of 84,179 proof-based tasks derived from 6,953 high-impact research articles published between 2000 and 2025. These tasks are categorized by derivation difficulty into 40,366 Easy, 22,013 Medium, and 21,800 Hard problems.
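As a quick sanity check on the figures quoted above, the following minimal Python sketch tallies the per-category counts and reports each category's share of its dataset. The counts are copied by hand from this description. The variable names and the `shares` helper are illustrative assumptions, not part of any StatEval tooling, and nothing is read from the released data.

```python
# Counts copied from the description above (assumption: transcribed by hand,
# not loaded from the actual StatEval files).
foundational = {"undergraduate": 9_382, "graduate": 12_880}
research = {"Easy": 40_366, "Medium": 22_013, "Hard": 21_800}


def shares(counts):
    """Return each category's share of the total as an unrounded percentage."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


foundational_total = sum(foundational.values())
research_total = sum(research.values())
foundational_shares = shares(foundational)
research_shares = shares(research)

# Print the totals, then one line per category with its share.
print("Foundational Knowledge Dataset:", foundational_total)
for name, pct in foundational_shares.items():
    print(f"  {name}: {pct:.1f}%")

print("Statistical Research Dataset:", research_total)
for name, pct in research_shares.items():
    print(f"  {name}: {pct:.1f}%")
```

The totals come out to 22,262 and 84,179, matching the stated dataset sizes, so the per-category breakdowns are internally consistent.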
A representative partial test set (Demo) is publicly available on Hugging Face.