December 8, 2025
Measuring the performance of our models on real-world tasks
We’re introducing GDPval, a new evaluation that measures model performance on economically valuable, real-world tasks across 44 occupations.

TL;DR
- GDPval evaluates AI model performance on 1,320 specialized tasks across 44 occupations in 9 key U.S. industries.
- Tasks are based on real work products and vetted by experienced professionals, reflecting realistic knowledge work.
- The evaluation aims to provide evidence-based insights into AI's progress and potential for assisting human professionals.
- Early results show top AI models approaching, and in some cases matching or exceeding, the quality of expert-produced work.
- Frontier models can complete GDPval tasks significantly faster and at lower cost than human experts, but human oversight is still required.
- Future versions of GDPval will expand its scope to include more occupations, interactive workflows, and tasks involving ambiguity.