Measuring the performance of our models on real-world tasks

December 8, 2025

TL;DR

GDPval evaluates AI model performance on 1,320 specialized tasks across 44 occupations in 9 key U.S. industries.
Tasks are based on real work products and vetted by experienced professionals, reflecting realistic knowledge work.
The evaluation aims to provide evidence-based insights into AI's progress and potential for assisting human professionals.
Early results show top AI models approaching, and in some cases matching or exceeding, the quality of expert-produced work.
Frontier models can complete GDPval tasks significantly faster and more cheaply than human experts, but human oversight is still required.
Future versions of GDPval will broaden its scope to cover more occupations, interactive workflows, and tasks involving ambiguity.
