December 8, 2025

Measuring the performance of our models on real-world tasks

We’re introducing GDPval, a new evaluation that measures model performance on economically valuable, real-world tasks across 44 occupations.

TL;DR

  • GDPval evaluates AI model performance on 1,320 specialized tasks across 44 occupations in 9 key U.S. industries.
  • Tasks are based on real work products and vetted by experienced professionals, reflecting realistic knowledge work.
  • The evaluation aims to provide evidence-based insights into AI's progress and potential for assisting human professionals.
  • Early results show top AI models approaching, and in some cases matching or exceeding, the quality of expert-produced work.
  • Frontier models can complete GDPval tasks significantly faster and more cheaply than human experts, but human oversight is still required.
  • Future versions of GDPval will expand scope to include more occupations, interactive workflows, and tasks involving ambiguity.
