February 5, 2026
Evals
SimpleQA Verified is a 1,000-prompt benchmark for reliably evaluating large language models (LLMs) on short-form factuality and parametric knowledge. The authors, from Google DeepMind and Google Research, address several limitations of the original SimpleQA benchmark, designed by Wei et al. (2024) at OpenAI: noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified gives the research community a more precise instrument for tracking genuine progress in factuality, discouraging overfitting to benchmark artifacts, and ultimately fostering more trustworthy AI systems.
TL;DR
- SimpleQA Verified is a 1,000-prompt benchmark for evaluating LLM short-form factuality and parametric knowledge, addressing limitations of the original SimpleQA.
- FACTS Grounding evaluates LLM ability to generate factually accurate responses grounded in provided long-form documents.
- The FACTS Benchmark suite holistically evaluates LLM factuality across parametric knowledge, search, multimodality, and grounding.
- DeepSearchQA is a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks with a 'causal chain' structure.
- The Chess Text Input Leaderboard provides a framework for evaluating LLMs' strategic reasoning capabilities in chess.
- The Chess Text Openings Leaderboard evaluates LLMs' strategic reasoning from specific early-game chess positions.
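
To make "short-form factuality" concrete, here is a minimal sketch of how a SimpleQA-style benchmark is scored: each prompt has a single gold answer, the model's response is graded as correct, incorrect, or not attempted, and abstentions are tracked separately from wrong guesses. The `Example` type, `query_model` callable, and substring-match `judge` below are hypothetical stand-ins; the actual SimpleQA pipeline grades responses with an LLM autograder, not string matching.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str     # e.g. "In what year was the Eiffel Tower completed?"
    gold_answer: str  # e.g. "1889"

def judge(response: str, gold: str) -> str:
    """Toy grader: empty response counts as an abstention, otherwise a
    case-insensitive substring match. The real benchmark uses an LLM
    judge here to handle paraphrases and hedged answers."""
    if not response.strip():
        return "not_attempted"
    return "correct" if gold.lower() in response.lower() else "incorrect"

def evaluate(examples: list[Example], query_model) -> dict:
    """query_model is any callable mapping a question string to an answer
    string (an API call in practice; a stub here)."""
    counts = {"correct": 0, "incorrect": 0, "not_attempted": 0}
    for ex in examples:
        counts[judge(query_model(ex.question), ex.gold_answer)] += 1
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "accuracy": counts["correct"] / len(examples),
        # SimpleQA also reports correct-given-attempted, which rewards
        # abstaining over guessing when the model is unsure.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }
```

Reporting both metrics is the point of the design: a model that declines to answer what it doesn't know scores better on correct-given-attempted than one that guesses, which is exactly the behavior a factuality benchmark wants to encourage.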