December 15, 2025
FACTS Benchmark Suite: a new way to systematically evaluate LLM factuality
The FACTS Benchmark Suite provides a systematic evaluation of the factuality of Large Language Models (LLMs) across three areas: Parametric, Search, and Multimodal reasoning.
TL;DR
- The FACTS Benchmark Suite is introduced to measure LLM factuality across parametric, search, and multimodal tasks.
- It includes three new benchmarks: Parametric (internal knowledge), Search (tool use), and Multimodal (image-based questions), plus an updated Grounding benchmark.
- The suite contains 3,513 curated examples, with evaluation sets managed by Kaggle on a public leaderboard.
- Gemini 3 Pro achieved the highest overall FACTS Score of 68.8%, showing significant improvements on the Search and Parametric benchmarks (see the sketch after this list).
- All evaluated models scored below 70% accuracy, indicating room for future progress in LLM factuality.
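The TL;DR above reports a single overall FACTS Score alongside per-benchmark results. The exact aggregation method is not described here; the sketch below simply averages per-benchmark accuracies into one number, and all benchmark names and scores in it are hypothetical placeholders for illustration.

```python
# Minimal sketch, assuming the overall score is a plain mean of
# per-benchmark accuracies. The aggregation scheme and the example
# numbers below are assumptions, not the suite's documented method.
from statistics import mean


def overall_facts_score(benchmark_accuracies: dict[str, float]) -> float:
    """Average per-benchmark accuracies (each in [0, 1]) into one score."""
    return mean(benchmark_accuracies.values())


# Hypothetical per-benchmark results for a single model.
scores = {
    "parametric": 0.66,  # internal-knowledge questions
    "search": 0.72,      # tool-use / retrieval questions
    "multimodal": 0.61,  # image-based questions
    "grounding": 0.76,   # updated Grounding benchmark
}

print(f"Overall FACTS Score: {overall_facts_score(scores):.1%}")
```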