December 15, 2025

FACTS Benchmark Suite: a new way to systematically evaluate LLM factuality

The FACTS Benchmark Suite provides a systematic evaluation of the factuality of Large Language Models (LLMs) across three areas: Parametric, Search, and Multimodal reasoning.

TL;DR

  • The FACTS Benchmark Suite measures LLM factuality across parametric, search, and multimodal tasks.
  • It includes three new benchmarks: Parametric (internal knowledge), Search (tool use), and Multimodal (image-based questions), plus an updated Grounding benchmark.
  • The suite contains 3,513 curated examples, with evaluation sets managed by Kaggle on a public leaderboard.
  • Gemini 3 Pro achieved the highest overall FACTS Score, 68.8%, with significant improvements on the Search and Parametric benchmarks.
  • All evaluated models scored below 70% accuracy, indicating room for future progress in LLM factuality.
