Grok-1.5 Vision Preview

December 17, 2025

TL;DR

Grok-1.5V is a first-generation multimodal model that processes text and visual information.
It can understand documents, diagrams, charts, screenshots, and photographs.
Grok-1.5V is competitive with other frontier multimodal models in areas such as document understanding and reasoning.
It outperforms peers on the new RealWorldQA benchmark for real-world spatial understanding.
The RealWorldQA benchmark, consisting of over 700 images, is released to the community under CC BY-ND 4.0.
Future developments aim to improve multimodal understanding and generation capabilities across images, audio, and video.
