December 17, 2025
Grok-1.5 Vision Preview
Connecting the digital and physical worlds with our first multimodal model.

TL;DR
- Grok-1.5V is a first-generation multimodal model that processes text and visual information.
- It can understand documents, diagrams, charts, screenshots, and photographs.
- Grok-1.5V demonstrates competitive performance against other frontier multimodal models in areas like document understanding and reasoning.
- It outperforms peers on the new RealWorldQA benchmark for real-world spatial understanding.
- The RealWorldQA benchmark, consisting of over 700 images, is released to the community under CC BY-ND 4.0.
- Planned work aims to improve multimodal understanding and generation across images, audio, and video.