December 5, 2025
Jina-VLM: Small Multilingual Vision Language Model
We're releasing jina-vlm, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. By combining a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector, jina-vlm delivers strong performance across 29 languages while remaining efficient enough to run on consumer hardware.

TL;DR
- Jina-VLM is a 2.4B parameter vision-language model achieving state-of-the-art multilingual visual question answering among open 2B-scale VLMs.
- It combines a SigLIP2 vision encoder with a Qwen3 language backbone via an attention-pooling connector for token-efficient, arbitrary-resolution image processing.
- The model demonstrates strong performance across 29 languages and is efficient enough for consumer hardware.
- The key architectural innovation is the attention-pooling connector, which reduces visual tokens by 4x with minimal performance impact (see the sketch after this list).
- A two-stage training pipeline that mixes in text-only data preserves multilingual capabilities and avoids catastrophic forgetting.
- Limitations include potential tiling overhead for very high-resolution images and weaker performance on multi-image reasoning tasks.
- Future work aims to improve resolution handling and performance on spatial tasks, and to explore multilingual training for larger models.
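
To make the 4x token reduction concrete, here is a minimal PyTorch sketch of one way such an attention-pooling connector can work: a learned query cross-attends over each 2x2 window of patch tokens, so every four vision tokens collapse into one before projection into the language model's embedding space. The class name, window size, and dimensions below are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of a 2x2 attention-pooling connector (not the released jina-vlm code).
# Each 2x2 window of vision patch tokens is pooled into one token via a small
# cross-attention, giving a 4x reduction in visual tokens before the LLM.
import torch
import torch.nn as nn

class AttentionPool2x2(nn.Module):
    def __init__(self, vision_dim: int, llm_dim: int, num_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, vision_dim))  # learned pooling query (assumption)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)  # map pooled tokens into the LLM embedding space

    def forward(self, patches: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
        # patches: (batch, grid_h * grid_w, vision_dim) from the vision encoder
        b, _, d = patches.shape
        x = patches.view(b, grid_h // 2, 2, grid_w // 2, 2, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, 4, d)   # one row per 2x2 window
        q = self.query.expand(x.shape[0], -1, -1)            # one query per window
        pooled, _ = self.attn(q, x, x)                        # cross-attend over the 4 patches
        pooled = pooled.reshape(b, (grid_h // 2) * (grid_w // 2), d)
        return self.proj(pooled)                              # 4x fewer visual tokens

# Example with illustrative dimensions: a 16x16 patch grid (256 tokens) becomes 64 visual tokens.
connector = AttentionPool2x2(vision_dim=1152, llm_dim=2048)
tokens = connector(torch.randn(1, 256, 1152), grid_h=16, grid_w=16)
print(tokens.shape)  # torch.Size([1, 64, 2048])
```

Compared with plain average pooling over each window, a learned query can weight the more informative patches, which is the usual argument for attention-based pooling at the connector.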