March 11, 2026
Bootstrapping Audio Embeddings from Multimodal LLMs
Google recently released Gemini Embedding 2, their first natively multimodal embedding model: text, images, video, audio, and documents are all mapped into a single 3072-dimensional vector space. This is part of a broader trend toward omni embedding models, unified models that handle all modalities in one architecture, from jina-embeddings-v4 to Omni-Embed-Nemotron to Omni-5.

TL;DR
- Multimodal LLMs can be transformed into compact audio embedding models, outperforming CLAP with significantly less data.
- The proposed 'module combination' approach stitches together audio encoders and LLMs taken from different models and training stages, enabling efficient bootstrapping.
- Starting from a pretrained MLLM provides cross-modal alignment and strong encoders, reducing data requirements.
- The method shows promise for developing unified omni embedding models that handle multiple data modalities.
- Audio embeddings have diverse applications including agent intent routing, real-time monitoring, and multimodal agent workflows.
- Challenges remain in generalizing to abstract audio descriptions, and cross-modal transfer does not survive aggressive model compression.
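The module-combination idea in the bullets above can be sketched in a few lines: a pretrained audio encoder produces per-frame features, a small projection maps them into a (frozen) LLM's hidden space, and the pooled output is projected to the final embedding. The code below is a minimal illustrative sketch, not the article's actual architecture; all dimensions, the random "pretrained" weights, and the `tanh` stand-in for the LLM layers are assumptions for demonstration only (the 3072-dimensional output mirrors the Gemini Embedding 2 figure mentioned above).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen for illustration.
AUDIO_FEAT_DIM = 128   # per-frame features from a pretrained audio encoder
LLM_DIM = 512          # hidden size of the (frozen) LLM backbone
EMBED_DIM = 3072       # final embedding dimensionality

# Stand-ins for pretrained weights: in a module-combination setup, the
# audio encoder and LLM come from different checkpoints, and lightweight
# projections stitch them together.
proj_audio_to_llm = rng.normal(scale=0.02, size=(AUDIO_FEAT_DIM, LLM_DIM))
proj_llm_to_embed = rng.normal(scale=0.02, size=(LLM_DIM, EMBED_DIM))

def embed_audio(frames: np.ndarray) -> np.ndarray:
    """Map (n_frames, AUDIO_FEAT_DIM) encoder output to one unit vector."""
    tokens = frames @ proj_audio_to_llm       # project into LLM token space
    hidden = np.tanh(tokens)                  # placeholder for frozen LLM layers
    pooled = hidden.mean(axis=0)              # mean-pool over time
    emb = pooled @ proj_llm_to_embed          # project to embedding space
    return emb / np.linalg.norm(emb)          # L2-normalize for cosine retrieval

clip = rng.normal(size=(100, AUDIO_FEAT_DIM))  # fake 100-frame audio clip
vec = embed_audio(clip)
print(vec.shape)  # (3072,)
```

Because only the small projection layers would need training, most of the cross-modal alignment comes for free from the pretrained components, which is the data-efficiency argument the TL;DR makes.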