March 11, 2026

Bootstrapping Audio Embeddings from Multimodal LLMs

Google recently released Gemini Embedding 2, their first natively multimodal embedding model: text, images, video, audio, and documents are all mapped into a single 3072-dimensional vector space. This is part of a broader trend toward omni embedding models, unified models that handle all modalities in one architecture, from jina-embeddings-v4 to Omni-Embed-Nemotron to Omni-5.
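The practical payoff of a single shared vector space is cross-modal retrieval: a text query and the audio or image it describes should land near each other, so one cosine-similarity search covers all modalities. A minimal sketch of that idea, using random vectors as hypothetical stand-ins for real model outputs (the 3072 dimension matches the figure above; everything else here is illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Hypothetical embeddings: in a well-aligned unified space, a caption and
# the audio clip it describes end up close, regardless of modality.
text_vec = rng.standard_normal(3072)                     # e.g. "dog barking"
audio_vec = text_vec + 0.1 * rng.standard_normal(3072)   # the matching clip
unrelated = rng.standard_normal(3072)                    # e.g. "piano melody"

print(cosine_similarity(text_vec, audio_vec) > cosine_similarity(text_vec, unrelated))
# → True
```

The same nearest-neighbor machinery (FAISS, a vector database, plain matrix products) then serves text-to-audio, audio-to-text, or audio-to-audio search without separate indexes per modality.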
TL;DR

  • Multimodal LLMs can be transformed into compact audio embedding models, outperforming CLAP with significantly less data.
  • The proposed 'module combination' approach combines audio encoders and LLMs from different models and training stages for efficient bootstrapping.
  • Starting from a pretrained MLLM provides cross-modal alignment and strong encoders, reducing data requirements.
  • The method shows promise for developing unified omni embedding models that handle multiple data modalities.
  • Audio embeddings have diverse applications including agent intent routing, real-time monitoring, and multimodal agent workflows.
  • Challenges remain in generalizing to abstract audio descriptions, and cross-modal transfer does not survive aggressive model compression.
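The article's 'module combination' idea, as summarized above, amounts to stitching a pretrained audio encoder onto an LLM backbone and reading an embedding off the pooled hidden states. The summary gives no implementation details, so the following is a toy numpy sketch of that wiring only; every component name, dimension, and the tanh "LLM" are hypothetical stand-ins, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)

ENC_DIM, LLM_DIM, EMB_DIM = 512, 1024, 256   # illustrative sizes

def audio_encoder(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for a frozen pretrained encoder: waveform -> frame features."""
    n_frames = len(waveform) // 160           # e.g. 10 ms hop at 16 kHz
    return rng.standard_normal((n_frames, ENC_DIM))

W_proj = rng.standard_normal((ENC_DIM, LLM_DIM)) * 0.02   # adapter (trainable)
W_llm = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02    # toy frozen "LLM"
W_head = rng.standard_normal((LLM_DIM, EMB_DIM)) * 0.02   # embedding head

def embed_audio(waveform: np.ndarray) -> np.ndarray:
    frames = audio_encoder(waveform)          # (T, ENC_DIM)
    tokens = frames @ W_proj                  # project into LLM input space
    hidden = np.tanh(tokens @ W_llm)          # stand-in for LLM layers
    pooled = hidden.mean(axis=0)              # mean-pool over time
    emb = pooled @ W_head
    return emb / np.linalg.norm(emb)          # unit-norm audio embedding

emb = embed_audio(np.zeros(16000))            # 1 s of 16 kHz audio
print(emb.shape)                              # → (256,)
```

Because the encoder and LLM arrive pretrained (and, per the summary, already cross-modally aligned), only a small adapter and head would need training, which is what lets the approach get by with far less data than training a CLAP-style model from scratch.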
