Introducing SAM Audio: The First Unified Multimodal Model for Audio Separation

January 28, 2026

TL;DR

SAM Audio is a new unified multimodal model that separates audio based on text, visual, or time-segment prompts.
It is powered by the Perception Encoder Audiovisual (PE-AV) engine, an advancement of the Perception Encoder model.
SAM Audio-Bench is introduced as the first in-the-wild audio separation benchmark.
SAM Audio Judge is a new automatic model for evaluating audio separation quality based on perceptual criteria.
The model offers state-of-the-art performance and faster-than-real-time processing, and it supports multimodal prompting.
Limitations include the lack of support for audio prompts and difficulty separating highly similar sounds.
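
A note on the "faster-than-real-time" claim in the list above: one common way to express this is the real-time factor (RTF), which is processing time divided by audio duration. An RTF below 1 means the separation finishes in less time than the clip takes to play. The sketch below is only an illustration of that arithmetic. The function names and the sample numbers are hypothetical and are not measurements of SAM Audio.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (the RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if processing_seconds < 0:
        raise ValueError("processing_seconds must be non-negative")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    """True when the clip is processed in less time than it takes to play."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


# Hypothetical numbers for illustration only (not SAM Audio measurements):
# a 10-second clip separated in 4 seconds.
rtf = real_time_factor(4.0, 10.0)
print(f"RTF = {rtf:.2f}")  # prints "RTF = 0.40"
print(is_faster_than_real_time(4.0, 10.0))  # prints "True"
```

Under this definition, lower is better: an RTF of 0.40 means the clip is processed in 40% of its playback time.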
