Justin Salamon
justinsalamon.bsky.social
Head of Sound Design AI Research at Adobe. Machine learning and signal processing for audio & video. Musician. He/him.
www.justinsalamon.com
To learn more, please check out our ICML'25 paper: "FLAM: Frame-Wise Language-Audio Modeling"
arxiv.org/abs/2505.053...

Big congratulations to Yusong Wu, @tsirif.bsky.social and the whole team from @adobe.com research, @mit.edu and @mila-quebec.bsky.social
FLAM: Frame-Wise Language-Audio Modeling
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improv...
arxiv.org
June 24, 2025 at 7:30 PM
FLAM is trained jointly on instance (global) and frame-wise (local) objectives.

The secret sauce: a memory-efficient, calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalances, during training.
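
To make the logit-adjustment idea concrete, here is a minimal sketch. It is not FLAM's actual loss. It assumes the generic form of logit adjustment, where the log-odds of an event's prior is added to the similarity logit before a per-frame binary cross-entropy. The function names, the `prior` and `scale` parameters, and the `weight` used to combine the global and frame-wise terms are all hypothetical; see the paper for the real formulation.

```python
import math


def prior_logit(prior: float) -> float:
    """Log-odds of a prior event probability (assumed known or estimated)."""
    return math.log(prior / (1.0 - prior))


def bce_with_logit(z: float, y: int) -> float:
    """Numerically stable binary cross-entropy for a single logit z and label y."""
    softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return softplus - y * z


def framewise_loss(sims, labels, prior, scale=1.0):
    """Mean per-frame BCE where each logit is scale * similarity + prior log-odds."""
    bias = prior_logit(prior)
    losses = [bce_with_logit(scale * s + bias, y) for s, y in zip(sims, labels)]
    return sum(losses) / len(losses)


def joint_loss(global_loss, frame_loss, weight=1.0):
    """Hypothetical combination of the instance-level and frame-level terms."""
    return global_loss + weight * frame_loss


# Four silent frames with uninformative (zero) similarities.
sims, labels = [0.0, 0.0, 0.0, 0.0], [0, 0, 0, 0]
print(framewise_loss(sims, labels, prior=0.5))   # no adjustment (bias is 0)
print(framewise_loss(sims, labels, prior=0.01))  # rare-event adjustment
```

In this toy case the adjusted loss on silent frames is about 0.01 instead of log 2 (about 0.69), so the similarity term no longer has to encode how rare the sound is.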
June 24, 2025 at 7:28 PM
Enter FLAM: Frame-Wise Language-Audio Modeling.

A model trained to produce a calibrated likelihood for *any* text prompt.

FLAM outperforms prior self-supervised models on both closed-set and open-set SED, while preserving strong retrieval and zero-shot classification accuracy.
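
Why calibrated frame-wise likelihoods matter for detection: once per-frame probabilities are comparable across prompts, one fixed threshold can turn them into events with onsets and offsets. The sketch below is a generic post-processing step, not code from FLAM. The function name, the 0.5 threshold, and the hop size are assumptions chosen for illustration.

```python
def frames_to_events(probs, threshold=0.5, hop_s=0.1):
    """Group consecutive frames with prob >= threshold into (onset, offset) pairs in seconds."""
    events = []
    onset = None
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and onset is None:
            onset = i  # event starts at this frame
        elif not active and onset is not None:
            events.append((onset * hop_s, i * hop_s))  # event ended before this frame
            onset = None
    if onset is not None:
        # Event still active at the end of the clip.
        events.append((onset * hop_s, len(probs) * hop_s))
    return events


# Made-up frame probabilities for one text prompt, with a 0.5 s hop.
probs = [0.1, 0.8, 0.9, 0.2, 0.7]
print(frames_to_events(probs, threshold=0.5, hop_s=0.5))
# → [(0.5, 1.5), (2.0, 2.5)]
```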
June 24, 2025 at 7:27 PM
Our goal is for the model to detect *any* sound via free-form text queries.

"So use CLAP", some of you will say.

The problem is that its output likelihoods are not calibrated across different prompts :(

That's OK for ranked retrieval, but for detection it's a no-go.
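
A toy illustration of that difference, using made-up similarity scores (hypothetical numbers, not measured from CLAP or any other model): ranking clips within each prompt gives the right answer, yet no single threshold shared across prompts recovers the correct detections.

```python
# Hypothetical prompt-to-clip similarity scores, invented for illustration only.
scores = {
    "dog barking": {"clip_a": 0.62, "clip_b": 0.48},     # sound is in clip_a only
    "glass breaking": {"clip_a": 0.21, "clip_b": 0.35},  # sound is in clip_b only
}
truth = {("dog barking", "clip_a"), ("glass breaking", "clip_b")}


def top_clip(prompt):
    """Ranked retrieval: pick the best-scoring clip for one prompt."""
    return max(scores[prompt], key=scores[prompt].get)


def detect(threshold):
    """Detection: flag every (prompt, clip) pair whose score clears one shared threshold."""
    return {
        (prompt, clip)
        for prompt, clips in scores.items()
        for clip, s in clips.items()
        if s >= threshold
    }


# Retrieval only compares scores within a prompt, so it works.
print(top_clip("dog barking"), top_clip("glass breaking"))

# Detection compares scores across prompts; scan thresholds from 0.00 to 1.00.
works = [t / 100 for t in range(101) if detect(t / 100) == truth]
print(works)  # empty: no shared threshold yields exactly the true detections
```

Any threshold low enough to catch "glass breaking" in clip_b (0.35) also fires on "dog barking" in clip_b (0.48), which is the false positive that per-prompt calibration is meant to remove.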
June 24, 2025 at 7:27 PM
Sound Event Detection (SED) models, i.e., models that find sounds in audio/video recordings, are typically constrained to a predefined "closed" set of sounds, like in this (old!) model below for urban sound detection.

It has some applications, but it doesn't address general purpose sound search.
June 24, 2025 at 7:27 PM
Here's another example of work from our group:

MultiFoley, a Video-to-Audio model that generates perfectly synced audio for video at 48 kHz and supports multimodal conditioning.

More on MultiFoley here: bsky.app/profile/czya...
December 9, 2024 at 7:04 PM