Jonathan Lorraine
@jonlorraine.bsky.social
Research scientist @NVIDIA | PhD in machine learning @UofT. Previously @Google / @MetaAI. Opinions are my own. 🤖 💻 ☕️
Apply here: nvidia.eightfold.ai/careers?star...

I'm personally interested in multimodal generation and the tools that power it.
NVIDIA 2026 Internships: PhD Generative AI Research - US | NVIDIA Corporation
By submitting your resume, you're expressing interest in one of our 2026 Generative AI focused Research Internships. We'll review resumes on an ongoing basis, and a recruiter may reach out if your exp...
nvidia.eightfold.ai
October 9, 2025 at 4:57 AM
We find a new set of use cases for Stable Audio Open ( @jordiponsdotme.bsky.social, @stabilityai.bsky.social, @hf.co) and other large pretrained audio generative models, like AudioLDM and beyond!
May 9, 2025 at 4:06 PM
Our work is inspired by and builds on the SDS update of DreamFusion (dreamfusion3d.github.io/, @benmpoole.bsky.social , @ajayjain9.bsky.social , @jonbarron.bsky.social), and related updates (VSD, SDI @vincentsitzmann.bsky.social, SJC, many more!)
May 9, 2025 at 4:06 PM
💡 SDS treats any differentiable parameter set as optimizable from a prompt. Prompt-guided source separation emerged when we brainstormed novel uses. We hope similarly practical tasks surface (e.g., automatic Foley layering?) as the community experiments.
May 9, 2025 at 4:06 PM
🚀 Vision of the Future: Content designers easily use one video + audio diffusion backbone with SDS-style updates to nudge the parameters of any differentiable task (impacts, lighting, cloth, fluids) until the joint model says it “looks & sounds right”, given powerful user controls like text.
May 9, 2025 at 4:06 PM
⚠️ Limitations ⚠️

Clip-Length Budget: We optimized on ≤10 s clips; minute-scale audio may show artifacts or blow up memory. A hierarchical/windowed Audio-SDS could help here.
May 9, 2025 at 4:06 PM
⚠️ Limitations ⚠️

Audio-Model Bias: We rely on Stable Audio Open, so when it struggles (e.g., on rare instruments, speech, audio without trailing silence, or out-of-domain SFX), our method can struggle too. Other diffusion models can help here.
May 9, 2025 at 4:06 PM
This project was led by the great work of @jrichterpowell.bsky.social, together with Antonio Torralba.

See more work from the NVIDIA Spatial Intelligence Lab: research.nvidia.com/labs/toronto...

Work supported indirectly by MIT CSAIL, @vectorinstitute.ai

#nvidia #mit
May 9, 2025 at 4:06 PM
Results on Prompt-Guided Source Separation:

We report improved SDR against ground-truth sources (when available) and improved CLAP scores after training.
May 9, 2025 at 4:06 PM
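For reference, the basic form of the signal-to-distortion ratio (SDR) in dB can be sketched as below; this is the simple definition, not the full BSS-Eval variant often reported in separation papers:

```python
import torch

def sdr_db(reference, estimate, eps=1e-8):
    """Basic signal-to-distortion ratio in dB:
    10 * log10(||reference||^2 / ||reference - estimate||^2)."""
    num = (reference ** 2).sum()
    den = ((reference - estimate) ** 2).sum() + eps
    return 10.0 * torch.log10(num / den + eps)
```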
Results on Tuning FM Synthesizers & Impact Synthesis:

CLAP scores for the target prompts improve over training, and we show qualitative results. Impact synthesis shows improved performance on impact-oriented prompts.
May 9, 2025 at 4:06 PM
Results on Fully-Automatic In-the-Wild Source Separation:

We demonstrate a pipeline that takes a video from the internet, captions its audio with an audio-captioning model (e.g., one trained on AudioCaps), and passes the caption to an LLM assistant that suggests source decompositions. We then run our method on the suggested decompositions.
May 9, 2025 at 4:06 PM
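In pseudocode, with all three stages as placeholder callables rather than specific models:

```python
def in_the_wild_separation(video_path, caption_audio, suggest_sources, separate):
    """Sketch of the fully-automatic pipeline with placeholder callables:
    caption_audio wraps an audio-captioning model, suggest_sources wraps an LLM
    assistant proposing a decomposition, and separate runs prompt-guided Audio-SDS."""
    caption = caption_audio(video_path)    # e.g., "a saxophone plays over passing traffic"
    prompts = suggest_sources(caption)     # e.g., ["saxophone melody", "car traffic"]
    return separate(video_path, prompts)   # one separated channel per prompt
```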
Modifications to SDS for Audio Diffusion:

🅰 We use an augmented Decoder-SDS in audio space, 🅱 a spectrogram emphasis to better weight transients, and 🅲 multiple denoising steps to increase fidelity.

These are highlighted in red in the image's detailed overview of our update.
May 9, 2025 at 4:06 PM
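As a rough illustration of the spectrogram-emphasis idea (🅱), one can compare the rendered audio and the denoised target in both waveform and magnitude-spectrogram space so transients carry more weight; the exact weighting below is an assumption, not the paper's formula:

```python
import torch

def emphasis_distance(audio, target, n_fft=1024, hop=256, spec_weight=1.0):
    """Assumed form of a spectrogram-emphasis distance: combine a waveform MSE with
    a magnitude-spectrogram MSE, so transient content is weighted more strongly."""
    wave_term = ((audio - target) ** 2).mean()
    window = torch.hann_window(n_fft)
    spec = lambda x: torch.stft(x, n_fft, hop, window=window, return_complex=True).abs()
    spec_term = ((spec(audio) - spec(target)) ** 2).mean()
    return wave_term + spec_weight * spec_term
```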
③ Prompt-Guided Source Separation:

Prompt-conditioned source separation for a given audio clip, such as separating a “sax …” and “cars …” from a music recording made on a road, by applying the Audio-SDS update to each channel while forcing the sum of the channels to reconstruct the original audio.
May 9, 2025 at 4:06 PM
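A minimal sketch of that combined objective, with `audio_sds_loss` standing in for the Audio-SDS term and a hypothetical reconstruction weight:

```python
import torch

def separation_objective(channels, mixture, prompt_embs, audio_sds_loss, recon_weight=10.0):
    """Hypothetical combined objective: an Audio-SDS term pulls each learned channel
    toward its prompt, while a reconstruction term forces the channels to sum back
    to the original recording. audio_sds_loss stands in for the paper's update."""
    sds = sum(audio_sds_loss(ch, emb) for ch, emb in zip(channels, prompt_embs))
    recon = ((torch.stack(channels).sum(dim=0) - mixture) ** 2).mean()
    return sds + recon_weight * recon
```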
② Physical Impact Synthesis:

We generate impacts consistent with prompts like “hitting pot with wooden spoon” by convolving an impact excitation with learned object and reverb impulse responses, whose parametrized forms we optimize.
May 9, 2025 at 4:06 PM
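A rough sketch of such a renderer, assuming (purely as an illustration, not the paper's exact parametrization) that the object response is a sum of damped sinusoids and that both convolutions are done via FFT:

```python
import torch

def fft_convolve(a, b):
    """Linear convolution via FFT, differentiable w.r.t. both signals."""
    n = a.shape[-1] + b.shape[-1] - 1
    return torch.fft.irfft(torch.fft.rfft(a, n) * torch.fft.rfft(b, n), n)

def impact_render(freqs, decays, gains, reverb_ir, excitation, sr=16000):
    """Hypothetical modal-style renderer: the object response is a sum of damped
    sinusoids with learned freqs/decays/gains; the impact excitation is convolved
    with it and then with a learned reverb impulse response."""
    t = torch.arange(excitation.shape[-1]) / sr
    modes = torch.sin(2 * torch.pi * freqs[:, None] * t) * torch.exp(-decays[:, None] * t)
    object_ir = (gains[:, None] * modes).sum(0)
    struck = fft_convolve(excitation, object_ir)   # impact through the object response
    return fft_convolve(struck, reverb_ir)         # then through the learned reverb
```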
① FM Synthesis:

A toy setup where we generate synthesizer settings that align with prompts like “kick drum, bass, reverb”, using sine oscillators modulating each other's frequency as in an FM synthesizer.

We visualize the final optimized parameters as dial settings on a synthesizer's user interface.
May 9, 2025 at 4:06 PM
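For intuition, a toy two-operator FM render in PyTorch; the operator count and parameter names are illustrative assumptions, not the paper's exact synthesizer:

```python
import torch

def fm_render(params, sr=16000, dur=1.0):
    """Toy two-operator FM voice: a modulator sine warps the carrier's phase.
    Parameter names (f_car, f_mod, index, amp) are illustrative placeholders."""
    t = torch.arange(int(sr * dur)) / sr
    modulator = torch.sin(2 * torch.pi * params["f_mod"] * t)
    carrier = torch.sin(2 * torch.pi * params["f_car"] * t + params["index"] * modulator)
    return params["amp"] * carrier

# An Audio-SDS-style update can then treat these dials as the optimizable parameters:
params = {k: torch.tensor(v, requires_grad=True)
          for k, v in {"f_car": 60.0, "f_mod": 120.0, "index": 2.0, "amp": 0.5}.items()}
audio = fm_render(params)   # differentiable w.r.t. every dial
```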
We propose three novel audio tasks: ① FM Synthesis, ② Physical Impact Synthesis, and ③ Prompt-Guided Source Separation.

This image briefly summarizes each task's use case, optimizable parameters, rendering function, and parameter update.
May 9, 2025 at 4:06 PM
Intuitively, our update finds a direction to move the audio that increases its probability given the prompt: we noise and denoise the audio with our diffusion model, then “nudge” the audio toward the denoised result by propagating the update through our differentiable rendering to the audio parameters.
May 9, 2025 at 4:06 PM
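A minimal sketch of that nudge as a PyTorch training step, assuming a hypothetical differentiable `render(params)` producing a waveform and a `denoiser(noisy, sigma, prompt_emb)` wrapping a pretrained audio diffusion model (placeholder names, not the paper's API):

```python
import torch

def sds_step(params, render, denoiser, prompt_emb, optimizer, sigma=0.5):
    """One SDS-style nudge: noise the rendered audio, denoise it with the
    pretrained model conditioned on the prompt, then pull the render toward
    the denoised audio through the differentiable renderer."""
    audio = render(params)                             # differentiable render -> waveform
    noisy = audio + sigma * torch.randn_like(audio)    # simple additive-noise forward process
    with torch.no_grad():                              # stop-gradient through the denoiser, as in SDS
        denoised = denoiser(noisy, sigma, prompt_emb)
    # Pulling audio toward the denoised sample raises its likelihood under the prompt.
    loss = 0.5 * ((audio - denoised) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()                                    # gradient flows only through render(params)
    optimizer.step()
    return loss.item()
```

Repeating this step over random noise levels gives the basic loop; the Decoder-SDS, spectrogram-emphasis, and multi-step modifications mentioned in this thread build on top of it.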