I'm personally interested in multimodal generation and the tools that power it.
Clip-Length Budget: We optimized on ≤10 s clips; minute-scale audio may have artifacts or blow up memory. A hierarchical/windowed Audio-SDS could help here.
Audio-Model Bias: We rely on Stable Audio Open, so where it struggles, e.g., on rare instruments, speech, audio without silence at the end, or out-of-domain SFX, our method struggles too. Other diffusion models could help here.
See more work from the NVIDIA Spatial Intelligence Lab: research.nvidia.com/labs/toronto...
Work supported indirectly by MIT CSAIL, @vectorinstitute.ai
#nvidia #mit
We report improved SDR against ground-truth sources when they are available, and show improved CLAP scores after training.
CLAP scores for our prompts improve over the course of training, along with the qualitative results. Impact synthesis shows improved performance on impact-oriented prompts.
We demonstrate a pipeline that takes a video from the internet, captions its audio with a model (like AudioCaps), and passes the caption to an LLM assistant that suggests source decompositions. We then run our method on the suggested decompositions.
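In pseudocode, the pipeline is roughly this; every callable here is a placeholder for the corresponding component (audio extraction, an audio captioner, the LLM assistant, and the Audio-SDS separation described below), not a real API:

```python
def decompose_video(video_path, extract_audio, caption_audio, suggest_sources,
                    separate_sources):
    """Schematic in-the-wild pipeline: soundtrack -> caption -> LLM prompts -> separation."""
    audio = extract_audio(video_path)             # pull the soundtrack from the clip
    caption = caption_audio(audio)                # e.g. "a saxophone plays near passing cars"
    prompts = suggest_sources(caption)            # LLM assistant proposes a source list
    channels = separate_sources(audio, prompts)   # prompt-guided Audio-SDS separation
    return dict(zip(prompts, channels))
```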
🅰 We use an augmented Decoder-SDS in audio space, 🅱 a spectrogram emphasis to better weight transients, and 🅲 multiple denoising steps to increase fidelity.
The image highlights these components in red in the detailed overview of our update.
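For the curious, here is a minimal sketch of what one such update step could look like; `render`, `encode`, `decode`, and `denoise` are placeholders (not the actual Stable Audio Open API), and the noise schedule is schematic:

```python
import torch

def audio_sds_step(theta, prompt_emb, render, encode, decode, denoise,
                   n_denoise_steps=3, emphasis_weight=4.0):
    """One schematic Audio-SDS update on renderer parameters `theta`."""
    audio = render(theta)                        # differentiable render of the parameters

    with torch.no_grad():
        z = encode(audio)                        # latent of the rendered audio
        alpha = 0.5                              # stand-in for the noise schedule at level t
        t = torch.randint(200, 800, (1,))
        z_t = alpha * z + (1 - alpha) * torch.randn_like(z)   # forward-diffused latent (schematic)
        for _ in range(n_denoise_steps):         # (C) several denoising steps, not just one
            z_t = denoise(z_t, t, prompt_emb)
        audio_hat = decode(z_t)                  # (A) decode the target back to audio space

    # (B) spectrogram emphasis: a magnitude-spectrogram term that up-weights transients
    S = torch.stft(audio, n_fft=1024, return_complex=True).abs()
    S_hat = torch.stft(audio_hat, n_fft=1024, return_complex=True).abs()
    loss = ((audio - audio_hat) ** 2).mean() + emphasis_weight * ((S - S_hat) ** 2).mean()
    loss.backward()                              # gradient reaches theta only through render
    return loss.item()
```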
Prompt-conditioned source separation for a given audio, such as separating a “sax …” and “cars …” from music recorded on a road, by applying the Audio-SDS update to each channel while forcing the sum of the channels to reconstruct the original audio.
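A sketch of that objective, assuming a helper `audio_sds_grad` that applies the per-prompt Audio-SDS update and accumulates into each channel's gradient; the prompts, shapes, and hyperparameters are illustrative:

```python
import torch

def separate(mixture, prompts, audio_sds_grad, steps=1000, lr=1e-2):
    """Schematic prompt-guided separation: per-channel Audio-SDS + mixture reconstruction."""
    channels = [(mixture / len(prompts)).clone().requires_grad_(True) for _ in prompts]
    opt = torch.optim.Adam(channels, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # the channels must sum back to the observed mixture
        recon = ((torch.stack(channels).sum(0) - mixture) ** 2).mean()
        recon.backward()
        # pull each channel toward its prompt (assumed to accumulate into .grad)
        for ch, prompt in zip(channels, prompts):
            audio_sds_grad(ch, prompt)
        opt.step()
    return channels
```

Forcing the channels to sum back to the mixture is what keeps each prompt-guided channel anchored to the observed recording.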
We generate impacts consistent with prompts like “hitting pot with wooden spoon” by convolving an impact with learned object and reverb impulses. We learn the parametrized forms of both impulses.
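A sketch of such a renderer under simplifying assumptions: the object impulse is a few exponentially decaying sinusoidal modes, the reverb is a decaying noise tail, and the strike is a unit impulse; the actual parametrization may differ.

```python
import torch
import torch.nn.functional as F

def render_impact(freqs, decays, amps, reverb_decay, sr=16_000, dur=1.0):
    """Schematic impact renderer: strike -> modal object impulse -> reverb impulse."""
    t = torch.arange(int(sr * dur)) / sr
    # object impulse: a sum of exponentially decaying sinusoidal modes
    obj_ir = (amps[:, None] * torch.exp(-decays[:, None] * t)
              * torch.sin(2 * torch.pi * freqs[:, None] * t)).sum(0)
    # reverb impulse: exponentially decaying noise tail (simplified stand-in)
    reverb_ir = torch.randn_like(t) * torch.exp(-reverb_decay * t)
    # unit strike convolved with the object impulse, then with the reverb impulse
    strike = torch.zeros_like(t)
    strike[0] = 1.0

    def conv_full(x, ir):
        # flip the kernel so conv1d's cross-correlation becomes a true convolution
        return F.conv1d(x.view(1, 1, -1), ir.flip(0).view(1, 1, -1),
                        padding=ir.numel() - 1)[..., : t.numel()].view(-1)

    return conv_full(conv_full(strike, obj_ir), reverb_ir)
```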
A toy setup where we find settings that align with prompts like “kick drum, bass, reverb”, using sine oscillators that modulate each other's frequencies, as in an FM synthesizer.
We visualize the final optimized parameters as dial settings on a synthesizer's user interface.
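Roughly, the renderer is a tiny two-oscillator FM voice; the parameter set below is illustrative, not the exact one from the paper:

```python
import torch

def render_fm(carrier_hz, mod_hz, mod_index, gain, sr=16_000, dur=1.0):
    """Toy FM voice: a carrier sine whose phase is modulated by a second oscillator."""
    t = torch.arange(int(sr * dur)) / sr
    modulator = torch.sin(2 * torch.pi * mod_hz * t)
    return gain * torch.sin(2 * torch.pi * carrier_hz * t + mod_index * modulator)

# the knobs are plain tensors, so Audio-SDS gradients can update them directly
params = {k: torch.tensor(v, requires_grad=True)
          for k, v in {"carrier_hz": 55.0, "mod_hz": 110.0,
                       "mod_index": 2.0, "gain": 0.5}.items()}
audio = render_fm(**params)
```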
This image briefly summarizes the use case, optimizable parameters, rendering function, and parameter update.