I'm personally interested in multimodal generation and the tools that power it.
Clip-Length Budget: We optimized on ≤10 s clips; minute-scale audio may have artifacts or blow up memory. A hierarchical/windowed Audio-SDS could help here.
Audio-Model Bias: We rely on Stable Audio Open, so where it struggles, e.g., on rare instruments, speech, audio without silence at the end, or out-of-domain SFX, our method struggles too. Other diffusion models could help here.
See more work from the NVIDIA Spatial Intelligence Lab: research.nvidia.com/labs/toronto...
Work supported indirectly by MIT CSAIL, @vectorinstitute.ai
#nvidia #mit
We report improved SDR against ground-truth sources when they are available, and show improved CLAP scores after training.
CLAP scores for our prompts improve over the course of training, along with the qualitative results. Impact synthesis shows improved performance on impact-oriented prompts.
We demonstrate a pipeline that takes a video from the internet, captions its audio with a model (like AudioCaps), and passes the caption to an LLM assistant that suggests source decompositions. We then run our method on the suggested decompositions.
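In pseudocode, the pipeline is roughly this; every callable here is a placeholder for the corresponding component (audio extraction, an audio captioner, the LLM assistant, and the Audio-SDS separation described below), not a real API:

```python
def decompose_video(video_path, extract_audio, caption_audio, suggest_sources,
                    separate_sources):
    """Schematic in-the-wild pipeline: soundtrack -> caption -> LLM prompts -> separation."""
    audio = extract_audio(video_path)             # pull the soundtrack from the clip
    caption = caption_audio(audio)                # e.g. "a saxophone plays near passing cars"
    prompts = suggest_sources(caption)            # LLM assistant proposes a source list
    channels = separate_sources(audio, prompts)   # prompt-guided Audio-SDS separation
    return dict(zip(prompts, channels))
```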
🅰 We use an augmented Decoder-SDS in audio space, 🅱 a spectrogram emphasis to better weight transients, and 🅲 multiple denoising steps to increase fidelity.
The image highlights these components in red in the detailed overview of our update.
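For the curious, here is a minimal sketch of what one such update step could look like; `render`, `encode`, `decode`, and `denoise` are placeholders (not the actual Stable Audio Open API), and the noise schedule is schematic:

```python
import torch

def audio_sds_step(theta, prompt_emb, render, encode, decode, denoise,
                   n_denoise_steps=3, emphasis_weight=4.0):
    """One schematic Audio-SDS update on renderer parameters `theta`."""
    audio = render(theta)                        # differentiable render of the parameters

    with torch.no_grad():
        z = encode(audio)                        # latent of the rendered audio
        alpha = 0.5                              # stand-in for the noise schedule at level t
        t = torch.randint(200, 800, (1,))
        z_t = alpha * z + (1 - alpha) * torch.randn_like(z)   # forward-diffused latent (schematic)
        for _ in range(n_denoise_steps):         # (C) several denoising steps, not just one
            z_t = denoise(z_t, t, prompt_emb)
        audio_hat = decode(z_t)                  # (A) decode the target back to audio space

    # (B) spectrogram emphasis: a magnitude-spectrogram term that up-weights transients
    S = torch.stft(audio, n_fft=1024, return_complex=True).abs()
    S_hat = torch.stft(audio_hat, n_fft=1024, return_complex=True).abs()
    loss = ((audio - audio_hat) ** 2).mean() + emphasis_weight * ((S - S_hat) ** 2).mean()
    loss.backward()                              # gradient reaches theta only through render
    return loss.item()
```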
Prompt-conditioned source separation for a given audio, such as separating a “sax …” and “cars …” from music recorded on a road, by applying the Audio-SDS update to each channel while forcing the sum of the channels to reconstruct the original audio.
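A sketch of that objective, assuming a helper `audio_sds_grad` that applies the per-prompt Audio-SDS update and accumulates into each channel's gradient; the prompts, shapes, and hyperparameters are illustrative:

```python
import torch

def separate(mixture, prompts, audio_sds_grad, steps=1000, lr=1e-2):
    """Schematic prompt-guided separation: per-channel Audio-SDS + mixture reconstruction."""
    channels = [(mixture / len(prompts)).clone().requires_grad_(True) for _ in prompts]
    opt = torch.optim.Adam(channels, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # the channels must sum back to the observed mixture
        recon = ((torch.stack(channels).sum(0) - mixture) ** 2).mean()
        recon.backward()
        # pull each channel toward its prompt (assumed to accumulate into .grad)
        for ch, prompt in zip(channels, prompts):
            audio_sds_grad(ch, prompt)
        opt.step()
    return channels
```

Forcing the channels to sum back to the mixture is what keeps each prompt-guided channel anchored to the observed recording.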
We generate impacts consistent with prompts like “hitting pot with wooden spoon” by convolving an impact with learned object and reverb impulses. We learn the parametrized forms of both impulses.
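A sketch of such a renderer under simplifying assumptions: the object impulse is a few exponentially decaying sinusoidal modes, the reverb is a decaying noise tail, and the strike is a unit impulse; the actual parametrization may differ.

```python
import torch
import torch.nn.functional as F

def render_impact(freqs, decays, amps, reverb_decay, sr=16_000, dur=1.0):
    """Schematic impact renderer: strike -> modal object impulse -> reverb impulse."""
    t = torch.arange(int(sr * dur)) / sr
    # object impulse: a sum of exponentially decaying sinusoidal modes
    obj_ir = (amps[:, None] * torch.exp(-decays[:, None] * t)
              * torch.sin(2 * torch.pi * freqs[:, None] * t)).sum(0)
    # reverb impulse: exponentially decaying noise tail (simplified stand-in)
    reverb_ir = torch.randn_like(t) * torch.exp(-reverb_decay * t)
    # unit strike convolved with the object impulse, then with the reverb impulse
    strike = torch.zeros_like(t)
    strike[0] = 1.0

    def conv_full(x, ir):
        # flip the kernel so conv1d's cross-correlation becomes a true convolution
        return F.conv1d(x.view(1, 1, -1), ir.flip(0).view(1, 1, -1),
                        padding=ir.numel() - 1)[..., : t.numel()].view(-1)

    return conv_full(conv_full(strike, obj_ir), reverb_ir)
```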
A toy setup where we find settings that align with prompts like “kick drum, bass, reverb”, using sine oscillators that modulate each other's frequencies, as in an FM synthesizer.
We visualize the final optimized parameters as dial settings on a synthesizer's user interface.
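Roughly, the renderer is a tiny two-oscillator FM voice; the parameter set below is illustrative, not the exact one from the paper:

```python
import torch

def render_fm(carrier_hz, mod_hz, mod_index, gain, sr=16_000, dur=1.0):
    """Toy FM voice: a carrier sine whose phase is modulated by a second oscillator."""
    t = torch.arange(int(sr * dur)) / sr
    modulator = torch.sin(2 * torch.pi * mod_hz * t)
    return gain * torch.sin(2 * torch.pi * carrier_hz * t + mod_index * modulator)

# the knobs are plain tensors, so Audio-SDS gradients can update them directly
params = {k: torch.tensor(v, requires_grad=True)
          for k, v in {"carrier_hz": 55.0, "mod_hz": 110.0,
                       "mod_index": 2.0, "gain": 0.5}.items()}
audio = render_fm(**params)
```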
This image briefly summarizes the use case, optimizable parameters, rendering function, and parameter update.