Jonathan Lorraine
@jonlorraine.bsky.social
Research scientist @NVIDIA | PhD in machine learning @UofT. Previously @Google / @MetaAI. Opinions are my own. 🤖 💻 ☕️
This project was led by the great @jrichterpowell.bsky.social, along with Antonio Torralba.

See more work from the NVIDIA Spatial Intelligence Lab: research.nvidia.com/labs/toronto...

Work supported indirectly by MIT CSAIL, @vectorinstitute.ai

#nvidia #mit
May 9, 2025 at 4:06 PM
Results on Prompt-Guided Source Separation:

We report improved SDR against ground-truth sources when they are available, and show improved CLAP scores after training.
May 9, 2025 at 4:06 PM
Results on Tuning FM Synthesizers & Impact Synthesis:

CLAP scores improve over the course of training for our prompts, supported by qualitative results. Impact synthesis shows improved performance on impact-oriented prompts.
May 9, 2025 at 4:06 PM
Results on Fully-Automatic In-the-Wild Source Separation:

We demonstrate a pipeline that takes a video from the internet, captions the audio with a model (like AudioCaps), and provides the caption to an LLM assistant, which suggests source decompositions. We then run our method on the suggested decompositions.
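In pseudocode, the pipeline is roughly the following; every callable here is a hypothetical placeholder for the corresponding component, not our exact implementation:

```python
def suggest_and_separate(video_url, extract_audio, caption_model, llm, audio_sds_separate):
    """Illustrative in-the-wild pipeline; all arguments are hypothetical placeholders."""
    audio = extract_audio(video_url)           # pull the soundtrack from the internet video
    caption = caption_model(audio)             # audio-captioning model describes the clip
    prompts = llm(f"Suggest a source decomposition for: {caption}")  # LLM-suggested sources
    return audio_sds_separate(audio, prompts)  # prompt-guided separation on the suggestion
```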
May 9, 2025 at 4:06 PM
Modifications to SDS for Audio Diffusion:

🅰 We use an augmented Decoder-SDS in audio space, 🅱 apply a spectrogram emphasis to better weight transients, and 🅲 take multiple denoising steps to increase fidelity.

The image highlights these modifications in red in the detailed overview of our update.
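For intuition, a spectrogram-emphasis term could be sketched as below; the STFT settings and log-magnitude weighting are illustrative assumptions, not the paper's exact choices. Comparing (log-)magnitude spectrograms spreads a transient's energy across many frequency bins, so onsets are weighted more heavily than in a raw waveform distance.

```python
import torch

def spectrogram_emphasis_loss(audio, target, n_fft=1024, hop=256, eps=1e-6):
    """Illustrative loss on log-magnitude spectrograms, emphasizing transients
    relative to a plain waveform L2 distance."""
    window = torch.hann_window(n_fft, device=audio.device)
    spec_a = torch.stft(audio, n_fft, hop_length=hop, window=window, return_complex=True).abs()
    spec_t = torch.stft(target, n_fft, hop_length=hop, window=window, return_complex=True).abs()
    return ((torch.log(spec_a + eps) - torch.log(spec_t + eps)) ** 2).mean()
```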
May 9, 2025 at 4:06 PM
③ Prompt-Guided Source Separation:

Prompt-conditioned source separation for a given audio recording, such as separating “sax …” and “cars …” from music recorded on a road: we apply the audio-SDS update to each channel while forcing the sum of the channels to reconstruct the original audio.
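Conceptually, the objective combines a per-channel prompt term with a reconstruction constraint. A minimal sketch, where `sds_loss` is a hypothetical callable standing in for the per-channel Audio-SDS objective:

```python
import torch

def separation_objective(channels, mixture, prompt_embs, sds_loss, recon_weight=1.0):
    """channels: list of learnable per-source audio tensors; mixture: the given audio;
    prompt_embs: one prompt embedding per channel; sds_loss: hypothetical per-channel
    Audio-SDS objective."""
    prompt_term = sum(sds_loss(ch, emb) for ch, emb in zip(channels, prompt_embs))
    # Force the channels to sum back to the original recording.
    recon_term = ((torch.stack(channels).sum(dim=0) - mixture) ** 2).mean()
    return prompt_term + recon_weight * recon_term
```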
May 9, 2025 at 4:06 PM
② Physical Impact Synthesis:

We generate impacts consistent with prompts like “hitting pot with wooden spoon” by convolving an impact with a learned object impulse and a learned reverb impulse. We learn parametrized forms of both impulses.
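As a rough sketch of the rendering step (the parametrizations here, a modal sum of decaying sinusoids for the object and an exponentially decaying noise tail for the reverb, are assumptions for illustration):

```python
import math
import torch

def render_impact(impact, object_params, reverb_params, sr=44100, dur=1.0):
    """Illustrative rendering: impact convolved with object and reverb impulses."""
    t = torch.arange(int(sr * dur)) / sr
    freqs, decays, amps = object_params            # learnable 1-D tensors (assumed form)
    obj_ir = (amps[:, None] * torch.exp(-decays[:, None] * t)
              * torch.sin(2 * math.pi * freqs[:, None] * t)).sum(dim=0)
    noise, decay = reverb_params                   # learnable noise tail + decay (assumed form)
    reverb_ir = noise * torch.exp(-decay * t)
    # FFT-based convolution of the three signals.
    n = impact.shape[-1] + obj_ir.shape[-1] + reverb_ir.shape[-1] - 2
    spec = torch.fft.rfft(impact, n) * torch.fft.rfft(obj_ir, n) * torch.fft.rfft(reverb_ir, n)
    return torch.fft.irfft(spec, n)
```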
May 9, 2025 at 4:06 PM
① FM Synthesis:

A toy setup where we optimize synthesizer settings so the rendered audio aligns with prompts like “kick drum, bass, reverb”, using sine oscillators that modulate each other’s frequency, as in an FM synthesizer.

We visualize the final optimized parameters as dial settings on a synthesizer’s user interface.
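A minimal two-operator FM voice, as an illustration of the oscillator structure rather than the paper's exact synthesizer; every argument can be a learnable tensor, so the rendered audio is differentiable with respect to the "dial settings":

```python
import math
import torch

def fm_voice(carrier_hz, mod_hz, mod_index, amp_decay, sr=44100, dur=1.0):
    """Two-operator FM: a modulator sine shifts the carrier's instantaneous frequency."""
    t = torch.arange(int(sr * dur)) / sr
    modulator = torch.sin(2 * math.pi * mod_hz * t)
    carrier = torch.sin(2 * math.pi * carrier_hz * t + mod_index * modulator)
    return torch.exp(-amp_decay * t) * carrier     # simple decaying amplitude envelope
```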
May 9, 2025 at 4:06 PM
We propose three novel audio tasks: ① FM Synthesis, ② Physical Impact Synthesis, and ③ Prompt-Guided Source Separation.

The image briefly summarizes each task's use case, optimizable parameters, rendering function, and parameter update.
May 9, 2025 at 4:06 PM
Intuitively, our update finds a direction to move the audio that increases its probability under the prompt: we noise and then denoise the audio with our diffusion model, and “nudge” the audio toward the denoised result by propagating the update through our differentiable rendering to the audio parameters.
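A minimal sketch of this kind of update, with hypothetical `render` and `denoiser` standing in for the differentiable renderer and the pretrained audio diffusion model (an illustration of the SDS idea, not the exact Audio-SDS update):

```python
import torch

def sds_style_step(params, render, denoiser, prompt_emb, optimizer, sigma=0.5):
    """One illustrative SDS-style update on the audio parameters."""
    audio = render(params)                      # differentiable rendering of the parameters
    noisy = audio + sigma * torch.randn_like(audio)    # "noising" the current audio
    with torch.no_grad():
        denoised = denoiser(noisy, sigma, prompt_emb)  # "denoising" toward the prompt
    # Nudge the rendered audio toward the denoised estimate; gradients flow through
    # render() back to the parameters, not through the frozen denoiser.
    loss = 0.5 * ((audio - denoised) ** 2).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```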
May 9, 2025 at 4:06 PM
🔊 New NVIDIA paper: Audio-SDS 🔊
We repurpose Score Distillation Sampling (SDS) for audio, turning any pretrained audio diffusion model into a tool for diverse tasks, including source separation, impact synthesis & more.

🎧 Demos, audio examples, paper: research.nvidia.com/labs/toronto...

🧵below
May 9, 2025 at 4:06 PM
This project was led by Zhengyi Wang with Yikai Wang, Hang Su, Jun Zhu, Sanja Fidler, and Xiaohui Zeng.

See more work from the #NVIDIA Toronto AI Lab here: research.nvidia.com/labs/toronto...

Work supported by Tsinghua University, @vectorinst.bsky.social, @uoft.bsky.social #UofT #Tsinghua
December 12, 2024 at 7:11 PM
We generate diverse and high-quality 3D meshes directly from textual prompts without expanding the vocabulary or introducing new tokenizers.
December 12, 2024 at 7:11 PM
The model retains its language-understanding abilities, producing coherent, contextually appropriate dialogue and describing meshes in natural language.
December 12, 2024 at 7:11 PM
LLaMA-Mesh achieves mesh generation quality comparable to specialized models trained from scratch on 3D data, as evidenced by qualitative comparisons with state-of-the-art methods like MeshXL.
December 12, 2024 at 7:11 PM
We construct a dataset of text-mesh pairs and interleaved text-3D dialogues. We fine-tune a pre-trained LLaMA-3.1-8B-Instruct model on our curated dataset, allowing it to generate 3D meshes directly from text prompts and engage in conversational 3D content creation.
December 12, 2024 at 7:11 PM
We quantize vertex coordinates in the OBJ files, reducing the token count with minimal impact on geometric fidelity.
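For instance, a quantizer along these lines keeps each coordinate to a short integer token (the bin count is an illustrative assumption, not the paper's exact setting):

```python
import numpy as np

def quantize_vertices(vertices, n_bins=64):
    """Map float xyz coordinates to integers in [0, n_bins); illustrative only."""
    vertices = np.asarray(vertices, dtype=np.float64)
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    scaled = (vertices - lo) / np.maximum(hi - lo, 1e-9)          # normalize to [0, 1]
    return np.clip((scaled * (n_bins - 1)).round().astype(int), 0, n_bins - 1)
```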
December 12, 2024 at 7:11 PM
We represent 3D meshes using the OBJ file format, converting vertex coordinates and face definitions into plain text sequences that LLMs can process directly without modifying tokenizers or vocabularies.
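For example, a tiny triangle serialized this way is just a few lines of text the LLM can read and write (hypothetical helper, shown with quantized integer coordinates):

```python
def mesh_to_obj_text(vertices, faces):
    """Serialize a mesh as OBJ-style plain text: 'v x y z' vertex lines and
    'f i j k' (1-indexed) triangle face lines."""
    lines = [f"v {x} {y} {z}" for x, y, z in vertices]
    lines += [f"f {i + 1} {j + 1} {k + 1}" for i, j, k in faces]
    return "\n".join(lines)

# One quantized triangle -> three vertex lines and one face line:
print(mesh_to_obj_text([(0, 0, 0), (63, 0, 0), (0, 63, 0)], [(0, 1, 2)]))
```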
December 12, 2024 at 7:11 PM
We unify text and 3D meshes in a single format by representing the numerical values of a mesh's vertex coordinates and face definitions as plain text. We train end-to-end on interleaved text and 3D data. With a single, unified model, we can generate both text and 3D meshes.
December 12, 2024 at 7:11 PM
Check out the Blender addon powered by LLaMA-Mesh from @dylanebert.bsky.social! It’s an impressive example of how mesh generation can be integrated into familiar creative workflows, streamlining the design process 🔥

🧩 Blender Addon: github.com/huggingface/...
December 12, 2024 at 7:11 PM
Explore our interactive LLaMA-Mesh demo on @hf.co, built with @gradio-hf.bsky.social. Experiment with generating and understanding meshes from text input and examine the model’s performance firsthand, a step towards conversational 3D design workflows.

🕹️ Demo: huggingface.co/spaces/Zheng...
December 12, 2024 at 7:11 PM
🦙New #NVIDIA paper: LLaMA-Mesh 🦙

We enable LLMs to generate 3D meshes by representing them as plain text and fine-tuning, unifying 3D and text modalities in a single model.

🔎 Webpage research.nvidia.com/labs/toronto...
🕹️ Interactive Demo huggingface.co/spaces/Zheng...
💾 Model checkpoint available
December 12, 2024 at 7:11 PM
Huge thanks to my amazing collaborators. This project was led by Juhan Bae along with Wu Lin and @rogergrosse.bsky.social
Supported (indirectly) by @anthropic.com, NVIDIA, @vectorinst.bsky.social, @uoft.bsky.social / @uoftartsci.bsky.social
November 27, 2024 at 5:41 PM
By removing the top-k most influential training points identified by SOURCE and retraining, we predicted the resulting changes in the model's behavior more accurately than other methods.
November 27, 2024 at 5:41 PM
We tested SOURCE against other methods using the linear datamodeling score (LDS). SOURCE outperforms others, especially when models haven't fully converged or are trained in multiple stages!
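Roughly, the LDS asks whether attribution scores predict how a measurement (e.g., a query loss) changes when the model is retrained on random training subsets; a hedged sketch, assuming precomputed attributions and retrained measurements:

```python
import numpy as np
from scipy.stats import spearmanr

def linear_datamodeling_score(attributions, subsets, retrained_measurements):
    """Spearman correlation between predicted measurements (attribution scores summed
    over each subset) and measurements of models actually retrained on those subsets."""
    attributions = np.asarray(attributions)
    predicted = [attributions[np.asarray(s)].sum() for s in subsets]
    rho, _ = spearmanr(predicted, retrained_measurements)
    return rho
```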
November 27, 2024 at 5:41 PM