Sebastian Loeschcke
@sloeschcke.bsky.social
Working on Efficient Training, Low-Rank Methods, and Quantization.
PhD at the University of Copenhagen 🇩🇰
Member of @belongielab.org, Danish Data Science Academy, and Pioneer Centre for AI 🤖
🔗 sebulo.github.io/
Thanks to my co-authors David Pitt, Robert Joseph George, Jiawei Zhao, Cheng Luo, Yuandong Tian, Jean Kossaifi, and @anima-anandkumar.bsky.social, and to @caltech.edu for hosting me this spring!
Paper: arxiv.org/abs/2501.02379
Code: github.com/neuraloperat...
TensorGRaD: Tensor Gradient Robust Decomposition for Memory-Efficient Neural Operator Training
Scientific problems require resolving multi-scale phenomena across different resolutions and learning solution operators in infinite-dimensional function spaces. Neural operators provide a powerful fr...
arxiv.org
June 3, 2025 at 3:17 AM
We also show strong results on other PDE benchmarks, including 𝐃𝐚𝐫𝐜𝐲 𝐟𝐥𝐨𝐰 and the 𝐁𝐮𝐫𝐠𝐞𝐫𝐬 equation, demonstrating TensorGRaD’s broad applicability across scientific domains.
June 3, 2025 at 3:17 AM
We test TensorGRaD on large-scale Navier–Stokes at 1024×1024 resolution with Reynolds number 10⁵, a highly turbulent setting. With mixed precision and a 75% optimizer-state reduction, it 𝐦𝐚𝐭𝐜𝐡𝐞𝐬 𝐟𝐮𝐥𝐥-𝐩𝐫𝐞𝐜𝐢𝐬𝐢𝐨𝐧 𝐀𝐝𝐚𝐦 while cutting overall memory by up to 50%.
June 3, 2025 at 3:17 AM
We also propose a 𝐦𝐢𝐱𝐞𝐝-𝐩𝐫𝐞𝐜𝐢𝐬𝐢𝐨𝐧 𝐭𝐫𝐚𝐢𝐧𝐢𝐧𝐠 strategy with weights, activations, and gradients in half precision and optimizer states in full precision, and empirically show that storing optimizer states in half precision hurts performance.
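A minimal sketch of the idea, not the paper's training code: a bf16 model (so weights, activations, and gradients are half precision) with a hand-rolled Adam whose moment buffers stay in fp32. The toy model and hyperparameters are placeholders.

    import torch

    # Stand-in for a neural operator: weights, activations, gradients in bf16.
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 64)
    ).to(torch.bfloat16)

    # Adam moment buffers kept in full precision, independent of parameter dtype.
    moments = {p: (torch.zeros_like(p, dtype=torch.float32),
                   torch.zeros_like(p, dtype=torch.float32))
               for p in model.parameters()}
    lr, b1, b2, eps, step = 1e-3, 0.9, 0.999, 1e-8, 0

    def adam_step():
        global step
        step += 1
        for p in model.parameters():
            if p.grad is None:
                continue
            g = p.grad.float()                        # promote half-precision gradient
            m, v = moments[p]
            m.mul_(b1).add_(g, alpha=1 - b1)          # fp32 first moment
            v.mul_(b2).addcmul_(g, g, value=1 - b2)   # fp32 second moment
            m_hat = m / (1 - b1 ** step)
            v_hat = v / (1 - b2 ** step)
            p.data.add_((-lr * m_hat / (v_hat.sqrt() + eps)).to(p.dtype))

    x = torch.randn(8, 64, dtype=torch.bfloat16)
    model(x).pow(2).mean().backward()
    adam_step()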
June 3, 2025 at 3:17 AM
We extend low-rank and sparse methods to tensors via a 𝐫𝐨𝐛𝐮𝐬𝐭 𝐭𝐞𝐧𝐬𝐨𝐫 𝐝𝐞𝐜𝐨𝐦𝐩𝐨𝐬𝐢𝐭𝐢𝐨𝐧 that splits gradients into a low-rank Tucker part and an unstructured sparse tensor. Unlike matricized approaches, we prove our tensor-based method converges.
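A minimal sketch of such a split using tensorly's Tucker decomposition. The top-k rule, ranks, and helper name are illustrative placeholders, not the paper's exact algorithm.

    import torch
    import tensorly as tl
    from tensorly.decomposition import tucker

    tl.set_backend('pytorch')

    def robust_split(grad, rank, sparse_frac=0.01):
        """Hypothetical helper: keep the largest entries as a sparse tensor,
        then Tucker-compress the remainder."""
        k = max(1, int(sparse_frac * grad.numel()))
        thresh = torch.topk(grad.abs().flatten(), k).values.min()
        mask = grad.abs() >= thresh
        sparse_part = grad * mask                    # unstructured sparse component
        residual = grad - sparse_part
        core, factors = tucker(residual, rank=rank)  # low-rank Tucker component
        return sparse_part, (core, factors)

    # Example on a 4D gradient, e.g. a tensorized operator weight.
    g = torch.randn(32, 32, 16, 16)
    sparse_part, tucker_part = robust_split(g, rank=[8, 8, 4, 4])
    approx = sparse_part + tl.tucker_to_tensor(tucker_part)
    print(torch.norm(g - approx) / torch.norm(g))    # relative reconstruction error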
June 3, 2025 at 3:17 AM
Recent methods reduce optimizer memory for matrix weights, including low-rank and sparse methods from LLM training. But to use them for neural operators, we’d need to flatten tensor weights into matrices, which destroys their spatial/temporal structure and hurts performance (see the short illustration below).
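A small illustration of that flattening step, with made-up shapes and axis names, using tensorly's unfold:

    import torch
    import tensorly as tl

    tl.set_backend('pytorch')

    # Hypothetical 4D gradient: (in_channels, out_channels, x-modes, y-modes).
    g = torch.randn(32, 32, 16, 16)

    # Mode-0 matricization: one axis survives, the other three are mixed into
    # the columns, so the spatial layout the operator relies on is lost.
    g_mat = tl.unfold(g, mode=0)
    print(g_mat.shape)   # torch.Size([32, 8192])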
June 3, 2025 at 3:17 AM
These neural operators use tensor weights. However, optimizers like Adam store two full moment tensors per weight, making memory the bottleneck at scale.
TensorGRaD reduces this overhead by up to 75% (𝑑𝑎𝑟𝑘 𝑔𝑟𝑒𝑒𝑛 𝑏𝑎𝑟𝑠), without hurting accuracy.
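A quick back-of-the-envelope on the bookkeeping, with a made-up weight shape:

    import math

    shape = (64, 64, 128, 128)                   # hypothetical tensor weight
    n = math.prod(shape)
    weight_mib = n * 4 / 2**20                   # fp32 weight
    adam_states_mib = 2 * weight_mib             # Adam: two full moment tensors
    reduced_states_mib = 0.25 * adam_states_mib  # 75% optimizer-state reduction
    print(f"weight {weight_mib:.0f} MiB | Adam states {adam_states_mib:.0f} MiB "
          f"| reduced states {reduced_states_mib:.0f} MiB")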
June 3, 2025 at 3:17 AM
Scientific computing operates on multiscale, multidimensional (𝐭𝐞𝐧𝐬𝐨𝐫) 𝐝𝐚𝐭𝐚. In weather forecasting, for example, inputs span space, time, and variables. Neural operators can capture these multiscale phenomena by learning an operator that maps between function spaces.
June 3, 2025 at 3:17 AM
While Pasadena will be my home, I’ll also be making trips to Austin, the Bay Area, and San Diego. If you’re nearby and up for a chat, reach out—let’s meet up!
January 28, 2025 at 3:57 PM