Mathieu Dagréou
@matdag.bsky.social
Postdoc at Inria working on optimization and machine learning.

Website: https://matdag.github.io/
Reposted by Mathieu Dagréou
AISTATS 2026 will be in Morocco!
July 30, 2025 at 8:07 AM
Reposted by Mathieu Dagréou
Tired of lengthy computations to derive scaling laws? This post is made for you: discover the sharpness of the z-transform!
francisbach.com/z-transform/
July 18, 2025 at 2:24 PM
Reposted by Mathieu Dagréou
❓ How long does SGD take to reach the global minimum on non-convex functions?
With W. Azizian, J. Malick, P. Mertikopoulos, we tackle this fundamental question in our new ICML 2025 paper: "The Global Convergence Time of Stochastic Gradient Descent in Non-Convex Landscapes"
June 18, 2025 at 1:59 PM
Reposted by Mathieu Dagréou
New paper on the generalization of Flow Matching www.arxiv.org/abs/2506.03719

🤯 Why does flow matching generalize? Did you know that the flow matching target you're trying to learn *can only generate training points*?

w @quentinbertrand.bsky.social @annegnx.bsky.social @remiemonet.bsky.social 👇👇👇
June 18, 2025 at 8:08 AM
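To see the claim in the post above concretely: under linear interpolation paths and a standard Gaussian source (a common flow matching setup, not necessarily the paper's exact construction), the exact velocity field for an empirical training set has a closed form, and integrating it drives every sample onto a training point. A minimal numpy sketch (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 2))            # 5 training points in 2D (toy data)

def exact_velocity(x, t, Y):
    """Marginal FM velocity u_t(x) = E[x1 - x0 | x_t = x] for
    x_t = (1 - t) x0 + t x1, with x0 ~ N(0, I) and x1 ~ empirical(Y)."""
    # posterior weights over training points given x_t = x
    d2 = ((x - t * Y) ** 2).sum(axis=1) / (2 * (1 - t) ** 2)
    w = np.exp(-(d2 - d2.min()))
    w /= w.sum()
    # given x1 = y_i and x_t = x, one has x1 - x0 = (y_i - x) / (1 - t)
    return (w[:, None] * (Y - x)).sum(axis=0) / (1 - t)

# integrate dx/dt = u_t(x) with Euler from t = 0 to t close to 1
x = rng.normal(size=2)                 # a sample from the Gaussian source
ts = np.linspace(0.0, 0.999, 2000)
for t0, t1 in zip(ts[:-1], ts[1:]):
    x = x + (t1 - t0) * exact_velocity(x, t0, Y)

print(np.min(np.linalg.norm(Y - x, axis=1)))   # tiny: x has collapsed onto a training point
```

Fitting this target exactly would therefore only reproduce the training set, which is the puzzle the paper starts from.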
Reposted by Mathieu Dagréou
Register for PAISS, 1-5 Sept 2025 at @inria_grenoble, with very talented speakers this year 🙂
paiss.inria.fr
cc @mvladimirova.bsky.social
June 9, 2025 at 8:55 PM
Reposted by Mathieu Dagréou
🎉🎉🎉 Our paper "Inexact subgradient methods for semialgebraic functions" has been accepted at Mathematical Programming!! This is joint work with Jérôme Bolte, Eric Moulines and Edouard Pauwels, where we study a subgradient method with errors for nonconvex nonsmooth functions.

arxiv.org/pdf/2404.19517
June 5, 2025 at 6:13 AM
Reposted by Mathieu Dagréou
I have cleaned up my lecture notes on Optimal Transport for Machine Learners a bit: arxiv.org/abs/2505.06589
May 13, 2025 at 5:18 AM
Reposted by Mathieu Dagréou
📣 New preprint 📣

**Differentiable Generalized Sliced Wasserstein Plans**

w/
L. Chapel
@rtavenar.bsky.social

We propose a Generalized Sliced Wasserstein method that provides an approximate transport plan and admits a differentiable approximation.

arxiv.org/abs/2505.22049 1/5
June 2, 2025 at 2:40 PM
Reposted by Mathieu Dagréou
The Nesterov Accelerated Gradient (NAG) algorithm refines gradient descent by using an extrapolation step before computing the gradient. It leads to faster convergence for smooth convex functions, achieving the optimal rate of O(1/k^2). www.mathnet.ru/links/ceedfb...
April 25, 2025 at 5:01 AM
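A minimal numpy sketch of NAG on a smooth convex quadratic, using the (k-1)/(k+2) momentum variant and step size 1/L, with L the largest eigenvalue of the Hessian:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
b = rng.normal(size=50)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)        # smooth convex quadratic
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2                       # Lipschitz constant of grad f

x = x_prev = np.zeros(20)
for k in range(1, 500):
    y = x + (k - 1) / (k + 2) * (x - x_prev)        # extrapolation (look-ahead) step
    x_prev, x = x, y - grad(y) / L                  # gradient step at the extrapolated point

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f(x) - f(x_star))                             # suboptimality decays like O(1/k^2)
```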
Reposted by Mathieu Dagréou
Our paper

**Geometric and computational hardness of bilevel programming**

w/ Jérôme Bolte, Tùng Lê & Edouard Pauwels

has been accepted to Mathematical Programming!

We study how difficult it may be to solve bilevel optimization beyond strongly convex inner problems

arxiv.org/abs/2407.12372
April 1, 2025 at 3:55 PM
Reposted by Mathieu Dagréou
📣 New preprint 📣

Learning Theory for Kernel Bilevel Optimization

w/ @fareselkhoury.bsky.social E. Pauwels @michael-arbel.bsky.social

We provide generalization error bounds for bilevel optimization problems where the inner objective is minimized over an RKHS.

arxiv.org/abs/2502.08457
February 20, 2025 at 1:55 PM
Reposted by Mathieu Dagréou
The complex-step approximation is a numerical method that approximates the derivative from a single function evaluation using complex arithmetic. It is a kind of “poor man's” automatic differentiation. https://nhigham.com/2020/10/06/what-is-the-complex-step-approximation/
February 17, 2025 at 6:00 AM
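A minimal numpy sketch of the trick: f'(x) ≈ Im f(x + ih)/h involves no subtraction of nearly equal values, so h can be pushed to 1e-20 and the derivative comes out at machine precision, unlike a finite difference:

```python
import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    """f'(x) ~= Im(f(x + i*h)) / h: one complex function evaluation,
    no subtractive cancellation, so h can be taken extremely small."""
    return np.imag(f(x + 1j * h)) / h

f = lambda x: np.exp(x) * np.sin(x)                       # any real-analytic f built from
exact = lambda x: np.exp(x) * (np.sin(x) + np.cos(x))     # complex-friendly operations

x0 = 1.3
print(complex_step_derivative(f, x0) - exact(x0))         # ~1e-15 (machine precision)
print((f(x0 + 1e-8) - f(x0)) / 1e-8 - exact(x0))          # forward difference: ~1e-8 at best
```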
Reposted by Mathieu Dagréou
📢 New preprint 📢

Bilevel gradient methods and Morse parametric qualification

w/ J. Bolte, T. Lê, E. Pauwels

We study bilevel optimization with a nonconvex inner objective and propose a new framework, Morse parametric qualification, to analyze bilevel algorithms.

arxiv.org/abs/2502.09074
February 14, 2025 at 10:18 AM
Reposted by Mathieu Dagréou
My rant of the day: You can be rejected from an MSCA postdoctoral fellowship despite scoring almost the maximum in every category and receiving virtually no negative feedback. Before I explain why this is disastrous (assuming it is a common issue), a few comments.
February 10, 2025 at 7:28 AM
Reposted by Mathieu Dagréou
Excited to share Soup-of-Experts, a new neural network architecture that, for any given task, can instantiate in a flash a small model that performs well on it.

Made with ❤️ at Apple

Thanks to my co-authors David Grangier, Angelos Katharopoulos, and Skyler Seto!

arxiv.org/abs/2502.01804
February 5, 2025 at 9:32 AM
Reposted by Mathieu Dagréou
Learning-rate schedules seem mysterious? Why does the loss drop so fast during cooldown?
It turns out that this behaviour can be described by a bound from *convex, nonsmooth* optimization.

A short thread on our latest paper 🚞

arxiv.org/abs/2501.18965
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
February 5, 2025 at 10:13 AM
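For context, a sketch of the warmup / constant / linear-cooldown ("wsd"-style) schedule family this post refers to; the function name, warmup length and cooldown fraction are purely illustrative, not the paper's settings:

```python
def wsd_schedule(step, total_steps, base_lr=1e-3, warmup_steps=100, cooldown_frac=0.2):
    """Warmup -> constant -> linear cooldown learning-rate schedule (illustrative values)."""
    cooldown_start = int((1 - cooldown_frac) * total_steps)
    if step < warmup_steps:                       # linear warmup
        return base_lr * (step + 1) / warmup_steps
    if step < cooldown_start:                     # constant (stable) phase
        return base_lr
    # linear cooldown to zero over the last cooldown_frac of training
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - cooldown_start)

lrs = [wsd_schedule(t, total_steps=1000) for t in range(1000)]
```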
Reposted by Mathieu Dagréou
Early stopping on validation loss? This leads to suboptimal calibration and refinement errors—but you can do better!
With @dholzmueller.bsky.social, Michael I. Jordan, and @bachfrancis.bsky.social, we propose a method that integrates with any model and boosts classification performance across tasks.
February 3, 2025 at 1:03 PM
Reposted by Mathieu Dagréou
Excited to see Sigmoid Attention accepted at ICLR 2025!!

Make attention ~18% faster with a drop-in replacement 🚀

Code:
github.com/apple/ml-sig...

Paper:
arxiv.org/abs/2409.04431
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
January 24, 2025 at 6:47 PM
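A minimal PyTorch sketch of the core idea, replacing the row-wise softmax with an elementwise sigmoid plus a constant logit bias (b = -log(n) is used here as a stabilizing choice); the linked repo provides the optimized kernels:

```python
import math
import torch

def sigmoid_attention(q, k, v, bias=None):
    """Attention with an elementwise sigmoid instead of softmax.
    q, k, v have shape (batch, heads, n, d); bias defaults to -log(n)."""
    n, d = q.shape[-2], q.shape[-1]
    if bias is None:
        bias = -math.log(n)                       # constant logit bias
    logits = q @ k.transpose(-2, -1) / math.sqrt(d) + bias
    return torch.sigmoid(logits) @ v              # no row-wise normalization

q = k = v = torch.randn(1, 4, 128, 64)
out = sigmoid_attention(q, k, v)                  # same shape as softmax attention output
print(out.shape)
```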
Reposted by Mathieu Dagréou
Conventional wisdom in ML is that the computation of full Jacobians and Hessians should be avoided. Instead, practitioners are advised to compute matrix-vector products, which are more in line with the inner workings of automatic differentiation (AD) backends such as PyTorch and JAX.
How to compute Hessian-vector products? | ICLR Blogposts 2024
iclr-blogposts.github.io
January 30, 2025 at 2:32 PM
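A minimal JAX sketch: a Hessian-vector product via forward-over-reverse AD costs only a small constant multiple of a gradient evaluation and never forms the Hessian matrix:

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.sin(x) ** 2) + 0.5 * x @ x   # toy scalar-valued function

def hvp(f, x, v):
    """Hessian-vector product H(x) @ v via jvp of grad (forward-over-reverse)."""
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.arange(5.0)
v = jnp.ones(5)
print(hvp(f, x, v))
print(jax.hessian(f)(x) @ v)    # dense Hessian, for checking only
```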
Reposted by Mathieu Dagréou
You think Jacobian and Hessian matrices are prohibitively expensive to compute on your problem? Our latest preprint with @gdalle.bsky.social might change your mind!
arxiv.org/abs/2501.17737
🧵1/8
January 30, 2025 at 2:32 PM
Reposted by Mathieu Dagréou
The Mathematics of Artificial Intelligence: In this introductory and highly subjective survey, aimed at a general mathematical audience, I showcase some key theoretical concepts underlying recent advancements in machine learning. arxiv.org/abs/2501.10465
January 22, 2025 at 9:11 AM
Reposted by Mathieu Dagréou
The Cheap Gradient Principle (Baur–Strassen, 1983) states that computing gradients via automatic differentiation is efficient: the gradient of a function f can be obtained at a cost proportional to that of evaluating f.
January 16, 2025 at 6:00 AM
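A rough way to see this empirically (a timing sketch, not a benchmark): in JAX, the jitted gradient of a scalar function runs in a small constant multiple of the time of the jitted function itself.

```python
import time
import jax
import jax.numpy as jnp

def f(x):
    # a scalar function with a reasonably deep computational graph
    for _ in range(100):
        x = jnp.tanh(x) * 1.01 + 0.1
    return jnp.sum(x ** 2)

f_jit = jax.jit(f)
g_jit = jax.jit(jax.grad(f))

x = jnp.ones(100_000)
f_jit(x).block_until_ready(); g_jit(x).block_until_ready()   # compile once before timing

t0 = time.perf_counter(); f_jit(x).block_until_ready(); t1 = time.perf_counter()
g_jit(x).block_until_ready(); t2 = time.perf_counter()
print((t2 - t1) / (t1 - t0))    # gradient/function time ratio: a small constant
```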
Reposted by Mathieu Dagréou
The Marchenko–Pastur law describes the limiting spectral distribution of the eigenvalues of large random covariance matrices: as the dimensions grow proportionally, the empirical spectral distribution converges weakly to this law. https://buff.ly/406Xaxg
January 22, 2025 at 6:00 AM
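A minimal numpy check: the eigenvalue histogram of a sample covariance matrix with aspect ratio γ = p/n ≤ 1 lines up with the Marchenko–Pastur density supported on [(1-√γ)², (1+√γ)²]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 4000, 1000                       # samples, dimensions; gamma = p/n = 0.25
X = rng.normal(size=(n, p))             # iid entries with unit variance
S = X.T @ X / n                         # sample covariance matrix
eigs = np.linalg.eigvalsh(S)

def mp_density(lam, gamma):
    """Marchenko-Pastur density for unit-variance entries and gamma <= 1."""
    lm, lp = (1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2
    out = np.zeros_like(lam)
    inside = (lam > lm) & (lam < lp)
    out[inside] = np.sqrt((lp - lam[inside]) * (lam[inside] - lm)) / (2 * np.pi * gamma * lam[inside])
    return out

# compare the eigenvalue histogram with the MP density on the bin centers
hist, edges = np.histogram(eigs, bins=50, density=True)
grid = 0.5 * (edges[1:] + edges[:-1])
print(np.max(np.abs(hist - mp_density(grid, p / n))))   # roughly agrees; tightens as n, p grow
```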
Reposted by Mathieu Dagréou
⭐️⭐️ Internship positions ⭐️⭐️
1) NLP and predictive ML to improve the management of stroke, in a multi-disciplinary and stimulating environment, under the joint supervision of @adrien3000 from @TeamHeka, myself from @soda_INRIA and Eric Jouvent from @APHP team.inria.fr/soda/files/...
January 10, 2025 at 2:16 PM
Reposted by Mathieu Dagréou
When optimization problems have multiple minima, algorithms favor specific solutions due to their implicit bias. For ordinary least squares (OLS), gradient descent initialized at zero converges to the minimum-norm solution among all minimizers. https://fa.bianp.net/blog/2022/implicit-bias-regression/
December 19, 2024 at 6:00 AM
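A minimal numpy check: on an underdetermined least-squares problem, gradient descent started at zero converges to the minimum-norm interpolating solution, i.e. the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                           # fewer equations than unknowns: infinitely many minimizers
A = rng.normal(size=(n, p))
b = rng.normal(size=n)

x = np.zeros(p)                          # zero initialization matters for the min-norm result
step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L with L the largest eigenvalue of A^T A
for _ in range(20000):
    x -= step * A.T @ (A @ x - b)        # gradient descent on 0.5 * ||Ax - b||^2

x_min_norm = np.linalg.pinv(A) @ b       # minimum-norm solution among all minimizers
print(np.linalg.norm(x - x_min_norm))    # ~0
```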