Scott Pesme
@skate-the-apple.bsky.social
Postdoc at Inria Grenoble with Julien Mairal.
scottpesme.github.io
Maybe we can invite Macron to present a poster for the next Neurips@Paris edition
November 27, 2025 at 10:18 PM
learned --> learnt
January 8, 2025 at 4:33 PM
Ah yes, that must be it, well spotted!
January 6, 2025 at 8:10 PM
Emmanuel Macron, Jean-Luc Mélenchon, but who is "PM"??
January 6, 2025 at 12:44 PM
For some reason the Firefox web browser was the issue, I switched to Chrome and it worked fine! Thanks for the help :)
December 23, 2024 at 1:32 PM
Did they remove the recording? Will they put it back online at some point?
December 20, 2024 at 8:13 PM
Voilà! This was a super fun project, and I'd be happy to discuss more with anyone interested. A huge thanks to my advisor Nicolas, who was a great help throughout the project.
And the bonus: an incremental Romain Gary!
November 19, 2024 at 4:53 PM
To sum up: we prove and characterise the saddle-to-saddle dynamics that occur when training diagonal linear networks with vanishing initialisation. The visited saddles and jump times can be computed using a simple algorithm.
November 19, 2024 at 4:49 PM
It turns out that this key equation enables us to fully describe the trajectory. Indeed, the successive saddles as well as the jump times can be computed using an algorithm reminiscent of the Homotopy / LARS algorithm, which computes the Lasso regularisation path. The job is done!
November 19, 2024 at 4:49 PM
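To illustrate the analogy (this is not the paper's exact algorithm, just the Lasso path it resembles), here is a minimal sketch using scikit-learn's `lars_path`: the breakpoints of the path play the role of the jump times and the growing active sets play the role of the visited saddles.

```python
# Illustrative sketch only: the Lasso regularisation path computed by LARS
# has the same flavour as the saddle-to-saddle description -- piecewise
# behaviour with breakpoints (analogous to jump times) and changing active
# sets (analogous to the successively visited saddles).
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, d, k = 50, 20, 3                      # samples, dimension, sparsity
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:k] = [3.0, -2.0, 1.0]         # sparse ground truth
y = X @ beta_star

# alphas: penalty values at which the active set changes (breakpoints)
# coefs: Lasso solutions along the path, one column per breakpoint
alphas, active, coefs = lars_path(X, y, method="lasso")

for i, alpha in enumerate(alphas):
    support = np.flatnonzero(coefs[:, i])
    print(f"breakpoint {i}: alpha = {alpha:.3f}, active coordinates = {support}")
```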
Thanks to this time reparametrisation, the rescaled potential now converges towards the l1 norm. Consequently we expect the limiting flow to follow a precise key equation.
November 19, 2024 at 4:49 PM
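Schematically, writing φ_α for the mirror potential associated with initialisation scale α, the convergence reads as follows (up to the exact normalisation used in the paper):

```latex
% Schematic statement, normalisation as in the paper: the rescaled mirror
% potential converges to the l1 norm as the initialisation scale vanishes.
\[
  \frac{\phi_\alpha(\beta)}{\ln(1/\alpha^2)} \;\xrightarrow[\alpha \to 0]{}\; \|\beta\|_1 .
\]
```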
Now, in order to expose the limiting dynamics when taking the initialisation to zero, we must "accelerate time"; otherwise the iterates are simply stuck at the origin.
November 19, 2024 at 4:49 PM
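A rough picture of this time acceleration (the exact scaling is in the paper): with initialisation scale α the iterates only leave the origin after a time of order ln(1/α), so one studies the sped-up process below.

```latex
% Sped-up iterates (exact time rescaling as in the paper):
\[
  \tilde{\beta}_\alpha(t) \;=\; \beta_\alpha\bigl(t \cdot \ln(1/\alpha)\bigr),
\]
% which, as \alpha \to 0, converges to a piecewise-constant process
% jumping between saddles of the loss.
```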
Can we describe the visited saddles and their order? Can we compute the jump times? We answer all of these questions in the paper! To do so we leverage the classical mirror flow point of view.
November 19, 2024 at 4:49 PM
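For reference, the mirror-flow description mentioned here: the predictors β_t of a diagonal linear network follow a mirror flow for a potential φ_α determined by the initialisation.

```latex
% Mirror flow on the predictors (potential \phi_\alpha set by the initialisation):
\[
  \frac{\mathrm{d}}{\mathrm{d}t}\,\nabla \phi_\alpha(\beta_t) \;=\; -\,\nabla L(\beta_t) .
\]
```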
Now if we have a look at the iterates, we observe that the coordinates successively activate one after another (bottom). From a loss landscape point of view, the iterates jump from one saddle point of the loss to another (right). Hence the name "saddle-to-saddle" dynamics.
November 19, 2024 at 4:49 PM
Experimentally, when training such networks with constant-stepsize gradient descent, we observe that the train loss behaves more and more like a piecewise-constant process as we take the initialisation scale to 0. As is the case with more complex networks!
November 19, 2024 at 4:49 PM
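A minimal sketch of this experiment (an assumed setup, not the authors' code): a 2-layer diagonal linear network trained by full-batch gradient descent from several initialisation scales. The smaller the scale, the longer the plateaus between sudden drops of the train loss.

```python
# Sketch of the experiment: 2-layer diagonal linear network, beta = u * v,
# trained with full-batch gradient descent from initialisation scale alpha.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 100
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:3] = [2.0, -1.5, 1.0]          # sparse ground truth
y = X @ beta_star

def train(alpha, lr=5e-3, steps=30_000):
    """Full-batch GD on the factorised parametrisation beta = u * v."""
    u = alpha * np.ones(d)                # first layer at scale alpha
    v = np.zeros(d)                       # second layer at zero
    losses = []
    for _ in range(steps):
        beta = u * v
        residual = X @ beta - y
        losses.append(0.5 * np.mean(residual ** 2))
        grad_beta = X.T @ residual / n    # gradient w.r.t. the predictor beta
        u, v = u - lr * grad_beta * v, v - lr * grad_beta * u
    return np.array(losses)

# Plotting the loss curves shows longer plateaus as alpha shrinks.
for alpha in [1e-1, 1e-3, 1e-6]:
    losses = train(alpha)
    print(f"alpha = {alpha:.0e}, final train loss = {losses[-1]:.2e}")
```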
To understand and describe this phenomenon, we consider our favourite architecture: a 2-layer diagonal linear network. Useless in practice, but rich in insights!
November 19, 2024 at 4:49 PM
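For concreteness, one common parametrisation of this architecture (the paper may use an equivalent variant):

```latex
% 2-layer diagonal linear network: linear predictions, with the effective
% weights factorised coordinate-wise as beta = u * v, trained on the square loss.
\[
  f_{u,v}(x) \;=\; \langle u \odot v,\, x \rangle, \qquad
  L(u,v) \;=\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(\langle u \odot v,\, x_i\rangle - y_i\bigr)^2 .
\]
```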
This odd behaviour is usually referred to as incremental learning and can occur for various tasks and architectures. However, these piecewise-constant curves only appear when the weights of the architecture are initialised close to zero.
November 19, 2024 at 4:49 PM
I had great fun collaborating with Mathieu, Suriya and Nicolas on this paper.
November 19, 2024 at 4:23 PM
Some takeaways:
November 19, 2024 at 4:23 PM
We explain the difference between SGD and GD by showing that the induced regularisation norm for SGD is indeed the l1 norm. For GD it is a weighted l1 norm that penalises the coordinates we want to recover! This explains the poor performance of GD with large stepsizes.
November 19, 2024 at 4:23 PM
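Schematically, the induced regulariser has the following weighted-l1 shape, with per-coordinate weights depending on the stepsize γ through the vector Gain(γ) (the exact expression is in the paper):

```latex
% Weighted l1 regulariser (exact weights in the paper); uniform weights
% w_i = 1 give back the plain l1 norm obtained for SGD.
\[
  \|\beta\|_{w} \;=\; \sum_{i=1}^{d} w_i\,|\beta_i| ,
  \qquad w_i \text{ a function of } \mathrm{Gain}(\gamma)_i .
\]
```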
But how come SGD with a large stepsize performs so well while GD with a large stepsize performs so poorly?! To understand this, we need to look at the shape of the vector "Gain(gamma)".
November 19, 2024 at 4:23 PM