Scott Pesme
@skate-the-apple.bsky.social
Postdoc at Inria Grenoble with Julien Mairal.
scottpesme.github.io
Maybe we can invite Macron to present a poster for the next Neurips@Paris edition
November 27, 2025 at 10:18 PM
learned --> learnt
January 8, 2025 at 4:33 PM
Ah yes, that must be it, well spotted!
January 6, 2025 at 8:10 PM
Emmanuel Macron, Jean-Luc Mélenchon, but who is "PM"??
January 6, 2025 at 12:44 PM
For some reason the Firefox web browser was the issue, I switched to Chrome and it worked fine! Thanks for the help :)
December 23, 2024 at 1:32 PM
Did they remove the recording? Will they put it back online at some point?
December 20, 2024 at 8:13 PM
Voilà! This was a super fun project, and I'd be happy to discuss more with anyone interested. A huge thanks to my advisor Nicolas, who was a great help throughout the project.
And the bonus: an incremental Romain Gary!
November 19, 2024 at 4:53 PM
To sum up: we prove and characterise the saddle-to-saddle dynamics that occur when training diagonal linear networks with vanishing initialisation. The visited saddles and jump times can be computed using a simple algorithm.
November 19, 2024 at 4:49 PM
It turns out that this key equation enables us to fully describe the trajectory. Indeed, the successive saddles as well as the jump times can be computed using an algorithm reminiscent of the Homotopy / LARS algorithm, which computes the Lasso regularisation path. The job is done!
November 19, 2024 at 4:49 PM
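To illustrate the analogy (this is not the paper's exact algorithm, just the Lasso path it resembles), here is a minimal sketch using scikit-learn's `lars_path`: the breakpoints of the path play the role of the jump times and the growing active sets play the role of the visited saddles.

```python
# Illustrative sketch only: the Lasso regularisation path computed by LARS
# has the same flavour as the saddle-to-saddle description -- piecewise
# behaviour with breakpoints (analogous to jump times) and changing active
# sets (analogous to the successively visited saddles).
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, d, k = 50, 20, 3                      # samples, dimension, sparsity
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:k] = [3.0, -2.0, 1.0]         # sparse ground truth
y = X @ beta_star

# alphas: penalty values at which the active set changes (breakpoints)
# coefs: Lasso solutions along the path, one column per breakpoint
alphas, active, coefs = lars_path(X, y, method="lasso")

for i, alpha in enumerate(alphas):
    support = np.flatnonzero(coefs[:, i])
    print(f"breakpoint {i}: alpha = {alpha:.3f}, active coordinates = {support}")
```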
Thanks to this time reparametrisation, the rescaled potential now converges towards the l1 norm. Consequently we expect the limiting flow to follow a precise key equation.
November 19, 2024 at 4:49 PM
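Schematically, writing φ_α for the mirror potential associated with initialisation scale α, the convergence reads as follows (up to the exact normalisation used in the paper):

```latex
% Schematic statement, normalisation as in the paper: the rescaled mirror
% potential converges to the l1 norm as the initialisation scale vanishes.
\[
  \frac{\phi_\alpha(\beta)}{\ln(1/\alpha^2)} \;\xrightarrow[\alpha \to 0]{}\; \|\beta\|_1 .
\]
```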
Now, in order to expose the limiting dynamics when taking the initialisation to zero, we must "accelerate time"; otherwise the iterates are simply stuck at the origin.
November 19, 2024 at 4:49 PM
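A rough picture of this time acceleration (the exact scaling is in the paper): with initialisation scale α the iterates only leave the origin after a time of order ln(1/α), so one studies the sped-up process below.

```latex
% Sped-up iterates (exact time rescaling as in the paper):
\[
  \tilde{\beta}_\alpha(t) \;=\; \beta_\alpha\bigl(t \cdot \ln(1/\alpha)\bigr),
\]
% which, as \alpha \to 0, converges to a piecewise-constant process
% jumping between saddles of the loss.
```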
Can we describe the visited saddles and their order? Can we compute the jump times? We answer all of these questions in the paper! To do so we leverage the classical mirror flow point of view.
November 19, 2024 at 4:49 PM
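For reference, the mirror-flow description mentioned here: the predictors β_t of a diagonal linear network follow a mirror flow for a potential φ_α determined by the initialisation.

```latex
% Mirror flow on the predictors (potential \phi_\alpha set by the initialisation):
\[
  \frac{\mathrm{d}}{\mathrm{d}t}\,\nabla \phi_\alpha(\beta_t) \;=\; -\,\nabla L(\beta_t) .
\]
```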
Now if we have a look at the iterates, we observe that the coordinates successively activate one after another (bottom). From a loss landscape point of view, the iterates jump from one saddle point of the loss to another (right). Hence the name "saddle-to-saddle" dynamics.
November 19, 2024 at 4:49 PM
Experimentally, when training such networks with constant-stepsize gradient descent, we observe that the train loss behaves more and more like a piecewise-constant process as we take the initialisation scale to 0. As is the case with more complex networks!
November 19, 2024 at 4:49 PM
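A minimal sketch of this experiment (an assumed setup, not the authors' code): a 2-layer diagonal linear network trained by full-batch gradient descent from several initialisation scales. The smaller the scale, the longer the plateaus between sudden drops of the train loss.

```python
# Sketch of the experiment: 2-layer diagonal linear network, beta = u * v,
# trained with full-batch gradient descent from initialisation scale alpha.
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 100
X = rng.standard_normal((n, d))
beta_star = np.zeros(d)
beta_star[:3] = [2.0, -1.5, 1.0]          # sparse ground truth
y = X @ beta_star

def train(alpha, lr=5e-3, steps=30_000):
    """Full-batch GD on the factorised parametrisation beta = u * v."""
    u = alpha * np.ones(d)                # first layer at scale alpha
    v = np.zeros(d)                       # second layer at zero
    losses = []
    for _ in range(steps):
        beta = u * v
        residual = X @ beta - y
        losses.append(0.5 * np.mean(residual ** 2))
        grad_beta = X.T @ residual / n    # gradient w.r.t. the predictor beta
        u, v = u - lr * grad_beta * v, v - lr * grad_beta * u
    return np.array(losses)

# Plotting the loss curves shows longer plateaus as alpha shrinks.
for alpha in [1e-1, 1e-3, 1e-6]:
    losses = train(alpha)
    print(f"alpha = {alpha:.0e}, final train loss = {losses[-1]:.2e}")
```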
To understand and describe this phenomenon, we consider our favourite architecture: a 2-layer diagonal linear network. Useless in practice, but rich in insights!
November 19, 2024 at 4:49 PM
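For concreteness, one common parametrisation of this architecture (the paper may use an equivalent variant):

```latex
% 2-layer diagonal linear network: linear predictions, with the effective
% weights factorised coordinate-wise as beta = u * v, trained on the square loss.
\[
  f_{u,v}(x) \;=\; \langle u \odot v,\, x \rangle, \qquad
  L(u,v) \;=\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(\langle u \odot v,\, x_i\rangle - y_i\bigr)^2 .
\]
```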
This odd behaviour is usually referred to as incremental learning and can occur for various tasks and architectures. However, these piecewise-constant curves only appear when the weights of the architecture are initialised close to zero.
November 19, 2024 at 4:49 PM
I had great fun collaborating with Mathieu, Suriya and Nicolas on this paper.
November 19, 2024 at 4:23 PM
Some takeaways:
November 19, 2024 at 4:23 PM
We explain the difference between SGD and GD by showing that the induced regularisation norm for SGD is indeed the l1 norm. For GD it is a weighted l1 norm that penalises the coordinates we want to recover! This explains the poor performance of GD with large stepsizes.
November 19, 2024 at 4:23 PM
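Schematically, the induced regulariser has the following weighted-l1 shape, with per-coordinate weights depending on the stepsize γ through the vector Gain(γ) (the exact expression is in the paper):

```latex
% Weighted l1 regulariser (exact weights in the paper); uniform weights
% w_i = 1 give back the plain l1 norm obtained for SGD.
\[
  \|\beta\|_{w} \;=\; \sum_{i=1}^{d} w_i\,|\beta_i| ,
  \qquad w_i \text{ a function of } \mathrm{Gain}(\gamma)_i .
\]
```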
But how come SGD with a large stepsize performs so well while GD with a large stepsize performs so poorly?! To understand this, we need to look at the shape of the vector "Gain(gamma)".
November 19, 2024 at 4:23 PM