Mathieu Blondel
@mblondel.bsky.social
Research scientist, Google DeepMind
Am I the only one who feels this is awful? If someone wants to remain anonymous, people should respect that...
February 18, 2025 at 8:51 AM
yes!
February 10, 2025 at 2:06 PM
Cool work! We recently found that Tsallis q=1.5 (alpha=1.5 in our notation) seems to work really well across several datasets for language modeling: arxiv.org/abs/2501.18537. It would be great to find some theoretical justification for why 1.5 seems to be a sweet spot.
Loss Functions and Operators Generated by f-Divergences
The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language...
arxiv.org
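A quick numerical sketch of the Tsallis alpha-entropy mentioned above, assuming the convention used in the Fenchel-Young losses literature, H_alpha(p) = (1 - sum_i p_i^alpha) / (alpha (alpha - 1)): alpha -> 1 recovers the Shannon entropy and alpha = 2 the Gini ("sparsemax") entropy, so alpha = 1.5 sits halfway between the two. The function name and example values below are illustrative only.

import numpy as np

def tsallis_entropy(p, alpha=1.5):
    # Tsallis alpha-entropy: H_alpha(p) = (1 - sum_i p_i**alpha) / (alpha * (alpha - 1)),
    # which recovers the Shannon entropy in the limit alpha -> 1.
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        return -np.sum(np.where(p > 0, p * np.log(p), 0.0))
    return (1.0 - np.sum(p ** alpha)) / (alpha * (alpha - 1.0))

p = np.array([0.7, 0.2, 0.1])
print(tsallis_entropy(p, alpha=1.001))  # close to the Shannon entropy of p
print(tsallis_entropy(p, alpha=1.5))    # the alpha = 1.5 case discussed above
print(tsallis_entropy(p, alpha=2.0))    # Gini / "sparsemax" entropy: (1 - ||p||^2) / 2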
February 10, 2025 at 1:03 PM
The reason is that the usual duality theory still works when we work in the spaces of functions and probability measures, while it doesn't if we work in the space of network parameters. We need to apply duality first and then parameterize, not the other way around!
January 31, 2025 at 1:53 PM
Surprisingly, we found that we still obtain good performance even if we use the classical softargmax at inference time and our losses at train time. This means that we can keep the inference code the same and just change the training code, which is useful, e.g., for open-weight LMs.
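A minimal sketch of that split (my own illustration, not code from the paper): only the training loss is swapped out, while the inference path keeps the classical softargmax. custom_train_loss is a hypothetical placeholder; for simplicity it falls back to cross-entropy here, but any of the f-divergence losses could be plugged in instead.

import numpy as np

def softmax(z):
    # Classical softargmax, left unchanged at inference time.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def custom_train_loss(logits, one_hot):
    # Hypothetical placeholder used only at training time; here it is just
    # cross-entropy (logsumexp(logits) - <logits, one_hot>), written stably.
    m = logits.max()
    return np.log(np.sum(np.exp(logits - m))) + m - np.dot(logits, one_hot)

logits = np.array([2.0, -1.0, 0.3])
one_hot = np.array([1.0, 0.0, 0.0])
loss = custom_train_loss(logits, one_hot)  # swapped in during training only
probs = softmax(logits)                    # inference path stays untouched
print(loss, probs)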
January 31, 2025 at 12:06 PM
We obtain good performance across several language modeling tasks with the alpha-divergence, for alpha=1.5.
January 31, 2025 at 12:06 PM
The table below summarizes the link between some entropies and f-divergences.
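One row of that correspondence is easy to check by hand: taking the f-divergence to be the KL divergence and the prior q to be uniform over k classes recovers the Shannon (neg)entropy up to a constant, which is why that case reduces to the usual softargmax / cross-entropy setting:

\mathrm{KL}(p \,\|\, u) = \sum_{i=1}^{k} p_i \log \frac{p_i}{1/k} = -H(p) + \log k .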
January 31, 2025 at 12:06 PM
2) We instantiate Fenchel-Young losses with f-divergence regularization. This generalizes the cross-entropy loss in two directions: i) by replacing the KL with f-divergences and ii) by allowing non-uniform prior class weights. Each loss is associated with an f-softargmax operator.
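As a reminder of the framework this builds on, the Fenchel-Young loss generated by a regularizer Omega is L_Omega(theta; y) = Omega*(theta) + Omega(y) - <theta, y>. A minimal sketch, assuming Omega is the Shannon negentropy (the KL-to-uniform case, up to a constant), which recovers ordinary cross-entropy; the function names here are mine, not the paper's.

import numpy as np
from scipy.special import logsumexp

def shannon_neg_entropy(p):
    # Omega(p) = sum_i p_i log p_i, with the convention 0 log 0 = 0.
    p = np.asarray(p, dtype=float)
    return np.sum(np.where(p > 0, p * np.log(p), 0.0))

def fy_loss(theta, y, Omega, Omega_conjugate):
    # Fenchel-Young loss: L_Omega(theta; y) = Omega*(theta) + Omega(y) - <theta, y>.
    return Omega_conjugate(theta) + Omega(y) - np.dot(theta, y)

theta = np.array([1.2, -0.3, 0.5])  # logits
y = np.array([0.0, 1.0, 0.0])       # one-hot ground truth
loss = fy_loss(theta, y, shannon_neg_entropy, logsumexp)
# For Omega = Shannon negentropy on the simplex, Omega* = logsumexp, so the
# loss reduces to logsumexp(theta) - theta[true class], i.e. cross-entropy.
assert np.isclose(loss, logsumexp(theta) - theta[1])
print(loss)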
January 31, 2025 at 12:06 PM
Our approach naturally generalizes to Fenchel-Young losses, allowing us to obtain the first tractable approach for optimizing the sparsemax loss in general combinatorial spaces.
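For reference, here is the base (probability simplex) case of the sparsemax operator, i.e., the Euclidean projection of the scores onto the simplex (Martins & Astudillo, 2016). The combinatorial-space extension mentioned above is not shown; this sketch is mine.

import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex:
    # argmin_p ||p - z||^2  s.t.  p >= 0, sum(p) = 1.
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum      # coordinates kept in the support
    k_z = k[support][-1]
    tau = (cumsum[support][-1] - 1.0) / k_z  # threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax(np.array([2.0, 1.0, 0.1])))  # -> [1. 0. 0.] (sparse output)
print(sparsemax(np.array([1.1, 1.0, 0.5])))  # -> [0.55 0.45 0.  ]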
January 31, 2025 at 12:06 PM
We propose a new joint formulation for learning the EBM and the log-partition, and an MCMC-free doubly stochastic optimization scheme with unbiased gradients.
January 31, 2025 at 12:06 PM
Pushing this idea a little bit further, we can parameterize the log-partition as a separate neural network. This allows us to evaluate the *learned* log-partition on new data points.
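A hypothetical sketch of what this could look like in the conditional setting, where the log-partition depends on the input x. Everything below, including the tiny linear "networks" and their names, is my own illustration and not the paper's architecture.

import numpy as np

rng = np.random.default_rng(0)

# A conditional EBM over 3 discrete outputs, with separate models for the
# energy and for the log-partition.
W_E = rng.normal(size=(4, 3))  # parameters "theta" of the energy E_theta(x, y)
w_c = rng.normal(size=4)       # parameters "phi" of the log-partition model c_phi(x)

def energy(x, y):
    # E_theta(x, y): a linear energy, for illustration only.
    return x @ W_E[:, y]

def learned_log_partition(x):
    # c_phi(x): a separate (here linear, untrained) model meant to track
    # log Z_theta(x), which can then be evaluated on new inputs x.
    return x @ w_c

def exact_log_partition(x):
    # Exact log Z_theta(x), tractable here only because there are 3 outputs.
    return np.log(np.sum(np.exp([-energy(x, y) for y in range(3)])))

x_new = rng.normal(size=4)
print(learned_log_partition(x_new), exact_log_partition(x_new))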
January 31, 2025 at 12:06 PM
By treating the log-partition not as a quantity to compute but as a variable to optimize, we no longer need it to be exact (in machine learning we never look for exact solutions to optimization problems!).
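One standard way to make this concrete (not necessarily the paper's exact formulation): for an unnormalized model e^{-E_theta(x)} with partition function Z = \int e^{-E_\theta(x)} dx and any proposal distribution q, the log-partition admits the variational characterization

\log Z \;=\; \min_{c} \; c + \mathbb{E}_{x \sim q}\!\left[\frac{e^{-E_\theta(x) - c}}{q(x)}\right] - 1,

with the minimum attained at c = log Z. Because the exponential term enters linearly rather than inside a log, Monte Carlo samples from q yield unbiased stochastic gradients in both theta and c, with no MCMC.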
January 31, 2025 at 12:06 PM
1) EBMs are generally challenging to train due to the partition function (normalization constant). At first, learning the partition function seems weird O_o But the log-partition exactly coincides with the Lagrange multiplier (dual variable) associated with the normalization (equality) constraint.
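A quick way to see this in the discrete case (up to an additive constant that depends on how the constraint is written): the Gibbs distribution solves

\max_{p} \; -\sum_x p(x)\, E(x) \;-\; \sum_x p(x) \log p(x) \quad \text{s.t.} \quad \sum_x p(x) = 1 .

Stationarity of the Lagrangian with multiplier lambda for the equality constraint gives -E(x) - log p(x) - 1 - lambda = 0, i.e. p(x) proportional to e^{-E(x)}, and enforcing normalization yields lambda = log Z - 1: the dual variable is the log-partition, up to a constant that can be absorbed.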
January 31, 2025 at 12:06 PM
Huge congrats!
January 21, 2025 at 12:14 PM