Laurent Jacob
laurentjacob.bsky.social
Researcher in statistics and machine learning for genomics

https://laurent-jacob.github.io/
The decisions for LEGEND are out: legend2025.sciencesconf.org/data/book_le...

I'm really looking forward to these 21 exciting presentations (and 30 additional posters) this December.

If you want to attend too, registration is open until October 17th at legend2025.sciencesconf.org
October 8, 2025 at 11:04 AM
Come hear about the latest advances in the field and discuss your own work at Centre Paul Langevin in beautiful Aussois.
February 24, 2025 at 8:58 AM
Burak Yelmen from the University of Tartu will give a keynote presentation on "A perspective on generative neural networks in genomics with applications in synthetic data generation".
February 24, 2025 at 8:58 AM
Claudia Solís-Lemus from the University of Wisconsin-Madison will give a keynote presentation on "The good, the bad and the ugly of deep learning in phylogenetic inference".
February 24, 2025 at 8:58 AM
Anne-Florence Bitbol from EPFL will give a keynote presentation on "Coevolution-aware language models".
February 24, 2025 at 8:58 AM
The next LEGEND conference on machine learning for evolutionary genomics will be in Aussois (French Alps) between December 8th and 12th.

Mark your calendars and make sure your best work is ready next September when the call for abstracts opens 🙂

legend2025.sciencesconf.org
February 24, 2025 at 8:58 AM
All this work was done by Luca Nesterenko and @lblassel.bsky.social, assisted by P. Veber, Bastien Boussau and myself.

The code and data are available at github.com/lucanest/Phy...

Please share if you find this interesting, and we welcome your feedback :)
June 24, 2024 at 8:35 AM
In all these experiments, and regardless of model complexity, Phyloformer run on a GPU was the fastest method.

It was about two orders of magnitude faster than IQTree, and even twice as fast as FastME.
June 24, 2024 at 8:33 AM
We then trained Phyloformer under a more realistic model, accounting for co-evolution.

It outperformed all other methods, including IQTree/FastTree, on all metrics.
June 24, 2024 at 8:32 AM
More precisely, Phyloformer was very good at predicting distances, and scored well on the Kuhner-Felsenstein metric, which accounts for both topology and branch lengths.

Looking at the topology only (Robinson-Foulds metric), it performed less well than IQTree/FastTree, but better than FastME.
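
For readers unfamiliar with the Robinson-Foulds metric: it counts the bipartitions (splits of the leaf set induced by internal branches) that appear in one tree but not in the other. Below is a minimal sketch, assuming trees are written as nested tuples of leaf names; the function names and example trees are made up for illustration and are not taken from the Phyloformer code.

```python
def leaves(tree):
    """Set of leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree given as nested tuples."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick one side of each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        clade = leaves(node)
        # Skip trivial splits (a single leaf on one side) and the root itself.
        if 1 < len(clade) < len(all_leaves) - 1:
            # Canonical form: store the side that does not contain the anchor.
            side = clade if anchor not in clade else all_leaves - clade
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def robinson_foulds(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


true_tree = (("A", "B"), "C", ("D", "E"))
inferred = (("A", "C"), "B", ("D", "E"))
rf = robinson_foulds(true_tree, inferred)
```

Here the two trees share the split separating D and E from the rest but disagree on the other one, so `rf` is 2. Branch lengths play no role, which is why this metric looks at topology only.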
June 24, 2024 at 8:32 AM
We first trained Phyloformer to perform inference under LG, a common model under which likelihood computation is possible.

It performed much better than FastME (a distance method) and on par with maximum likelihood approaches (IQTree, FastTree).
June 24, 2024 at 8:31 AM
Phyloformer uses self-attention to progressively share information among and between sequences.

This choice makes our function invariant to the order of the input sequences (any order yields the same output phylogeny).
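
A toy illustration of why attention gives this property: without positional information, self-attention is permutation-equivariant, so shuffling the inputs only shuffles the outputs in the same way. The sketch below assumes a single head with identity query/key/value projections and tiny made-up embeddings; it is not the Phyloformer architecture.

```python
import math


def self_attention(X):
    """Single-head dot-product self-attention with identity projections.

    X is a list of vectors (one per sequence); returns one output per input.
    """
    dim = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in X]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, X))
                    for j in range(dim)])
    return out


# Three made-up sequence embeddings, then the same ones in another order.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
X_shuffled = [X[i] for i in perm]

Y = self_attention(X)
Y_shuffled = self_attention(X_shuffled)

# Each sequence gets the same output vector wherever it sits in the input.
same = all(math.isclose(Y_shuffled[r][j], Y[perm[r]][j])
           for r in range(len(perm)) for j in range(len(X[0])))
```

Since every sequence receives the same representation regardless of its position, quantities derived per pair of sequences, such as distances, do not depend on the input order either.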
June 24, 2024 at 8:30 AM
Once trained, Phyloformer provides estimates of all evolutionary distances given the sequences.

But each of these distance estimates is informed by the entire set of sequences, not just the corresponding pair!

We then pass them to FastME, a distance method, to obtain a tree.
June 24, 2024 at 8:29 AM
Phyloformer is a learnable function. Its input is a set of sequences, its output is their phylogeny, represented by evolutionary distances between all pairs of sequences.

We optimize this function on a large number of (phylogeny, sequences) pairs sampled from the probabilistic model.
June 24, 2024 at 8:28 AM
This is where likelihood-free/simulation-based inference comes into play.

Sampling trees and sequences under a probabilistic model is possible under much more complex models, for which likelihood computations would be prohibitive.
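
To make "sampling under a probabilistic model" concrete, here is a minimal sketch that evolves a random root sequence down a fixed tree. It assumes a deliberately simplistic equal-rates model over the 20 amino acids (not LG, and not the simulators used for Phyloformer); the tree encoding as `(label, branch_length, children)` and all names are hypothetical.

```python
import math
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 amino acids


def evolve(seq, t, rng):
    """Evolve seq along a branch of length t (expected substitutions per site)."""
    k = len(ALPHABET)
    # Probability that a site ends in a different state under equal rates.
    p_change = (1 - 1 / k) * (1 - math.exp(-k * t / (k - 1)))
    out = []
    for s in seq:
        if rng.random() < p_change:
            out.append(rng.choice([a for a in ALPHABET if a != s]))
        else:
            out.append(s)
    return "".join(out)


def sample_alignment(tree, n_sites, rng):
    """Draw a random root sequence and evolve it down to the leaves."""
    root_seq = "".join(rng.choice(ALPHABET) for _ in range(n_sites))
    leaf_seqs = {}

    def descend(node, parent_seq):
        label, length, children = node
        seq = evolve(parent_seq, length, rng)
        if not children:
            leaf_seqs[label] = seq
        for child in children:
            descend(child, seq)

    descend(tree, root_seq)
    return leaf_seqs


tree = ("root", 0.0, [("A", 0.1, []),
                      ("x", 0.2, [("B", 0.1, []), ("C", 0.3, [])])])
rng = random.Random(0)
alignment = sample_alignment(tree, 50, rng)
```

Each call yields one (phylogeny, sequences) training example. Nothing here requires evaluating a likelihood, so the same recipe still works when `evolve` is replaced by a far richer model.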

It's an alternative way to access the model.
June 24, 2024 at 8:27 AM
Maximum likelihood approaches, on the other hand, search for the most likely tree jointly over all sequences.

This makes them accurate but slow. It also restricts these approaches to simplistic models under which likelihood computations are fast enough.
June 24, 2024 at 8:26 AM
Knowing the evolutionary distances (sum of branch lengths) between all pairs of sequences is enough to recover the tree, by hierarchical clustering.

Distance methods rely on this idea, with estimates from pairs of sequences taken separately. This makes them fast but inaccurate.
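
As an illustration of recovering a tree from exact distances, here is a minimal sketch using neighbor joining, one standard agglomerative scheme (the choice of algorithm is an assumption made for this example). The distance table is made up: it comes from a four-leaf tree where A and B are neighbors with branch lengths 1 and 4, C and D are neighbors with branch lengths 4 and 1, and the internal branch has length 1.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build a tree topology (nested tuples) from pairwise distances.

    dist maps frozenset({a, b}) to the distance between a and b.
    """
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the others.
        r = {u: sum(d[frozenset((u, v))] for v in nodes if v != u)
             for u in nodes}
        # Join the pair minimizing the neighbor-joining criterion Q.
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        new = (a, b)
        for k in nodes:
            if k not in (a, b):
                # Distance from the new internal node to every remaining node.
                d[frozenset((new, k))] = 0.5 * (d[frozenset((a, k))]
                                                + d[frozenset((b, k))]
                                                - d[frozenset((a, b))])
        nodes = [k for k in nodes if k not in (a, b)] + [new]
    return tuple(nodes)


pairs = {("A", "B"): 5, ("A", "C"): 6, ("A", "D"): 3,
         ("B", "C"): 9, ("B", "D"): 6, ("C", "D"): 5}
dist = {frozenset(p): v for p, v in pairs.items()}
tree = neighbor_joining(["A", "B", "C", "D"], dist)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

Note that A and D are the closest pair (distance 3) even though they are not neighbors in the tree, so naively merging the closest pair would fail here. With exact distances the correct topology is still recovered; in practice the distances are noisy estimates, which is where errors creep in.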
June 24, 2024 at 8:25 AM
Phylogenetic trees describe how related sequences (at the leaves) evolved from a common ancestor. Internal nodes are successive ancestral sequences.

In probabilistic models, branch lengths represent an expected number of substitutions between the sequences at the two ends.
June 24, 2024 at 8:25 AM
We just released a preprint for Phyloformer, a likelihood-free inference method for phylogenetic reconstruction: biorxiv.org/content/10.1...

Faster than distance methods like neighbor joining, it outperforms maximum likelihood methods under complex models of sequence evolution.

🧵
June 24, 2024 at 8:24 AM