Gonzalo Benegas
@gonzalobenegas.bsky.social
Comp Bio Postdoc @ UC Berkeley
https://gonzalobenegas.github.io/
Thank you for contributing to bioicons! Sorry I forgot to add to acknowledgements, I will in the final version!
February 15, 2025 at 7:29 PM
Thank you Remi!
February 14, 2025 at 6:34 PM
TraitGym is available on Hugging Face, including a Colab notebook to evaluate a model in a few minutes:
huggingface.co/datasets/son...
songlab/TraitGym · Datasets at Hugging Face
February 13, 2025 at 8:57 PM
Check out the paper for more details, including stratification by consequence, trait, and eQTL:
www.biorxiv.org/content/10.1...
Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics
February 13, 2025 at 8:57 PM
Scaling is probably part of the solution, but data curation might be the major bottleneck. The vast majority of bases in mammalian genomes lack evolutionary constraint, which is precisely the signal leveraged by self-supervision.
February 13, 2025 at 8:57 PM
Alignment-free DNA language models are not yet competitive. The best among them, our GPN-Promoter and SpeciesLM from @gagneurlab.bsky.social, are not the largest in parameter count or context length. Their key feature is having been trained only on functional regions of the genome.
February 13, 2025 at 8:57 PM
Conservation-aware CADD and GPN-MSA do better on Mendelian trait variants, expected to be under strong purifying selection. On complex trait variants, especially for non-disease traits, functional-genomics models Enformer and Borzoi tend to do better. However, ensembling helps:
February 13, 2025 at 8:57 PM
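One simple way to ensemble scores from models on very different scales (illustrative sketch only, not necessarily the paper's exact recipe) is to average their ranks, which needs no calibration between the two scores:

```python
# Illustrative sketch: combine two variant-effect scores by rank averaging.
# Higher score = more likely causal. The example scores below are made up.

def rank_average(scores_a, scores_b):
    """Average the ranks of two score lists of equal length."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        r = [0.0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    ra, rb = ranks(scores_a), ranks(scores_b)
    return [(a + b) / 2 for a, b in zip(ra, rb)]

# Hypothetical scores for four variants from two models on different scales:
conservation = [0.1, 0.9, 0.5, 0.2]   # e.g. a CADD-like score
functional = [12.0, 3.0, 50.0, 1.0]   # e.g. a functional-genomics score
print(rank_average(conservation, functional))  # -> [1.0, 2.0, 2.5, 0.5]
```

Variant 2 ends up ranked highest because it scores well under both models, even though neither model alone ranks it first.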
We evaluate models zero-shot (unsupervised) and with linear probing (logistic regression on top of extracted features):
February 13, 2025 at 8:57 PM
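Linear probing can be sketched in a few lines: fit a logistic regression on frozen features and score it with cross-validated AUROC. Here the features are synthetic random vectors standing in for model embeddings; everything below is an illustrative assumption, not the benchmark's actual pipeline.

```python
# Minimal linear-probing sketch: logistic regression on frozen features.
# X stands in for embeddings extracted from a sequence model; y labels
# variants as putative causal (1) vs matched control (0).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.normal(size=(n, d))             # synthetic "extracted features"
w_true = rng.normal(size=d)             # hidden linear signal in the features
y = (X @ w_true + rng.normal(size=n) > 0).astype(int)

probe = LogisticRegression(max_iter=1000)
auc = cross_val_score(probe, X, y, cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUROC: {auc:.2f}")
```

The probe never updates the upstream model; only the logistic-regression weights are trained, which is what makes this a test of the representations rather than of fine-tuning capacity.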
We evaluate a wide range of models with up to 7B parameters and 500K context size. Do these numbers matter? 🤔
February 13, 2025 at 8:57 PM
We collect putative causal variants from OMIM and UKBB with carefully matched controls.
February 13, 2025 at 8:57 PM
I still believe in alignment-free gLMs with better data curation and loss functions; I've been seeing advances, but it's still a tough problem.
February 2, 2025 at 7:36 PM
*An exception is alignment-based gLMs, which do improve (non-trivially) over conservation scores.
February 2, 2025 at 7:36 PM
A simple bar is: do you surpass conservation scores in identifying functional mutations? This bar was easily passed by pLMs and plant gLMs, but not yet by human gLMs*, even after 5 years.
February 2, 2025 at 7:35 PM
Thank you Jo!
January 12, 2025 at 7:34 PM