Gonzalo Benegas
@gonzalobenegas.bsky.social
Comp Bio Postdoc @ UC Berkeley
https://gonzalobenegas.github.io/
Thank you for contributing to bioicons! Sorry I forgot to add to acknowledgements, I will in the final version!
February 15, 2025 at 7:29 PM
Thank you Remi!
February 14, 2025 at 6:34 PM
TraitGym is available on Hugging Face, including a Colab notebook to evaluate a model in a few minutes:
huggingface.co/datasets/son...
songlab/TraitGym · Datasets at Hugging Face
February 13, 2025 at 8:57 PM
Check out the paper for more details, including stratification by consequence, trait, and eQTL:
www.biorxiv.org/content/10.1...
Benchmarking DNA Sequence Models for Causal Regulatory Variant Prediction in Human Genetics
February 13, 2025 at 8:57 PM
Scaling is probably part of the solution, but data curation might be the major bottleneck. The vast majority of bases in mammalian genomes lack evolutionary constraint, which is precisely the signal leveraged by self-supervision.
February 13, 2025 at 8:57 PM
Alignment-free DNA language models are not yet competitive. The best among them, our GPN-Promoter and SpeciesLM from @gagneurlab.bsky.social, are not the largest in parameter count or context length. Their key feature is having been trained only on functional regions of the genome.
February 13, 2025 at 8:57 PM
Conservation-aware CADD and GPN-MSA do better on Mendelian trait variants, expected to be under strong purifying selection. On complex trait variants, especially for non-disease traits, functional-genomics models Enformer and Borzoi tend to do better. However, ensembling helps:
February 13, 2025 at 8:57 PM
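One simple way to ensemble scores from models on very different scales (illustrative sketch only, not necessarily the paper's exact recipe) is to average their ranks, which needs no calibration between the two scores:

```python
# Illustrative sketch: combine two variant-effect scores by rank averaging.
# Higher score = more likely causal. The example scores below are made up.

def rank_average(scores_a, scores_b):
    """Average the ranks of two score lists of equal length."""
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        r = [0.0] * len(scores)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    ra, rb = ranks(scores_a), ranks(scores_b)
    return [(a + b) / 2 for a, b in zip(ra, rb)]

# Hypothetical scores for four variants from two models on different scales:
conservation = [0.1, 0.9, 0.5, 0.2]   # e.g. a CADD-like score
functional = [12.0, 3.0, 50.0, 1.0]   # e.g. a functional-genomics score
print(rank_average(conservation, functional))  # -> [1.0, 2.0, 2.5, 0.5]
```

Variant 2 ends up ranked highest because it scores well under both models, even though neither model alone ranks it first.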
We evaluate models zero-shot (unsupervised) and with linear probing (logistic regression on top of extracted features):
February 13, 2025 at 8:57 PM
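Linear probing can be sketched in a few lines: fit a logistic regression on frozen features and score it with cross-validated AUROC. Here the features are synthetic random vectors standing in for model embeddings; everything below is an illustrative assumption, not the benchmark's actual pipeline.

```python
# Minimal linear-probing sketch: logistic regression on frozen features.
# X stands in for embeddings extracted from a sequence model; y labels
# variants as putative causal (1) vs matched control (0).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 200, 16
X = rng.normal(size=(n, d))             # synthetic "extracted features"
w_true = rng.normal(size=d)             # hidden linear signal in the features
y = (X @ w_true + rng.normal(size=n) > 0).astype(int)

probe = LogisticRegression(max_iter=1000)
auc = cross_val_score(probe, X, y, cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUROC: {auc:.2f}")
```

The probe never updates the upstream model; only the logistic-regression weights are trained, which is what makes this a test of the representations rather than of fine-tuning capacity.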
We evaluate a wide range of models with up to 7B parameters and 500K context size. Do these numbers matter? 🤔
February 13, 2025 at 8:57 PM
We collect putative causal variants from OMIM and UKBB with carefully matched controls.
February 13, 2025 at 8:57 PM
I still believe in alignment-free gLMs with better data curation and loss functions; I've been seeing advances, but it's still a tough problem.
February 2, 2025 at 7:36 PM
*An exception is alignment-based gLMs, which do improve (non-trivially) over conservation scores.
February 2, 2025 at 7:36 PM
A simple bar is: do you surpass conservation scores in identifying functional mutations? This bar was easily passed by pLMs and plant gLMs, but not yet by human gLMs*, even after 5 years.
February 2, 2025 at 7:35 PM
Thank you Jo!
January 12, 2025 at 7:34 PM