Surag Nair
suragnair.bsky.social
Machine learning and genetics @Genentech. Previously CS PhD @Stanford.

suragnair.github.io
We trained a small fLM on base-resolution ATAC-seq. It can invert the signal to recover genotype information with high accuracy, even with as few as 5 million reads per sample. This has immediate privacy implications for sharing fragment files. 10/
November 10, 2025 at 9:01 PM
Functional genotyping: scATAC-seq has taken off. Fragment files are the de facto file format. They are treated as privacy-preserving, often shared openly even when raw reads are access-controlled. Using AFGR data, we find that common variants alter base-res ATAC-seq profiles. 9/
November 10, 2025 at 9:01 PM
fLMs are also discrete diffusion models! Nona fLM can generate DNA under functional constraints, e.g. sequences producing weak, strong, left-skewed, or even double-humped DNase-seq profiles.

They allow parallel decoding, with competitive performance at fewer generation steps. 8/
November 10, 2025 at 9:01 PM
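To make "parallel decoding" in the post above concrete, here is a minimal sketch of iterative unmasking as it is commonly done for masked discrete diffusion. It is not Nona's sampler: `toy_denoiser`, `parallel_decode`, the `"N"` mask symbol, and the confidence-ordered reveal schedule are all assumptions for illustration, the toy denoiser ignores any functional conditioning, and the sketch picks the argmax base where a real sampler would typically sample.

```python
import math

MASK = "N"          # hypothetical placeholder for a masked base
ALPHABET = "ACGT"

def toy_denoiser(tokens):
    """Hypothetical stand-in for a trained model.

    Returns, for every position, a probability distribution over A/C/G/T.
    A real model would condition on the unmasked tokens and on the
    functional tracks; this toy version depends only on the position.
    """
    probs = []
    for i, _ in enumerate(tokens):
        favored = i % 4                  # index of the preferred base
        peak = 0.4 + 0.05 * (i % 7)      # confidence varies by position
        rest = (1.0 - peak) / 3.0
        probs.append([peak if b == favored else rest for b in range(4)])
    return probs

def parallel_decode(length, steps, denoiser=toy_denoiser):
    """Start fully masked and reveal several positions per step."""
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        probs = denoiser(tokens)
        # Reveal an equal share of the remaining masked positions each step,
        # so the final step always reveals everything that is left.
        n_reveal = math.ceil(len(masked) / (steps - step))
        # Commit the most confident positions first.
        masked.sort(key=lambda i: max(probs[i]), reverse=True)
        for i in masked[:n_reveal]:
            tokens[i] = ALPHABET[probs[i].index(max(probs[i]))]
    return "".join(tokens)

if __name__ == "__main__":
    # Fewer steps means more positions are committed in parallel per step.
    print(parallel_decode(length=12, steps=3))
```

The `steps` argument is the knob the post refers to: with `steps` equal to the sequence length the loop reveals one base at a time, and with fewer steps it commits many bases per model call.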
Functional language models (fLMs): DNA LMs are great at capturing co-evolutionary sequence patterns, but can't connect them to cell-type-specific regulation. An fLM conditioned on GM12878 DNase-seq picks up more transcription factor motif features than plain LMs. 7/
November 10, 2025 at 9:01 PM
The context-aware model also improves predictions of promoter expression across diverse integration sites as measured by TRIP-seq experiments. 6/
November 10, 2025 at 9:01 PM
Turns out the biggest gains are at loci showing outlier chromatin states. Here's an example of a heterochromatinized locus where the sequence-only model gets the locus wrong, but the context-aware model rescues the local prediction. 5/
November 10, 2025 at 9:01 PM
Context-aware models: We improve local genomic predictions by providing flanking track measurements (~196 kb) as input. This outperforms sequence-to-function models by up to 13% on the test set. What's driving these improvements? 4/
November 10, 2025 at 9:01 PM
Multimodal masking provides a unified approach.

Nona operates on both DNA sequence and functional genomics tracks. Task-specific masking configurations recover familiar model types, and its flexibility enables entirely new approaches! 2/
November 10, 2025 at 9:01 PM
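One way to picture "task-specific masking configurations" is as a pair of boolean masks, one over the DNA sequence and one over the functional track, where `True` means the position is hidden from the model and must be predicted. The sketch below is only an illustration under that assumption: the function `make_masks`, the task names, the `center` window, and the 15% masking rate are hypothetical and are not taken from the Nona code or paper.

```python
import random

def make_masks(task, length, center=None, seq_mask_rate=0.15, seed=0):
    """Return (seq_mask, track_mask); True marks a position hidden from the model.

    All task names and defaults here are hypothetical illustrations.
    """
    rng = random.Random(seed)
    if task == "sequence_to_function":
        # Sequence fully visible, track fully hidden: predict track from DNA.
        seq_mask = [False] * length
        track_mask = [True] * length
    elif task == "context_aware":
        # Sequence visible, flanking track visible, central window hidden.
        lo, hi = center
        seq_mask = [False] * length
        track_mask = [lo <= i < hi for i in range(length)]
    elif task == "functional_lm":
        # Track visible, a random subset of bases hidden: predict DNA given function.
        seq_mask = [rng.random() < seq_mask_rate for _ in range(length)]
        track_mask = [False] * length
    else:
        raise ValueError(f"unknown task: {task}")
    return seq_mask, track_mask

if __name__ == "__main__":
    for task in ("sequence_to_function", "context_aware", "functional_lm"):
        seq_mask, track_mask = make_masks(task, length=8, center=(3, 5))
        print(task, seq_mask, track_mask)
```

Under this framing, switching between model types is a change to the masks rather than to the architecture, which is the sense in which a single masking framework could cover the different tasks.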
Excited to share Nona: a unifying multimodal masking framework for functional genomics.

Models for DNA have evolved along separate paths: sequence-to-function (AlphaGenome), language models (Evo2), and generative models (DDSM).

Can these be unified under a single paradigm? 1/15
November 10, 2025 at 9:01 PM