We made it easy to:
📊 Add datasets
🖥️ Add new models
📈 Benchmark reproducibly
We hope you will give it a try, and if you have any comments or ideas, feel free to reach out!
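To give a flavor of the workflow, here's a minimal self-contained sketch. The `Benchmark` class and its `add_dataset`/`add_model`/`run` methods are hypothetical stand-ins for illustration, not our actual API:

```python
# Illustrative sketch only: the Benchmark class and its method names are
# hypothetical stand-ins, not the actual API.
import numpy as np

class Benchmark:
    def __init__(self, seed=0):
        self.datasets, self.models = {}, {}
        self.seed = seed  # fixed seed -> reproducible splits across runs

    def add_dataset(self, name, sequences, labels):
        self.datasets[name] = (sequences, np.asarray(labels, dtype=float))

    def add_model(self, name, embed_fn):
        self.models[name] = embed_fn  # any callable: list[str] -> (n, d) array

    def run(self):
        rng = np.random.default_rng(self.seed)
        results = {}
        for dname, (seqs, y) in self.datasets.items():
            idx = rng.permutation(len(seqs))
            cut = int(0.8 * len(seqs))
            train, test = idx[:cut], idx[cut:]
            for mname, embed in self.models.items():
                X = embed(seqs)
                # Linear probe: least-squares fit on train, MSE on held-out test.
                w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
                results[(dname, mname)] = float(np.mean((X[test] @ w - y[test]) ** 2))
        return results

bench = Benchmark(seed=42)
bench.add_dataset(
    "toy_task",
    sequences=["AUGGCUAAUC", "AUGCGGCUAA", "AUGAAUCCGG", "AUGGGGCUAU",
               "AUGCCAUAGC", "AUGUUACGGA", "AUGCAGUACG", "AUGGAUUACC",
               "AUGCCCGAUA", "AUGACGUAGG"],
    labels=[0.2, 0.5, 0.1, 0.9, 0.4, 0.3, 0.7, 0.6, 0.8, 0.2],
)
bench.add_model("kmer_counts", lambda seqs: np.array(
    [[s.count("A"), s.count("C"), s.count("G"), s.count("U")] for s in seqs],
    dtype=float))
print(bench.run())
```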
1. genomic sequences != language -> we need training objectives suited to the heterogeneity inherent in genomic data
2. relevant benchmarking tasks and evaluating generalizability are important as these models start to be used for therapeutic design
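On point 1, here's what an MLM-style objective over nucleotide tokens can look like. The 15% mask rate and 4-letter vocabulary below are illustrative assumptions, not our exact training recipe:

```python
# Minimal sketch of MLM-style masking over nucleotide tokens.
# The 15% mask rate and 4-letter vocabulary are illustrative assumptions,
# not the exact training recipe.
import numpy as np

VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3}
MASK_ID = len(VOCAB)  # extra token id reserved for [MASK]

def mask_sequence(seq, mask_rate=0.15, seed=0):
    rng = np.random.default_rng(seed)
    tokens = np.array([VOCAB[c] for c in seq])
    is_masked = rng.random(len(tokens)) < mask_rate
    inputs = np.where(is_masked, MASK_ID, tokens)   # model sees [MASK] here
    targets = np.where(is_masked, tokens, -100)     # loss computed only on masked sites
    return inputs, targets

inputs, targets = mask_sequence("AUGGCCAUUGAAUCGCGAUAG", seed=1)
print(inputs)
print(targets)
```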
We know that uAUGs reduce translation and strong Kozak sequences enhance it, so we trained linear probes on 3 subsets of these features and tested on the held-out set.
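Roughly, the probing setup looks like this. The synthetic data and feature names (n_uaug, kozak_strength) are placeholders, not the real features or dataset:

```python
# Rough sketch of the linear-probing setup: synthetic data and feature
# names (n_uaug, kozak_strength) are placeholders, not the real features.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
n_uaug = rng.integers(0, 4, size=n)      # upstream AUG count
kozak_strength = rng.random(n)           # 0 = weak context, 1 = strong context
# Toy translation signal: uAUGs hurt, strong Kozak helps, plus noise.
y = -0.8 * n_uaug + 1.2 * kozak_strength + 0.1 * rng.standard_normal(n)

feature_subsets = {
    "uaug_only": np.column_stack([n_uaug]),
    "kozak_only": np.column_stack([kozak_strength]),
    "both": np.column_stack([n_uaug, kozak_strength]),
}

for name, X in feature_subsets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    print(name, round(probe.score(X_te, y_te), 3))  # held-out R^2
```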
✅ Orthrus+MLM matches/beats SOTA on 6/10 datasets without more training data or a significant increase in model parameters
📈 Pareto-dominates all models larger than 10 million parameters, including Evo2
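For context, a model Pareto-dominates another when it is no worse on every axis (here: fewer parameters, higher score) and strictly better on at least one. A minimal check with made-up numbers:

```python
# Minimal Pareto-dominance check over (parameter count, benchmark score).
# Model names and numbers below are made up for illustration.
def dominates(a, b):
    """a dominates b if a has <= params, >= score, and is strictly better on one axis."""
    fewer_or_equal = a["params"] <= b["params"]
    better_or_equal = a["score"] >= b["score"]
    strictly_better = a["params"] < b["params"] or a["score"] > b["score"]
    return fewer_or_equal and better_or_equal and strictly_better

models = [
    {"name": "small+MLM", "params": 10e6,  "score": 0.81},
    {"name": "mid",       "params": 500e6, "score": 0.78},
    {"name": "large",     "params": 7e9,   "score": 0.80},
]

for m in models:
    dominated_by = [o["name"] for o in models if o is not m and dominates(o, m)]
    print(m["name"], "dominated by:", dominated_by or "nobody")
```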
Interestingly, the 2nd-best model, Orthrus, was not far behind Evo2 despite having only 10 million parameters.
In total, we conducted over 135,000 experiments!
So what did we learn? 👇🏽