Austin Wang
@austintwang.bsky.social
Stanford CS PhD student working on ML/AI for genomics with @anshulkundaje.bsky.social

austintwang.com
I think that’ll be interesting to look into further! The profile information does not convey overall accessibility since it’s normalized, but maybe this sort of multi-task training could help.
December 14, 2024 at 3:24 PM
Thank you for the kind words! Yes, we use unmodified ChromBPNet models, which include a profile head and a bias model. However, these evaluations use only the count head.
December 11, 2024 at 6:14 AM
(9/10) How do we train more effective DNALMs? Use better data and objectives:
• Nailing short-context tasks before long-context
• Data sampling to account for class imbalance (sketched below)
• Conditioning on cell type context
These strategies use external annotations, which are plentiful!
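For example, class-imbalance-aware sampling can be as simple as reweighting examples by inverse class frequency. A minimal sketch, assuming PyTorch and dummy placeholder data rather than the actual DART-Eval datasets:

```python
# Sketch of class-balanced sampling for sparse functional elements.
# The sequences/labels here are dummy placeholders, not real data.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

labels = torch.tensor([0] * 9000 + [1] * 1000)        # 1 = functional element (rare)
sequences = torch.randint(0, 4, (len(labels), 500))   # dummy tokenized DNA

# Weight each example by the inverse frequency of its class so that
# rare functional elements appear as often as background in each batch.
class_counts = torch.bincount(labels)
weights = (1.0 / class_counts.float())[labels]
sampler = WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)

loader = DataLoader(TensorDataset(sequences, labels), batch_size=64, sampler=sampler)
```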
December 11, 2024 at 2:30 AM
(8/10) This indicates that DNALMs inconsistently learn functional DNA. We believe that the culprit is not architecture, but rather the sparse and imbalanced distribution of functional DNA elements.

Given their resource requirements, current DNALMs are a hard sell.
December 11, 2024 at 2:30 AM
(7/10) DNALMs struggle with more difficult tasks.
Furthermore, small models trained from scratch (<10M params) routinely outperform much larger DNALMs (>1B params), even after LoRA fine-tuning!
Our results on the hardest task: counterfactual variant effect prediction.
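For context, counterfactual variant effect prediction scores a variant by how much a model's prediction changes when a single base is swapped. A rough sketch, where the `model.predict` interface is a hypothetical stand-in rather than any specific model's API:

```python
# Sketch of counterfactual variant effect scoring with a generic sequence model.
# `model.predict` is a hypothetical interface returning a scalar activity prediction.
def variant_effect(model, ref_seq: str, pos: int, alt_base: str) -> float:
    """Difference in predicted activity between the alternate and reference alleles."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return float(model.predict(alt_seq) - model.predict(ref_seq))
```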
December 11, 2024 at 2:30 AM
(6/10) We introduce DART-Eval, a suite of five biologically informed DNALM evaluations focused on transcriptional regulatory DNA, ordered by increasing difficulty.
December 11, 2024 at 2:30 AM

(5/10) Rigorous evaluations of DNALMs, though critical, are lacking. Existing benchmarks:
• Focus on surrogate tasks tenuously related to practical use cases
• Suffer from inadequate controls and other dataset design flaws
• Compare against outdated or inappropriate baselines
December 11, 2024 at 2:30 AM
(4/10) An effective DNALM should:
• Learn representations that can accurately distinguish different types of functional DNA elements (see the probe sketch below)
• Serve as a foundation for downstream supervised models
• Outperform models trained from scratch
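A common way to test the first point is a linear probe on frozen embeddings: if a simple classifier on top of the representations separates element classes, the representation has captured them. A sketch, assuming a hypothetical `embed_sequence` helper that maps a sequence to a fixed-length vector:

```python
# Linear probe on frozen DNALM embeddings.
# `embed_sequence` is a hypothetical helper, not part of any specific DNALM API.
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(embed_sequence, train_seqs, train_labels, test_seqs, test_labels):
    X_train = np.stack([embed_sequence(s) for s in train_seqs])
    X_test = np.stack([embed_sequence(s) for s in test_seqs])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return clf.score(X_test, test_labels)   # held-out accuracy of the probe
```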
December 11, 2024 at 2:30 AM
(3/10) However, DNA is vastly different from text, being much more heterogeneous, imbalanced, and sparse. Imagine a blend of several different languages interspersed with a load of gibberish.
December 11, 2024 at 2:30 AM
(2/10) DNALMs are a new class of self-supervised models for DNA, inspired by the success of LLMs. They are typically pre-trained solely on genomic DNA, without any external annotations.
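For intuition, self-supervised pre-training on DNA often looks like masked-token prediction over the four bases. A toy sketch in PyTorch; the tiny encoder below is illustrative only and does not correspond to any published DNALM:

```python
# Toy masked-language-modeling step on tokenized DNA (A/C/G/T -> 0..3, [MASK] -> 4).
import torch
import torch.nn as nn

tokens = torch.randint(0, 4, (8, 200))              # batch of tokenized DNA sequences
mask = torch.rand(tokens.shape) < 0.15              # mask ~15% of positions
inputs = tokens.masked_fill(mask, 4)                # replace masked bases with [MASK]

embed = nn.Embedding(5, 64)
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(64, 4)                             # predict the original base

logits = head(encoder(embed(inputs)))
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
```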
December 11, 2024 at 2:30 AM