Gherman Novakovsky
gnovakovsky.bsky.social
PhD, Illumina AI lab
Yes, that's exactly what it is. Predicting the difference here is important.
June 10, 2025 at 4:40 PM
This ensures the model focuses on the actual variant and doesn't overfit to correlated but irrelevant features, which leads to better generalization.
June 10, 2025 at 6:47 AM
Certainly! Here the entire model is shared, two copies see inputs that differ only at a single base pair (a variant of interest), and the model weights are tuned to learn the difference in effect size correctly.
June 10, 2025 at 6:47 AM
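The weight-sharing described in this reply can be sketched in a few lines. This is a minimal numpy illustration, not the actual PromoterAI architecture: the "model" here is a hypothetical per-base linear scorer, and the point is only that the same weights score both the reference and alternate sequence, so the training signal is the difference between the two outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared "model": one weight per base (A, C, G, T).
# In a twin-network setup the SAME weights score both inputs.
W = rng.normal(size=(4,))

def one_hot(seq):
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    return np.eye(4)[[idx[b] for b in seq]]

def score(seq, W):
    # Shared network: identical weights for both twins.
    return (one_hot(seq) @ W).sum()

def predicted_effect(ref, alt, W):
    # Twin pass: same model, inputs differing at a single base;
    # the training target is the measured effect size of the variant.
    return score(alt, W) - score(ref, W)

ref = "ACGTACGT"
alt = "ACGAACGT"  # single-base substitution (T -> A)
effect = predicted_effect(ref, alt, W)
```

Because the inputs differ at only one position, everything else cancels and the predicted effect depends solely on the changed base, which is what keeps the model focused on the variant itself.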
Great question! That's our best guess as well and we highlight this in the paper by saying that MPRA experimental data from individual cell lines could have limitations for variant interpretation.
June 10, 2025 at 6:46 AM
Huge thanks to the amazing Illumina team—this was an incredible learning experience! I'm excited to keep pushing forward as we develop models to tackle gene expression and non-coding variant interpretation. (16/)
May 29, 2025 at 11:57 PM
A complementary thread from my colleague Kishore Jaganathan @kjaganatha.bsky.social bsky.app/profile/kjag... (15/)
We're thrilled to introduce PromoterAI — a tool for accurately identifying promoter variants that impact gene expression. 🧵 (1/)
May 29, 2025 at 11:57 PM
Want to learn more about PromoterAI?
📄 Read the paper: science.org/doi/10.1126/...
💻 Explore the code & precomputed scores: github.com/Illumina/Pro.... (14/)
Predicting expression-altering promoter mutations with deep learning
Only a minority of patients with rare genetic diseases are currently diagnosed by exome sequencing, suggesting that additional unrecognized pathogenic variants may reside in non-coding sequence. Here,...
science.org
May 29, 2025 at 11:57 PM
We followed up by testing promoter variants in Mendelian genes using MPRA. Surprisingly, PromoterAI was more effective than MPRA at prioritizing variants linked to patient phenotypes, highlighting limitations of MPRA for rare disease interpretation. (13/)
May 29, 2025 at 11:57 PM
While we found that adding data from additional species such as mouse does not substantially improve variant effect prediction on its own, it does help with ensembling. The final model is therefore an ensemble of two models: one trained on human data only and one trained on human and mouse data together. (12/)
May 29, 2025 at 11:57 PM
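The ensemble in (12/) can be sketched as follows. This is a minimal illustration assuming a simple average of the two members' variant scores; the stand-in "models" below just return fixed arrays, and the actual combination rule used in the paper may differ.

```python
import numpy as np

# Hypothetical stand-ins for the two trained ensemble members.
def human_only_model(variants):
    return np.array([0.8, -0.3, 0.1])

def human_mouse_model(variants):
    return np.array([0.6, -0.5, 0.2])

def ensemble_score(variants):
    # Final prediction: average of the human-only and human+mouse models.
    return (human_only_model(variants) + human_mouse_model(variants)) / 2

scores = ensemble_score(["v1", "v2", "v3"])
```

Averaging is the simplest way two differently trained models can complement each other: errors that are uncorrelated between the members partially cancel.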
In the Genomics England rare disease cohort, functional promoter variants predicted by PromoterAI were enriched in phenotype-matched Mendelian genes. These variants accounted for an estimated 6% of the rare disease genetic burden. (11/)
May 29, 2025 at 11:57 PM
In the UK biobank cohort, PromoterAI's predicted promoter variant effects correlated strongly with measured protein levels and quantitative traits, suggesting that promoter variants contribute meaningfully to phenotypic variation in the general population. (10/)
May 29, 2025 at 11:57 PM
PromoterAI's embeddings split promoters into three distinct classes: P1 (~9K genes, ubiquitously active), P2 (~3K genes, bivalent chromatin), E (~6K genes, enhancer-like). The E class, enriched for TATA boxes, may reflect enhancers co-opted as promoters. (9/)
May 29, 2025 at 11:57 PM
Fine-tuning improved PromoterAI’s ability to predict the direction of motif effects — a known issue of multitask models. The model often recognized motifs before fine-tuning, but got the direction wrong. After fine-tuning, its predictions aligned better with the data. (8/)
May 29, 2025 at 11:57 PM
We used our list of gene expression outliers to explore their effect on transcription factor binding sites. Our results show that it is easier for new variants to cause outlier gene expression by disrupting existing regulatory components rather than creating new ones. (7/)
May 29, 2025 at 11:57 PM
We also attempted to fine-tune Enformer and Borzoi on our promoter variant set. While their performance improved, both models still lagged behind PromoterAI. Notably, even before fine-tuning, PromoterAI outperformed Enformer and was on par with Borzoi. (6/)
May 29, 2025 at 11:57 PM
When it comes to predicting expression effects of promoter variants, PromoterAI achieved the best performance across benchmarks spanning RNA, protein, QTL, and MPRA data. (5/)
May 29, 2025 at 11:57 PM
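Benchmarks like those in (5/) typically score a model by the rank correlation between predicted and measured variant effect sizes. Here is a minimal numpy sketch of that evaluation; the data are made up, the function is a bare-bones Spearman correlation without tie handling, and none of it comes from the paper's actual pipeline.

```python
import numpy as np

def spearman(a, b):
    # Rank correlation between two score vectors (no tie handling).
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return (ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb))

# Hypothetical predicted vs. measured variant effect sizes.
predicted = np.array([0.9, -0.2, 0.4, -0.7])
measured = np.array([1.1, -0.1, 0.3, -0.9])
rho = spearman(predicted, measured)
```

Rank correlation is a natural choice across heterogeneous benchmarks (RNA, protein, QTL, MPRA) because it ignores differences in the scale of each assay's effect sizes.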
The second step was to fine-tune the model using a carefully curated list of rare promoter variants linked to aberrant gene expression. The fine-tuning was done in a twin-network setup to ensure generalization across unseen genes and datasets. (4/)
May 29, 2025 at 11:57 PM
First, we pre-trained PromoterAI to predict histone marks, TF binding, DNA accessibility, and CAGE signal from genomic sequence. The key difference from models like Enformer and Borzoi is that we predict at single base-pair resolution and use only TSS-centered regions. (3/)
May 29, 2025 at 11:57 PM
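The pre-training I/O in (3/) can be sketched by its shapes alone. This is a toy illustration with made-up sizes and a linear stand-in for the network; the real model uses a much longer TSS-centered window and many more assay tracks.

```python
import numpy as np

L = 16        # window length centered on a TSS (illustrative; real one is longer)
n_tracks = 5  # e.g. histone marks, TF binding, accessibility, CAGE

rng = np.random.default_rng(1)
x = np.eye(4)[rng.integers(0, 4, size=L)]  # fake one-hot DNA input, shape (L, 4)
W = rng.normal(size=(4, n_tracks))

# Linear stand-in for the network: one prediction per base, per assay track,
# i.e. single base-pair resolution rather than binned output.
y = x @ W  # shape (L, n_tracks)
```

The contrast with Enformer/Borzoi is in the output shape: rather than averaging signal into bins, every base in the TSS window gets its own multitask prediction.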
PromoterAI is built from transformer-inspired blocks called MetaFormers, but instead of attention we use depthwise convolutions, making it a fully convolutional model. We believe CNN-based methods have not yet been surpassed and remain a great choice for genomics tasks. (2/)
May 29, 2025 at 11:57 PM
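The block structure in (2/) can be sketched as follows: a MetaFormer keeps the transformer's residual skeleton (token mixer followed by a channel MLP) but swaps attention for another mixer, here a depthwise convolution. This numpy sketch is illustrative only; it omits normalization layers and uses arbitrary sizes, and it is not the actual PromoterAI block.

```python
import numpy as np

def depthwise_conv1d(x, kernels):
    # x: (L, C); kernels: (K, C) -- one 1-D filter per channel, 'same' padding.
    L, C = x.shape
    K = kernels.shape[0]
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(L):
        # Each channel is convolved with its own kernel (no cross-channel mixing).
        out[i] = np.einsum("kc,kc->c", xp[i:i + K], kernels)
    return out

def metaformer_block(x, kernels, W1, W2):
    # MetaFormer skeleton: token mixer + channel MLP, each with a residual.
    # The token mixer is a depthwise convolution instead of attention.
    x = x + depthwise_conv1d(x, kernels)  # mix information across positions
    h = np.maximum(x @ W1, 0.0)           # channel MLP with ReLU
    return x + h @ W2

rng = np.random.default_rng(2)
L, C, K, H = 8, 6, 3, 12
y = metaformer_block(
    rng.normal(size=(L, C)),
    rng.normal(size=(K, C)),
    rng.normal(size=(C, H)),
    rng.normal(size=(H, C)),
)
```

The depthwise convolution mixes information along the sequence axis only, while the MLP mixes channels, which is the same division of labor attention and MLP play in a transformer, and it keeps the whole model convolutional.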