Abdul Muntakim Rafi
@muntakimrafi.bsky.social
PhD candidate @SBME_UBC | Machine Learning | Gene regulation
Amazing. Will get back to you.
January 30, 2025 at 7:42 PM
-we haven't yet tried a varied set of downstream tasks. I believe there would be some downstream tasks where you have more leakage than others
-we haven't tried a varied set of pretrained models
-we need to integrate hashFrag. We used chromosomal splits before, which might be a reason we didn't see much difference in magnitude
January 30, 2025 at 7:41 PM
this is a very important point. The expression levels for the different test subsets span a wide range (they are not only sequences with high expression or only sequences with low expression).
January 30, 2025 at 7:27 PM
-for models that overfit to different degrees, the drop in performance would be different.
-the drop in performance would vary by dataset and task as well.
January 30, 2025 at 7:23 PM
I think it's possible. Actually, this was and remains on our to-do list.
January 30, 2025 at 6:38 PM
therefore, I cannot provide any evidence of leakage happening in transfer learning (for now). But I would suggest avoiding it where possible.
take a look at @jmbartoszewicz.bsky.social & Melania's
www.biorxiv.org/content/10.1...
Beware of Data Leakage from Protein LLM Pretraining
January 30, 2025 at 6:32 PM
this was a problem I was particularly interested in: showing that leakage can occur when testing a fine-tuned model on pretraining data. A student from our group pursued it for some time, but we were unable to detect leakage in transfer learning to a degree where everyone would care.
January 30, 2025 at 6:32 PM
10/ hashFrag is openly available and accessible, so it's win, win, win: more accurate performance estimates, better performance overall, and easy to use. We hope that hashFrag sets the new standard for how data are split for training genome models.
Github: github.com/de-Boer-Lab/...
Paper: www.biorxiv.org/content/10.1...
GitHub - de-Boer-Lab/hashFrag: A command-line tool to mitigate homology-based data leakage in sequence-to-expression models
January 27, 2025 at 11:04 PM
9/ Not only do hashFrag-generated train-test splits effectively mitigate leakage, but hashFrag-trained models even outperformed chromosomal-split-trained models, showing that chromosomal splitting not only introduces train-test leakage but also creates inferior train-val splits.
January 27, 2025 at 11:04 PM
8/ We applied hashFrag to existing test datasets. Across the models tested, performance was inflated by the presence of test sequences similar to training sequences; hashFrag revealed more reliable performance measures.
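A minimal sketch of this kind of stratified check (hypothetical function and array names, not the paper's code; similarity assumed pre-computed and scaled to [0, 1]): bin each test sequence by its highest similarity to any training sequence, then score predictions within each bin. A metric that decays as similarity drops means the headline number was propped up by near-duplicates.

```python
import numpy as np
from scipy.stats import pearsonr

# y_true / y_pred: measured and predicted activities (numpy arrays).
# max_train_similarity[i]: test sequence i's highest similarity to any
# training sequence, assumed scaled to [0, 1] for this sketch.
def stratified_performance(y_true, y_pred, max_train_similarity,
                           bin_edges=(0.0, 0.2, 0.4, 0.6, 0.8, 1.01)):
    results = {}
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (max_train_similarity >= lo) & (max_train_similarity < hi)
        if mask.sum() > 2:  # need a few points for a correlation
            results[(lo, hi)] = pearsonr(y_true[mask], y_pred[mask])[0]
    return results
```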
January 27, 2025 at 11:04 PM
7/ To detect and avoid homology-based leakage, we created hashFrag, which leverages BLAST to identify similar sequences and then either (1) filters the leaked sequences out of the test set, (2) stratifies the test set into subgroups by distance, or (3) creates leakage-free train-test splits.
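To illustrate strategy (1) only, here is a minimal Python sketch (hypothetical data shapes and threshold, not the hashFrag CLI itself; see the repo below for the real tool): given test-to-train alignment hits, drop any test sequence whose best hit to train exceeds a score threshold.

```python
# hits: iterable of (test_id, train_id, alignment_score) tuples, e.g. parsed
# from BLAST tabular output. The threshold is an arbitrary placeholder.
def filter_leaked(test_ids, hits, score_threshold=60.0):
    leaked = {test_id for test_id, _, score in hits if score >= score_threshold}
    return [t for t in test_ids if t not in leaked]
```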
January 27, 2025 at 11:04 PM
6/ We analyzed GWAS SNVs from OpenTarget with posterior inclusion probability (PIP) > 0.1 and found that a substantial percentage of these SNVs have their alternate alleles, along with their flanking sequences, replicated on other chromosomes, often many times.
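A toy version of that check (hypothetical genome layout; exact matching only, whereas the real analysis presumably tolerates near-identical copies too): build the alternate-allele window with its flanks and count exact copies on other chromosomes.

```python
def count_alt_replicates(genome, chrom, pos, alt, flank=100):
    """genome: dict mapping chromosome name -> sequence string; pos is
    0-based and assumed a single-base SNV at least `flank` bases from the
    chromosome end. Counts exact copies of alt + flanks elsewhere."""
    seq = genome[chrom]
    assert flank <= pos <= len(seq) - flank - 1
    window = seq[pos - flank:pos] + alt + seq[pos + 1:pos + 1 + flank]
    hits = 0
    for other_chrom, other_seq in genome.items():
        if other_chrom == chrom:
            continue
        i = other_seq.find(window)
        while i != -1:
            hits += 1
            i = other_seq.find(window, i + 1)
    return hits
```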
January 27, 2025 at 11:04 PM
5/ An important application of models is to predict the effect of variants. However, variants along with their flanking region can be replicated throughout the genome. Without accounting for homology, you can’t tell if the model’s prediction is based on learned cis-regulatory logic or memorization.
January 27, 2025 at 11:04 PM
4/ We saw a very interesting trend: models fit the test sequences most similar to their training data early during training, faster than they fit the overall training data, making these sequences unreliable for evaluating actual performance.
January 27, 2025 at 11:04 PM
3/ We created the cheeky OverfitNN as a maximally overfit benchmark; it is nearest neighbor-based and has no understanding of cis-regulation. As expected, OverfitNN only works well for closely related sequences, but even neural networks work best on sequences that are similar to their training data.
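A toy version of the OverfitNN idea (hypothetical code, not the paper's implementation; k-mer overlap stands in for a proper similarity score): predict a test sequence's activity by copying the activity of its most similar training sequence.

```python
def kmers(seq, k=8):
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def overfitnn_predict(test_seq, train_seqs, train_activities, k=8):
    """Pure memorization: return the activity of the nearest training neighbor."""
    test_k = kmers(test_seq, k)
    best = max(range(len(train_seqs)),
               key=lambda i: len(test_k & kmers(train_seqs[i], k)))
    return train_activities[best]
```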
January 27, 2025 at 11:04 PM
2/ We compared regulatory regions against each other under chromosomal splitting and found that many genomic sequences are far more similar to each other than unrelated sequences are. We set out to investigate how this similarity could cause train-test leakage.
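One way to see this (a hypothetical sketch, with k-mer Jaccard as a cheap proxy for the alignment-based comparison): compute each held-out sequence's best similarity to any training sequence and look at the distribution. Under chromosomal splits, many test sequences score far above what unrelated pairs would.

```python
def kmers(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def max_train_similarity(test_seqs, train_seqs, k=8):
    """For each test sequence, its best k-mer Jaccard similarity to train."""
    train_sets = [kmers(s, k) for s in train_seqs]
    out = []
    for t in test_seqs:
        tk = kmers(t, k)
        out.append(max(len(tk & ts) / len(tk | ts) for ts in train_sets))
    return out
```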
January 27, 2025 at 11:04 PM
1/ Typically, the genome is split into train & test by chromosome without accounting for homologous sequences. Because similar sequences encode similar activities, a model could conceivably predict the activity of test sequences that are very similar to train sequences correctly, just by memorizing them.
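For reference, the standard split this thread is about, as a minimal sketch (hypothetical record format and held-out chromosomes): nothing in it asks whether a test sequence has a near-identical copy in train.

```python
# records: iterable of (chrom, sequence, activity) tuples; test_chroms is an
# arbitrary held-out choice. Homologous copies on other chromosomes still
# land in the training set.
def chromosomal_split(records, test_chroms=frozenset({"chr8", "chr13"})):
    train = [r for r in records if r[0] not in test_chroms]
    test = [r for r in records if r[0] in test_chroms]
    return train, test
```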
January 27, 2025 at 11:04 PM
Amazing collaboration between de Boer lab (@CarldeBoerPhD, myself) and Yachie lab (@yachielab, @nzmyachie, Brett Kiyota)
November 14, 2024 at 6:31 AM
9/ With so many cool technologies being developed, it's an exciting time to be involved in sequence modeling! Make sure to use and strive to beat the state-of-the-art when applying models.
📄 Paper: www.nature.com/articles/s41...
💻 GitHub: github.com/de-Boer-Lab/random-promoter-dream-challenge-2022
A community effort to optimize sequence-based deep learning models of gene regulation - Nature Biotechnology
November 14, 2024 at 6:25 AM