Lightnews — Scholar-powered news

Abdul Muntakim Rafi

@muntakimrafi.bsky.social


PhD candidate @SBME_UBC | Machine Learning | Gene regulation


Abdul Muntakim Rafi

@muntakimrafi.bsky.social

9/ Not only do hashFrag-generated train-test splits effectively mitigate leakage, but hashFrag-trained models also outperformed models trained on chromosomal splits, showing that chromosomal splitting not only introduces train-test leakage but also creates inferior train-val splits.

A. Histogram showing the number of test sequences (y-axis) with corresponding maximum pairwise SW local alignment scores with the training sequences (x-axis) for both chromosomal splits (blue) and hashFrag-pure (red), with approximately 80% of the sequences for training and 20% for test sets.
B. hashFrag-split trained models outperform chromosomal split trained models. Performances across 100 replicates (points; y-axes) of different models (columns) on the designed sequences from Gosai et al. (20) when trained on different chromosomal and hashFrag splits (x-axes). Statistical significance between hashFrag and chromosomally trained models was calculated using the two-sample t-test.

January 27, 2025 at 11:04 PM

Abdul Muntakim Rafi

@muntakimrafi.bsky.social

8/ We applied hashFrag to test datasets. Across models tested, model performance was inflated by the presence of test sequences that were similar to training sequences. hashFrag revealed more reliable performance measures.

hashFrag removes overestimation of model performance. Model performance (Pearson r²; y-axes) across different models (columns) for different chromosomal splits (rows) following the removal of similar sequences using hashFrag-pure at different maximum SW score thresholds (x-axes).
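The threshold sweep described in this caption can be sketched as follows. This is a minimal illustration, not hashFrag's own code: the function names are hypothetical, and it assumes each test sequence already has a precomputed maximum SW score against the training set and that sequences scoring above the threshold are the ones removed.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def r2_after_filtering(y_true, y_pred, max_sw_scores, threshold):
    """Pearson r² on test sequences whose max SW score to train is <= threshold.

    Assumption: sequences above the threshold are treated as leaked and dropped.
    """
    kept = [
        (t, p)
        for t, p, s in zip(y_true, y_pred, max_sw_scores)
        if s <= threshold
    ]
    truths = [t for t, _ in kept]
    preds = [p for _, p in kept]
    return pearson_r2(truths, preds)
```

Calling `r2_after_filtering` once per threshold gives one point per x-axis position in a plot like the one described; a performance drop as the threshold tightens suggests the unfiltered score was inflated by similar sequences.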

January 27, 2025 at 11:04 PM

Abdul Muntakim Rafi

@muntakimrafi.bsky.social

7/ To detect and avoid homology-based leakage, we created hashFrag, which leverages BLAST to identify similar sequences and can then either (1) filter the leaked sequences out of the test set, (2) stratify the test set into subgroups by distance, or (3) create leakage-free train-test splits.

Overview of the hashFrag method. Each sequence in the dataset is subjected to the BLASTn algorithm to identify candidate homologous sequences in the dataset. False-positive candidates (denoted with a red ‘X’) are subsequently removed based on their SW local alignment scores according to a specified threshold, resulting in a network where only probable homologs are connected (solid lines in the network). Cases of detected homology can be used to either filter out homologs from test data for existing data splits, further stratify the test split into subsets based on similarity to the train split, or create new orthogonal data splits.
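The last option, creating orthogonal splits from the homology network, can be sketched as below. This is an illustrative sketch and not hashFrag's implementation: it assumes the homolog pairs (the solid edges that survive BLASTn plus SW filtering) are already available, and the function names, `test_fraction`, and `seed` are hypothetical. Connected components are found with union-find, and each component is assigned whole to either train or test so no homolog pair straddles the split.

```python
import random
from collections import defaultdict


def homology_clusters(ids, homolog_pairs):
    """Group sequence ids into connected components of the homology network."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in homolog_pairs:
        parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return list(clusters.values())


def split_by_cluster(ids, homolog_pairs, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test; test never exceeds its target size."""
    clusters = homology_clusters(ids, homolog_pairs)
    random.Random(seed).shuffle(clusters)
    target = test_fraction * len(ids)
    train, test = [], []
    for cluster in clusters:
        if len(test) + len(cluster) <= target:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test
```

Because clusters are never broken up, the test set can end up smaller than the requested fraction when clusters are large.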

January 27, 2025 at 11:04 PM

Abdul Muntakim Rafi

@muntakimrafi.bsky.social

6/ We analyzed GWAS SNVs from OpenTarget with PIP>0.1 and found a substantial percentage of these SNVs have their alternate alleles, along with their flanking sequences, replicated on other chromosomes, often many times.

A. Percentage of GWAS SNVs (y-axis) with SNV doppelgängers of each sequence length (x-axis).
B. Number of fine-mapped GWAS SNVs (y-axes) with the corresponding number of SNV doppelgängers (x-axes) on other chromosomes in the genome for 41 bp regions.

January 27, 2025 at 11:04 PM

Abdul Muntakim Rafi

@muntakimrafi.bsky.social

5/ An important application of models is to predict the effect of variants. However, variants along with their flanking region can be replicated throughout the genome. Without accounting for homology, you can’t tell if the model’s prediction is based on learned cis-regulatory logic or memorization.

Illustration of (i) homology across chromosomes, (ii) SNVs associated with diseases, and (iii) SNV doppelgängers, sequences elsewhere in the genome with an identical sequence to the GWAS alternate allele, including its flanking region.
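A rough sketch of how SNV doppelgängers could be counted, following the definition in this caption. The function names and the toy genome layout (a dict of chromosome name to sequence string) are assumptions for illustration, positions are 0-based, and only forward-strand exact matches are counted, which is a simplification. With `flank=20` the window is 41 bp.

```python
def alt_window(ref_seq, pos, alt, flank):
    """Reference window around a 0-based SNV position with the alternate allele swapped in."""
    start, end = pos - flank, pos + flank + 1
    if start < 0 or end > len(ref_seq):
        raise ValueError("window extends past the end of the sequence")
    return ref_seq[start:pos] + alt + ref_seq[pos + 1:end]


def count_occurrences(text, pattern):
    """Count exact matches of pattern in text, including overlapping ones."""
    count = 0
    i = text.find(pattern)
    while i != -1:
        count += 1
        i = text.find(pattern, i + 1)
    return count


def count_doppelgangers(genome, chrom, pos, alt, flank=20):
    """Count exact copies of the alt-allele window on every other chromosome."""
    window = alt_window(genome[chrom], pos, alt, flank)
    return sum(
        count_occurrences(seq, window)
        for name, seq in genome.items()
        if name != chrom
    )
```

A count above zero means a model could have seen the alternate-allele sequence verbatim during training, even if the SNV's own chromosome was held out.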

January 27, 2025 at 11:04 PM

Abdul Muntakim Rafi

@muntakimrafi.bsky.social

4/ We saw a very interesting trend where models fit to the most similar test sequences early during training, faster than they fit the overall training data, making these sequences unreliable for evaluating actual performance.

Neural networks trained on different chromosomal splits show the same trend of varying levels of performance on different degrees of homology. Performance of different models (columns) in Pearson r² (y-axes) during model training (x-axes) for different chromosomal splits.

January 27, 2025 at 11:04 PM

Abdul Muntakim Rafi

@muntakimrafi.bsky.social

3/ We created the cheeky OverfitNN as a maximally overfit benchmark, which is nearest-neighbor-based and has no understanding of cis regulation. As expected, OverfitNN only works well for closely related sequences, but even neural networks work best for sequences that are similar to their train data.

Model performance on test data depends on similarity to training data. Performance comparison (Pearson r²; y-axes) of different models (OverfitNN, DREAM-CNN, DREAM-RNN, DREAM-Attn, and MPRAnn; colors) across varying levels of homology (SW alignment score, x-axes) in different chromosomal folds.
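The idea behind a nearest-neighbor memorization baseline can be sketched in a few lines. This is not the authors' OverfitNN: the thread does not say how it measures similarity, so this sketch swaps in Jaccard similarity over shared k-mers as a stand-in, and the class name and `k` parameter are hypothetical. The predictor returns the measured activity of the most similar training sequence and learns nothing about regulatory logic.

```python
def kmers(seq, k=4):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class NearestNeighborBaseline:
    """Predicts the activity of the most similar training sequence (pure memorization)."""

    def __init__(self, k=4):
        self.k = k
        self._train = []

    def fit(self, seqs, activities):
        # "Training" is just storing each sequence's k-mer set with its label.
        self._train = [(kmers(s, self.k), y) for s, y in zip(seqs, activities)]
        return self

    def predict(self, seq):
        query = kmers(seq, self.k)
        _, activity = max(self._train, key=lambda item: jaccard(query, item[0]))
        return activity
```

Such a baseline can only score well on test sequences that closely resemble training sequences, which is what makes it a useful yardstick for leakage.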

January 27, 2025 at 11:04 PM

Abdul Muntakim Rafi

@muntakimrafi.bsky.social

2/ We compared regulatory regions against each other using chromosomal splitting and found that many genomic sequences are far more similar to one another than unrelated sequences are. We set out to investigate how this similarity could cause train-test leakage.

Homology is common between chromosomes. Histogram showing the number of test sequences (y-axis) with corresponding maximum pairwise SW local alignment scores with the training sequences (x-axis) for both genomic (blue) and dinucleotide shuffled (red) sequences, with training and test sets randomly sampled from distinct chromosome sets (20,000 each).
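The x-axis quantity in this histogram, the maximum pairwise Smith-Waterman (SW) local alignment score between a test sequence and the training set, can be sketched as follows. The scoring parameters (match +1, mismatch -1, linear gap -2) are assumptions, since the thread does not state the ones used, and this pure-Python version is quadratic per pair and only practical for small examples.

```python
def sw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: scores never drop below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def max_score_to_train(test_seq, train_seqs):
    """Highest SW score between one test sequence and any training sequence."""
    return max(sw_score(test_seq, t) for t in train_seqs)
```

Computing `max_score_to_train` for every test sequence and plotting the values as a histogram reproduces the kind of comparison described here, with shuffled sequences serving as the unrelated background.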

January 27, 2025 at 11:04 PM

Abdul Muntakim Rafi

@muntakimrafi.bsky.social

1/ Typically, the genome is split into train and test sets by chromosome without accounting for homologous sequences. Because similar sequences encode similar activities, a model could conceivably predict the activity of test sequences that are very similar to train sequences correctly just by memorizing them.

Homologous sequences from two different chromosomes can share functional genomic signals. ATAC-seq read counts (y-axis) for two homologous 1000 bp regions (x-axis) on chromosomes 9 and 16 (colours) in K562 cells.

January 27, 2025 at 11:04 PM

Abdul Muntakim Rafi

@muntakimrafi.bsky.social

Had a lot of fun at the CSHL Biological Data Science conference.

Thanks to the scholarship from the "James P. Taylor Foundation for open science" for making it possible.

#cshl

November 17, 2024 at 11:02 PM
