Bin Shao
binshaophy.bsky.social
Broadie. deep learning; synthetic biology; single cell genomics; non-linear dynamics. opinions are my own.
Latest work: https://www.biorxiv.org/content/10.1101/2024.12.30.630741v2
TXpredict captures variations in gene expression both across different protein functional groups and within the same functional group.
January 4, 2025 at 11:47 PM
We further used TXpredict to predict the expression of 3.1M genes across a collection of 900 microbial genomes. Small clusters of ribosomal genes were located at the periphery of the tSNE plot of all genes and showed high predicted expression.
January 4, 2025 at 11:47 PM
Our model leverages information learned from the ESM2 model, together with basic protein statistics, to predict genome-wide gene expression. It achieves an average Spearman correlation of 0.53 in predicting gene expression for bacterial genomes that are not in the training dataset.
January 4, 2025 at 11:47 PM
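For readers who want to see how an "average Spearman correlation" across held-out genomes could be computed, here is a minimal sketch. It is not the TXpredict evaluation code: the function names and the `per_genome` layout (genome ID mapped to a pair of measured and predicted expression lists) are assumptions made for illustration only.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_spearman_across_genomes(per_genome):
    """Average the per-genome Spearman correlation.

    per_genome is a hypothetical layout: {genome_id: (measured, predicted)}.
    """
    scores = [spearman(m, p) for m, p in per_genome.values()]
    return sum(scores) / len(scores)
```

Because Spearman depends only on rank order, a model that gets the ordering of genes right scores well even if its predicted expression values are on a different scale from the measurements.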
8/n 🧩 EcoVAE can also interpolate missing occurrences. For example: In North America, EcoVAE predictions for Sassafras largely overlapped with iNaturalist records. In South Asia, EcoVAE highlighted a wider distribution of Desmodium, consistent with field surveys.
December 18, 2024 at 1:09 AM
7/n 🌍Where is biodiversity under-sampled? We found that regions with high prediction error overlap with known "darkspots" of biodiversity collection. For example, the highest prediction errors for plants were observed in South Asia, Southeast Asia, the Middle East, and Central Africa.
December 18, 2024 at 1:09 AM
6/n 🦋EcoVAE isn’t limited to plants. The model generalizes well to other taxa, including butterflies and mammals, showcasing its versatility across ecosystems.
December 18, 2024 at 1:09 AM
5/n🖥️Remarkably, EcoVAE can predict species distributions even with sparse inputs. With just 20% of input data, it achieved an AUROC of 0.78, effectively identifying the locations of missing genera.
December 18, 2024 at 1:09 AM
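As a rough illustration of the AUROC figure in the post above, the sketch below computes AUROC from its pairwise definition: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. How EcoVAE actually builds its labels and scores is not stated in the thread; treating withheld genera as positives and model outputs as scores is an assumption here, and `auroc` is a hypothetical helper.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores.

    labels: 1 for a withheld record that is truly present, 0 for absent
            (an assumed labeling, not taken from the thread).
    scores: model-predicted scores, higher meaning "more likely present".
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))
```

This pairwise form is quadratic in the number of records, so it is only meant to show what the metric measures. An AUROC of 0.5 corresponds to random ranking and 1.0 to a perfect one.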
4/n🌍 We withheld data from three independent regions to test the model's generalization. The model reconstructed species distributions effectively, even for the withheld test regions, and predicted the locations of missing records at the genus and species levels.
December 18, 2024 at 1:09 AM
3/n 🚀We leverage a VAE structure that enables fast and scalable modeling of species distribution patterns. In training, we masked 50% of species records and tasked the model with reconstructing the full species distribution, mimicking real-world biodiversity sampling.
December 18, 2024 at 1:09 AM
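The masking step described above can be sketched roughly as follows. This is not the EcoVAE training code: representing a location as a binary presence vector over species, and hiding a fraction of the observed presences within that vector, are assumptions made for illustration, as is the `mask_records` helper itself.

```python
import random


def mask_records(presence, mask_fraction=0.5, rng=None):
    """Hide a fraction of the observed presences in one location's vector.

    presence: list of 0/1 values, one per species (assumed representation).
    Returns (masked_input, hidden_indices). The model would see masked_input
    and be trained to reconstruct the original presence vector.
    """
    rng = rng or random.Random()
    observed = [i for i, v in enumerate(presence) if v == 1]
    n_hide = int(round(len(observed) * mask_fraction))
    hidden = set(rng.sample(observed, n_hide))
    # Hidden presences become 0; everything else is left unchanged.
    masked = [0 if i in hidden else v for i, v in enumerate(presence)]
    return masked, sorted(hidden)


# Example: 8 observed species, half of them hidden from the model input.
full = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1]
masked, hidden = mask_records(full, mask_fraction=0.5, rng=random.Random(0))
```

The training target stays the unmasked vector `full`, so the reconstruction loss rewards the model for recovering records that the input did not contain.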
2/n 🌿Biodiversity is under immense pressure. Predicting global species distributions at scale is critical, but traditional species distribution models struggle with massive datasets and interspecies interactions (e.g., >33M records and >127K species of plants).
December 18, 2024 at 1:09 AM