Maciek Wiatrak
@macwiatrak.bsky.social
PhD student @ Cambridge Centre for AI in Medicine (CCAIM). I do 💻 🧬 💊 and love ⛰️.
Finally, we used Bacformer to generate sequences of protein families from a prompt. The generated sequences span essential functions and resemble real genomes. We also used Bacformer to generate sequences for desired traits, such as oxygen requirement.

🧵 13/n
July 21, 2025 at 9:56 AM
Accurately predicting phenotypic traits opens up the possibility of discovering the genes that are likely causally associated with specific traits. We used gradient-based attribution to identify the genes involved in sporulation and host adaptation.
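
To make "gradient-based attribution" concrete, here is a minimal sketch using gradient × input on a toy logistic model. Everything in it is an assumption for illustration: the gene names, weights, and scalar per-gene features are hypothetical, and the thread does not say which attribution method or model head was actually used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_x_input(weights, bias, features):
    """Gradient x input attribution for p = sigmoid(w . x + b).

    For this model, dp/dx_i = p * (1 - p) * w_i, so the attribution
    of feature i is x_i * p * (1 - p) * w_i.
    """
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = sigmoid(z)
    return [x * p * (1.0 - p) * w for w, x in zip(weights, features)]

# Hypothetical toy data: one scalar feature per gene (not from the preprint).
genes = ["geneA", "geneB", "geneC"]
weights = [2.0, -1.0, 0.1]
features = [1.0, 1.0, 1.0]

attributions = gradient_x_input(weights, 0.0, features)

# Rank genes by absolute attribution to find the most influential one.
ranked = sorted(zip(genes, attributions), key=lambda ga: abs(ga[1]), reverse=True)
top_gene = ranked[0][0]
```

In a real setting the gradient would come from automatic differentiation through the full model rather than from a closed-form expression.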

🧵 12/n
July 21, 2025 at 9:56 AM
Bacformer can also predict diverse phenotypic traits! We trained Bacformer to predict 139 phenotypes from the genome alone.

We then used the high-performing phenotype predictions to annotate our corpus of >1.3M genomes with 32 diverse phenotypic traits.

🧵 11/n
July 21, 2025 at 9:56 AM
By leveraging whole-genome context, we show how Bacformer boosts performance on gene essentiality and protein function annotation tasks.

To us, this makes sense, as a gene's essentiality and function are often tied to its genomic neighbourhood in bacteria.

🧵 10/n
July 21, 2025 at 9:56 AM
It can also be used to predict protein-protein interactions across diverse bacteria.

To do this, we fine-tuned the model on STRING DB and used it to predict the interactome of P. aeruginosa, with the top-scoring pairs showing high-confidence interfaces based on AF3.

🧵 9/n
July 21, 2025 at 9:56 AM
It also nails operon detection! We validated our predictions with long-read RNA sequencing, showing how Bacformer can be used for operon identification even in a zero-shot setup.

🧵 8/n
July 21, 2025 at 9:56 AM
Bacformer can be used for a range of bacterial genomics tasks, either zero-shot or fine-tuned.

We examined whether Bacformer can uncover evolutionary relationships by testing whether its genome embeddings can be used for clustering without any "species" token.

🧵 7/n
July 21, 2025 at 9:56 AM
If each token is a protein, how do we do pretraining if the space of possible proteins is effectively unbounded? We leverage the similarities between proteins and create a discrete vocabulary of 50,000 “protein family” clusters using protein embeddings from a protein language model (pLM).
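
As a rough illustration of how continuous protein embeddings can be mapped to a discrete vocabulary, the sketch below assigns each embedding to its nearest cluster centroid. It assumes the centroids already exist (for example from k-means) and uses squared Euclidean distance; the thread does not state the clustering algorithm or the distance measure, and the 2-D vectors are toy values standing in for pLM embeddings.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_family(embedding, centroids):
    """Return the index of the nearest centroid, used as the token ID."""
    return min(
        range(len(centroids)),
        key=lambda i: squared_distance(embedding, centroids[i]),
    )

# Hypothetical centroids: 3 clusters here, standing in for 50,000.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]

# Hypothetical protein embeddings for one genome, in genome order.
protein_embeddings = [[0.1, -0.2], [4.8, 5.1], [0.9, 1.2]]

# Each protein becomes one discrete "protein family" token.
tokens = [assign_family(e, centroids) for e in protein_embeddings]
```

With a finite set of token IDs like this, standard pretraining objectives over a fixed vocabulary become applicable.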

🧵 6/n
July 21, 2025 at 9:56 AM
To capture patterns across diverse bacteria, we pretrained Bacformer on a curated corpus of over 1.3M metagenomes spanning over 25,000 species across diverse environments and containing almost 3B proteins.

🧵 5/n
July 21, 2025 at 9:56 AM
Bacformer represents each genome as a sequence of proteins, ordered by their location on the genome, providing a unified representation across bacterial species, and allowing us to learn evolutionary patterns across all bacteria, rather than a single species.
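
A minimal sketch of this representation, assuming each protein record carries a contig name and a start coordinate. The `Protein` class, its field names, and the rule of sorting by contig then start position are hypothetical choices for illustration, not the preprocessing described in the preprint.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Protein:
    contig: str      # contig or chromosome the gene sits on
    start: int       # start coordinate of the gene on that contig
    protein_id: str  # identifier of the encoded protein

def order_proteins(proteins):
    """Return protein IDs ordered by contig, then by position on the contig."""
    ordered = sorted(proteins, key=lambda p: (p.contig, p.start))
    return [p.protein_id for p in ordered]

# Toy genome with proteins listed out of order.
genome = [
    Protein("contig_1", 5200, "protC"),
    Protein("contig_1", 120, "protA"),
    Protein("contig_2", 40, "protD"),
    Protein("contig_1", 1800, "protB"),
]

ordered_ids = order_proteins(genome)
```

The resulting ordered list is what a sequence model would consume, one protein per position.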

🧵 4/n
July 21, 2025 at 9:56 AM
💥 Excited to introduce Bacformer 🦠 - the first foundation model for bacterial genomics. Bacformer represents genomes as sequences of ordered proteins, learning the “grammar” of how genes are arranged, interact and evolve.

Preprint 📝: biorxiv.org/content/10.1...

🧵 1/n
July 21, 2025 at 9:56 AM