Maciek Wiatrak
macwiatrak.bsky.social
PhD student @ Cambridge Centre for AI in Medicine (CCAIM). I do 💻 🧬 💊 and love ⛰️.
If you'd like to collaborate, get in touch at macwiatrak@gmail.com, or DM me directly.

🧵 16/n
July 21, 2025 at 9:56 AM
Most importantly, huge shoutout to everyone who made this possible! Thanks to them, the project was first and foremost immense fun!

🧵 15/n
July 21, 2025 at 9:56 AM
We hope that Bacformer will accelerate microbial discovery, and we are excited about future work!

That’s it! Thanks for sticking with me through this thread!

🧵 14/n
July 21, 2025 at 9:56 AM
Finally, we used Bacformer to generate sequences of protein families given a prompt. Bacformer generates sequences that span essential functions and resemble real genomes. We also used Bacformer to generate sequences for desired traits, like oxygen requirement.

🧵 13/n
July 21, 2025 at 9:56 AM
Accurately predicting phenotypic traits opens up the possibility of discovering genes that are likely causally associated with specific traits. We used gradient-based attribution to identify the genes involved in sporulation and host adaptation.

🧵 12/n
July 21, 2025 at 9:56 AM
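
To make the gradient-based attribution idea concrete, here is a minimal sketch using gradient × input on a toy classifier. Everything in it is an assumption for illustration: the thread does not say which attribution method was used, and the mean-pooled logistic model, the function name `gradient_x_input`, and the toy numbers are hypothetical stand-ins, not Bacformer's architecture.

```python
import numpy as np

def gradient_x_input(gene_embeddings, w, b):
    """Score each gene by (d prob / d embedding) * embedding, summed over dimensions.

    Assumed toy model: mean-pool the gene embeddings, then apply logistic regression.
    """
    n_genes = gene_embeddings.shape[0]
    pooled = gene_embeddings.mean(axis=0)
    logit = pooled @ w + b
    prob = 1.0 / (1.0 + np.exp(-logit))
    # Chain rule: d prob / d logit = prob * (1 - prob), d logit / d x_i = w / n_genes.
    grad = prob * (1.0 - prob) * np.tile(w / n_genes, (n_genes, 1))
    return (grad * gene_embeddings).sum(axis=1)

# Toy genome with 4 genes and 3-dimensional embeddings (made-up values).
gene_embeddings = np.array([
    [0.1, 0.5, 0.2],
    [0.0, 0.3, 0.9],
    [2.0, 0.1, 0.1],
    [-0.2, 0.4, 0.0],
])
w = np.array([1.0, 0.0, 0.0])  # hypothetical trait classifier weights
b = 0.0

scores = gradient_x_input(gene_embeddings, w, b)
top_gene = int(np.argmax(scores))  # index of the gene with the largest attribution
```

In this toy setup the gene whose embedding aligns most strongly with the classifier weights gets the highest score; with a real model the gradients would come from automatic differentiation rather than a hand-derived formula.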
Bacformer can also predict diverse phenotypic traits! We trained Bacformer to predict 139 phenotypes from the genome alone.

We then used the high-performing phenotypes to annotate our corpus of >1.3M genomes with 32 diverse phenotypic traits.

🧵 11/n
July 21, 2025 at 9:56 AM
By leveraging the whole-genome context, we show how Bacformer boosts performance on gene essentiality and protein function annotation tasks.

To us, this makes sense, as a gene's essentiality and function are often tied to its genomic neighborhood in bacteria.

🧵 10/n
July 21, 2025 at 9:56 AM
It can also be used to predict protein-protein interactions across diverse bacteria.

To do this, we fine-tuned the model on STRING DB and used it to predict the interactome of P. aeruginosa, with the top-scoring pairs showing high-confidence interfaces based on AF3.

🧵 9/n
July 21, 2025 at 9:56 AM
It also nails operon detection! We validated our predictions with long-read RNA sequencing, showing how Bacformer can be used for operon identification even in a zero-shot setup.

🧵 8/n
July 21, 2025 at 9:56 AM
Bacformer can be used for a range of bacterial genomics tasks, either zero-shot or fine-tuned.

We tested whether Bacformer can uncover evolutionary relationships by checking if its genome embeddings can be used for clustering without any "species" token.

🧵 7/n
July 21, 2025 at 9:56 AM
If each token is a protein, how do we do pretraining if the space of possible proteins is effectively unbounded? We leverage the similarities between proteins and create a discrete vocabulary of 50,000 “protein family” clusters using protein embeddings from a protein language model (pLM).

🧵 6/n
July 21, 2025 at 9:56 AM
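
A rough sketch of how a discrete "protein family" vocabulary could be built from embeddings: cluster the embedding vectors and use each protein's cluster index as its token. The clustering method is not named in the thread, so plain k-means is an assumption here, as are the function names `build_vocabulary` and `assign_tokens`, the 2-dimensional random toy embeddings standing in for pLM embeddings, and the 3 clusters standing in for 50,000.

```python
import numpy as np

def assign_tokens(embeddings, centroids):
    """Map each protein embedding to the index of its nearest centroid."""
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def build_vocabulary(embeddings, n_clusters, n_iters=20, seed=0):
    """Plain k-means (an assumed stand-in clustering method); returns the centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen, distinct embeddings.
    idx = rng.choice(len(embeddings), size=n_clusters, replace=False)
    centroids = embeddings[idx].copy()
    for _ in range(n_iters):
        labels = assign_tokens(embeddings, centroids)
        for k in range(n_clusters):
            members = embeddings[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids

# Toy data: 300 fake "protein embeddings" scattered around 3 centres.
rng = np.random.default_rng(42)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
toy_embeddings = np.concatenate(
    [c + rng.normal(scale=0.5, size=(100, 2)) for c in centres]
)

centroids = build_vocabulary(toy_embeddings, n_clusters=3)
token_ids = assign_tokens(toy_embeddings, centroids)  # one discrete token per protein
```

Once the centroids are fixed, any new protein can be tokenised by embedding it and calling `assign_tokens`, which is what makes a finite vocabulary possible over an open-ended protein space.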
To capture patterns across diverse bacteria, we pretrained Bacformer on a curated corpus of over 1.3M metagenomes spanning over 25,000 species across diverse environments and containing almost 3B proteins.

🧵 5/n
July 21, 2025 at 9:56 AM
Bacformer represents each genome as a sequence of proteins, ordered by their location on the genome, providing a unified representation across bacterial species, and allowing us to learn evolutionary patterns across all bacteria, rather than a single species.

🧵 4/n
July 21, 2025 at 9:56 AM
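
The representation described above can be sketched in a few lines: sort a genome's proteins by genomic position and emit one token per protein. The `Protein` fields, the `genome_to_sequence` helper, and the rule of ordering by contig name and then start coordinate are illustrative assumptions, not the project's actual data format.

```python
from dataclasses import dataclass

@dataclass
class Protein:
    protein_id: str
    contig: str        # contig or chromosome the gene sits on
    start: int         # start coordinate of the coding sequence
    family_token: int  # hypothetical discrete "protein family" token

def genome_to_sequence(proteins):
    """Order proteins by (contig, start) and return their tokens in that order."""
    ordered = sorted(proteins, key=lambda p: (p.contig, p.start))
    return [p.family_token for p in ordered]

# Toy genome listed in arbitrary order.
toy_genome = [
    Protein("p3", "contig_1", 5200, 17),
    Protein("p1", "contig_1", 100, 4),
    Protein("p4", "contig_2", 50, 4),
    Protein("p2", "contig_1", 1300, 99),
]

sequence = genome_to_sequence(toy_genome)
print(sequence)  # [4, 99, 17, 4]
```

Because the tokens are protein families rather than species-specific gene names, genomes from different species end up in the same input space.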
Why make ML models for bacteria 🦠? 1️⃣ Bacteria shape every ecosystem, and our own health. 2️⃣ They are easier to model than mammalian cells: their genomes are small, compact and mostly coding; they have no real epigenome. 3️⃣ We have a lot of data (thanks to metagenomics)!

🧵 3/n
July 21, 2025 at 9:56 AM