Damiano Sgarbossa
@damianosg.bsky.social
PhD in Computational Biology & ML for Proteins @EPFL

https://sites.google.com/view/damiano-sgarbossa
📈 Despite its smaller size, ProtMamba outperforms state-of-the-art models on conditional sequence generation and is competitive with other protein language models on fitness prediction, highlighting the importance of long-context conditioning.

Read it here: doi.org/10.1093/bioi...
Github repo: github.com/Bitbol-Lab/P...
July 7, 2025 at 4:48 PM
🧬 ProtMamba applications include:
- Generating novel protein sequences conditioned on a given set of homologs,
- Inpainting specific regions within sequences,
- Modeling disordered regions of different protein sequences,
- Predicting the fitness of protein variants.
July 7, 2025 at 4:48 PM
⚙️ ProtMamba is based on Mamba, a state space model that efficiently handles very long sequences. The model uses a fill-in-the-middle training objective that combines autoregressive and masked language modeling to predict amino acids conditioned on the given homologs.
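For intuition, here is a minimal sketch of what fill-in-the-middle looks like at the data level: homologs are concatenated as context, a span is cut out of the target, and the span is appended after a sentinel so the ordinary next-token loss on the tail acts like masked prediction. The `<cls>`/`<mask_1>`/`<eos>` token names and the single-span choice are illustrative assumptions, not the exact ProtMamba preprocessing.

```python
import random

def build_fim_example(target: str, homologs: list[str], max_span: int = 10) -> str:
    """Build one fill-in-the-middle training string from a target and its homologs."""
    context = "".join(f"<cls>{h}" for h in homologs)            # unaligned homolog context
    start = random.randrange(0, max(1, len(target) - max_span))
    length = random.randint(1, max_span)
    middle = target[start:start + length]                        # span the model must predict
    body = target[:start] + "<mask_1>" + target[start + length:]
    # the cut-out span goes at the end after its sentinel, so autoregressive
    # prediction of the tail is effectively masked-language-model prediction
    return f"{context}<cls>{body}<eos><mask_1>{middle}"

print(build_fim_example("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", ["MKTAYIAKQR", "MKSAYIGKQR"]))
```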
July 7, 2025 at 4:48 PM
🔍 ProtMamba is homology-aware yet alignment-free: it captures evolutionary information without relying on multiple sequence alignments. This lets it avoid the imperfections of MSAs while still using the information from other homologs to condition generation!
July 7, 2025 at 4:48 PM
Also, a huge thanks to my supervisor Anne-Florence and my defense committee: Bruno Correia @pschwllr.bsky.social @sokrypton.org and Thomas Lemmin
June 30, 2025 at 11:42 AM
This is work I did in collaboration with Anne-Florence Bitbol @epfl-ai-center.bsky.social. #CompBio #DeepLearning #ProteinEngineering #AI #MachineLearning #ICLR2025
April 11, 2025 at 2:54 PM
RAG-ESM is simple to implement, compatible with pretrained ESM2 checkpoints, and efficient to train (~50–120 GPU hours).

Come check out my poster (spotlight) at the MLGenX workshop at ICLR in Singapore!

Code (still WIP): github.com/Bitbol-Lab/r...
Preprint: doi.org/10.1101/2025...

7/7
April 11, 2025 at 2:47 PM
RAG-ESM is trained with a discrete diffusion objective, giving it generative capabilities. It achieves SOTA among sequence-based models for conditional generation and motif scaffolding, outperforming DPLM (650M), EvoDiff-MSA, and ProtMamba on key benchmarks.
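For a rough sense of the objective, here is a minimal sketch of one training step of a masking-based ("absorbing state") discrete diffusion loss; the noise schedule, loss weighting, and the `model(corrupted, context)` signature are simplifying assumptions, not the actual RAG-ESM recipe.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, tokens, context, mask_id, pad_mask):
    """tokens: (B, L) amino-acid ids; pad_mask: (B, L) bool, True on real tokens;
    model(corrupted, context) returns (B, L, vocab) logits for the denoised sequence."""
    B, L = tokens.shape
    t = torch.rand(B, 1, device=tokens.device)                         # per-sequence noise level in (0, 1)
    corrupt = (torch.rand(B, L, device=tokens.device) < t) & pad_mask  # mask each real position w.p. t
    corrupted = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)
    logits = model(corrupted, context)                                 # denoiser conditioned on the homolog
    return F.cross_entropy(logits[corrupt], tokens[corrupt])           # loss only on corrupted positions
```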

6/7
April 11, 2025 at 2:47 PM
An unexpected result: Several cross-attention heads naturally learn to align the input and context sequences, even though the model is trained on unaligned data. This alignment capability emerges purely from the training objective (no explicit alignment supervision).
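One way to see this is to plot a head's cross-attention weights as a (target × homolog) matrix and look for high-weight diagonals. The helper below is a sketch under the assumption that the weights come out as an (n_heads, L_target, L_homolog) tensor; the published rag-esm code may expose them differently.

```python
import torch
import matplotlib.pyplot as plt

def plot_cross_attention(attn: torch.Tensor, head: int = 0) -> None:
    """attn: (n_heads, L_target, L_homolog) weights from one cross-attention layer."""
    plt.imshow(attn[head].detach().cpu(), aspect="auto", cmap="viridis")
    plt.xlabel("homolog (context) position")
    plt.ylabel("input (target) position")
    plt.title(f"cross-attention head {head}")
    plt.colorbar(label="attention weight")
    plt.show()
```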

5/7
April 11, 2025 at 2:47 PM
Using just one homolog as context, RAG-ESM models (12M and 165M params) outperform ESM2 (650M) on masked token prediction. We obtain a 40–50% reduction in perplexity despite using far fewer parameters.

4/7
April 11, 2025 at 2:47 PM
Conditioning on homologs reduces the effective dimensionality of the search space during inference. Instead of encoding information about entire protein families internally, the model can focus its weights on more nuanced biological features.

3/7
April 11, 2025 at 2:47 PM
What does RAG-ESM do?
It augments ESM2 with a few lightweight cross-attention layers that let us condition the model on retrieved homologous sequences. This allows the model to leverage evolutionary information during inference without retraining.
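A minimal PyTorch sketch of the general idea follows; the layer placement, normalization, and hidden size are assumptions for illustration, not the exact RAG-ESM architecture.

```python
import torch
import torch.nn as nn

class HomologCrossAttention(nn.Module):
    """Lightweight block letting the input sequence attend to a retrieved homolog."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (B, L_in,  d) hidden states of the input sequence
        # context: (B, L_ctx, d) hidden states of the retrieved homolog
        out, _ = self.attn(self.norm(x), context, context)
        return x + out  # residual: with an uninformative context the block degrades gracefully

# usage sketch: interleave such blocks between pretrained ESM2 transformer layers
block = HomologCrossAttention(d_model=480)   # hidden size is illustrative
x = torch.randn(2, 120, 480)                 # input-sequence hidden states
ctx = torch.randn(2, 150, 480)               # homolog hidden states
print(block(x, ctx).shape)                   # torch.Size([2, 120, 480])
```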

2/7
April 11, 2025 at 2:47 PM