Giannis Daras
giannisdaras.bsky.social
@MIT Postdoctoral Researcher.
Read the full paper for more results, including motif scaffolding:
biorxiv.org/content/10.1...
Code, Models, and Data (full release very soon):
github.com/jozhang97/am...
July 8, 2025 at 4:05 AM
Our framework builds on recent innovations in training diffusion models from corrupted data.

AF corruption is unstructured, not explicitly modeled, and varies with protein size and topology. Yet our framework still handles it.
July 8, 2025 at 4:05 AM
Our model does not simply memorize the dataset.

We also improve novelty: the model generates more unique structures.

This comes from training on more datapoints, since low pLDDT AF structures are not filtered out as they were in prior work.
July 8, 2025 at 4:05 AM
We build on Genie2. Scaling it to 17M params, changing the dataset, and training on longer proteins all yield gains.

Our framework further boosts performance, leading to the best model for short and long protein generation.

Handling noise properly matters more than architecture.
July 8, 2025 at 4:05 AM
Beyond algorithmic advances, we re-clustered AFDB since we found significant structural duplication across evolutionarily distant clusters.

This redundancy causes an overrepresentation of common motifs.

We fix it by tuning FoldSeek to explicitly focus on structural topology.
July 8, 2025 at 4:05 AM
The results are quite strong.

Ambient Protein Diffusion substantially outperforms previous baselines in short and long protein generation.

For short proteins, we dominate the Pareto frontier between designability and diversity, using a ~13x smaller model than previous SOTA.
July 8, 2025 at 4:05 AM
Ambient Protein Diffusion treats low pLDDT AF structures as low-quality data.

Instead of filtering them out (as done in prior work), we use them for a subset of the diffusion times.

Enough noise "erases" the AF mistakes, and we can still learn from those structures.
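
To make the "subset of the diffusion times" idea concrete, here is a minimal sketch of how a training loop might restrict which noise levels each structure is used for. It is not the paper's actual schedule: the cutoff `PLDDT_THRESHOLD`, the floor `LOW_QUALITY_T_MIN`, and both function names are hypothetical values and helpers chosen only for illustration.

```python
import random

# Hypothetical values, chosen only for illustration (not from the paper).
PLDDT_THRESHOLD = 80.0    # structures at or above this are treated as clean
LOW_QUALITY_T_MIN = 0.6   # low pLDDT structures are only used for t >= this


def min_diffusion_time(plddt: float) -> float:
    """Smallest diffusion time at which a structure is used for training."""
    # High-confidence structures can be used at every noise level.
    # Low-confidence structures are used only at high noise, where the
    # added noise is assumed to dominate the prediction errors.
    return 0.0 if plddt >= PLDDT_THRESHOLD else LOW_QUALITY_T_MIN


def sample_time(plddt: float, rng: random.Random) -> float:
    """Sample a diffusion time uniformly from [t_min, 1) for one structure."""
    t_min = min_diffusion_time(plddt)
    return t_min + (1.0 - t_min) * rng.random()


rng = random.Random(0)
batch_plddt = [95.0, 85.0, 62.0, 40.0]  # made-up mean pLDDT scores
times = [sample_time(p, rng) for p in batch_plddt]

for p, t in zip(batch_plddt, times):
    print(f"pLDDT={p:5.1f}  t_min={min_diffusion_time(p):.1f}  sampled t={t:.3f}")
```

In this sketch, no structure is discarded: a low pLDDT structure simply never contributes to the low-noise part of training.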
July 8, 2025 at 4:05 AM
Obtaining large structure datasets experimentally is impossible.

SOTA protein structure models are trained on subsets of AFDB (214M AlphaFold-predicted structures).

AF accuracy drops with increasing protein length and complexity, making it hard to generate such proteins.
July 8, 2025 at 4:05 AM
Announcing Ambient Protein Diffusion, a state-of-the-art 17M-params generative model for protein structures.

For long proteins, diversity improves by 91% and designability by 26% over the previous 200M-param SOTA model.

The trick? Treat low pLDDT AlphaFold predictions as low-quality data.
July 8, 2025 at 4:05 AM