Giannis Daras
giannisdaras.bsky.social
@MIT Postdoctoral Researcher.
Joint work with the amazing Jeffrey Zhang (equal contribution) and with other wonderful people: D. Diaz, K. Ravishankar, W. Daspit, A. Klivans, C. Daskalakis, Q. Liu.
July 8, 2025 at 4:05 AM
Read the full paper for more results, including motif scaffolding:
biorxiv.org/content/10.1...
Code, Models, and Data (full release very soon):
github.com/jozhang97/am...
July 8, 2025 at 4:05 AM
Our framework builds on recent innovations in training diffusion models from corrupted data.

AF corruption is not structured, is not explicitly modeled, and varies with protein size and topology. Yet our framework still handles it.
July 8, 2025 at 4:05 AM
Our model does not simply memorize the dataset.

We improve novelty: the model generates more unique structures.

This comes from training on more datapoints, because, unlike prior work, we do not filter out low-pLDDT AF structures.
July 8, 2025 at 4:05 AM
We build on Genie2: scaling it to 17M params, changing the dataset, and training on longer proteins all yield gains.

Our framework further boosts performance, leading to the best model for short and long protein generation.

Handling noise properly matters more than architecture.
July 8, 2025 at 4:05 AM
Beyond algorithmic advances, we re-clustered AFDB since we found significant structural duplication across evolutionarily distant clusters.

This redundancy causes an overrepresentation of common motifs.

We fix it by tuning FoldSeek to explicitly focus on structural topology.
July 8, 2025 at 4:05 AM
The results are quite strong.

Ambient Protein Diffusion substantially outperforms previous baselines in short and long protein generation.

For short proteins, we dominate the Pareto frontier between designability and diversity, using a ~13x smaller model than previous SOTA.
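
For readers unfamiliar with the term, here is a minimal sketch of what "dominating the Pareto frontier" means for two metrics that are both maximized. The function name `pareto_front` and the toy (designability, diversity) pairs are illustrative assumptions, not numbers from the paper.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is a (designability, diversity) pair; higher is better on both.
    A point is dominated if another point is at least as good on both metrics
    and strictly better on at least one.
    """
    front = []
    for i, (d1, v1) in enumerate(points):
        dominated = any(
            (d2 >= d1 and v2 >= v1) and (d2 > d1 or v2 > v1)
            for j, (d2, v2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((d1, v1))
    return front


# Toy values for illustration only (not results from the paper).
models = [(0.9, 0.3), (0.8, 0.5), (0.7, 0.4), (0.6, 0.7)]
front = pareto_front(models)
```

In this toy set, (0.7, 0.4) is dropped because (0.8, 0.5) beats it on both metrics; the other three points form the frontier.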
July 8, 2025 at 4:05 AM
Ambient Protein Diffusion treats low pLDDT AF structures as low-quality data.

Instead of filtering them out (as done in prior work), we use them for a subset of the diffusion times.

Enough noise "erases" the AF mistakes, and we can still learn from those structures.
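
A rough sketch of the idea, under stated assumptions: each structure gets a minimum diffusion time based on its mean pLDDT, and training only samples times at or above it. The thresholds (`high`, `low`, `t_max_low`), the linear mapping, and the names `min_noise_time` and `sample_time` are all hypothetical; the paper may choose these differently.

```python
import random


def min_noise_time(plddt, high=80.0, low=50.0, t_max_low=0.6):
    """Map mean pLDDT to the earliest usable diffusion time in [0, 1].

    Assumed rule (hypothetical, not from the paper):
    - pLDDT >= high: trusted structure, usable at every time (returns 0.0).
    - pLDDT <= low: usable only at high noise (returns t_max_low).
    - in between: linear interpolation.
    """
    if plddt >= high:
        return 0.0
    if plddt <= low:
        return t_max_low
    frac = (high - plddt) / (high - low)
    return frac * t_max_low


def sample_time(plddt, rng):
    """Sample a training diffusion time, restricted to the allowed range."""
    return rng.uniform(min_noise_time(plddt), 1.0)


rng = random.Random(0)
t_confident = sample_time(92.0, rng)  # any time in [0, 1]
t_uncertain = sample_time(45.0, rng)  # only times in [0.6, 1]
```

With this rule, a low-pLDDT structure still contributes gradient signal, but only at noise levels high enough to mask its likely errors.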
July 8, 2025 at 4:05 AM
Obtaining large protein structure datasets experimentally is not feasible.

SOTA protein structure models are trained on subsets of AFDB (214M AlphaFold-predicted structures).

AF accuracy drops with increasing protein length and complexity, making it hard to generate such proteins.
July 8, 2025 at 4:05 AM
Thanks a lot for your interest and for your post!
Let me know if you have any questions or thoughts!
July 8, 2025 at 3:57 AM