Lightnews — Scholar-powered news

Giannis Daras

@giannisdaras.bsky.social

Read the full paper for more results, including motif scaffolding:
biorxiv.org/content/10.1...
Code, Models, and Data (full release very soon):
github.com/jozhang97/am...

July 8, 2025 at 4:05 AM

Giannis Daras

@giannisdaras.bsky.social

Our framework builds on recent innovations in training diffusion models from corrupted data.

AF corruption is not structured, it is not explicitly modeled, and it varies across protein size and topology. Yet, our framework still handles it.

July 8, 2025 at 4:05 AM

Giannis Daras

@giannisdaras.bsky.social

Our model does not simply memorize the dataset.

Novelty improves too: the model generates more unique structures.

The gain comes from training on more datapoints, since low pLDDT AF structures are kept rather than filtered out as in prior work.

July 8, 2025 at 4:05 AM

Giannis Daras

@giannisdaras.bsky.social

We build on Genie2: scaling it to 17M params, changing the dataset, and training on longer proteins all yield gains.

Our framework further boosts performance, leading to the best model for short and long protein generation.

Handling noise properly matters more than architecture.

July 8, 2025 at 4:05 AM

Giannis Daras

@giannisdaras.bsky.social

Beyond algorithmic advances, we re-clustered AFDB since we found significant structural duplication across evolutionarily distant clusters.

This redundancy causes an overrepresentation of common motifs.

We fix it by tuning FoldSeek to explicitly focus on structural topology.

July 8, 2025 at 4:05 AM

Giannis Daras

@giannisdaras.bsky.social

The results are quite strong.

Ambient Protein Diffusion substantially outperforms previous baselines in short and long protein generation.

For short proteins, we dominate the Pareto frontier between designability and diversity, using a ~13x smaller model than previous SOTA.
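
To make the Pareto-frontier claim concrete, here is a minimal sketch of how a frontier between two "higher is better" metrics can be computed. The function name `pareto_front` and the (designability, diversity) numbers are made up for illustration; they are not results from the paper.

```python
def pareto_front(points):
    """Return the points that no other point dominates.

    Each point is a (designability, diversity) pair, higher is better on both.
    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for i, (d1, v1) in enumerate(points):
        dominated = any(
            d2 >= d1 and v2 >= v1 and (d2 > d1 or v2 > v1)
            for j, (d2, v2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((d1, v1))
    return sorted(front)


# Hypothetical (designability, diversity) scores, for illustration only.
models = [(0.9, 0.3), (0.8, 0.5), (0.7, 0.4), (0.6, 0.7)]
front = pareto_front(models)
```

A model "dominates the frontier" when its operating points make up the frontier and every competing point is dominated.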

July 8, 2025 at 4:05 AM

Giannis Daras

@giannisdaras.bsky.social

Ambient Protein Diffusion treats low pLDDT AF structures as low-quality data.

Instead of filtering them out (as done in prior work), we use them for a subset of the diffusion times.

Enough noise "erases" the AF mistakes, and we can still learn from those structures.
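
The idea of using a structure only for a subset of the diffusion times can be sketched roughly as below. This is an assumption-laden illustration, not the paper's code: the linear pLDDT-to-time mapping, the threshold `high=80.0`, the cap `max_t_min=0.8`, and the function names are all hypothetical.

```python
import random


def min_noise_time(plddt, high=80.0, max_t_min=0.8):
    """Smallest diffusion time t in [0, 1] at which a structure is used.

    Hypothetical schedule: structures with pLDDT >= `high` are trusted at
    every time (t_min = 0). Lower-confidence structures are only used once
    enough noise has been added, with t_min growing linearly up to `max_t_min`.
    """
    if plddt >= high:
        return 0.0
    return max_t_min * (high - plddt) / high


def usable(plddt, t):
    """True if a structure with this pLDDT may contribute to the loss at time t."""
    return t >= min_noise_time(plddt)


def sample_time(plddt, rng):
    """Sample a training time uniformly from the range allowed for this structure."""
    return rng.uniform(min_noise_time(plddt), 1.0)


rng = random.Random(0)
times = [sample_time(40.0, rng) for _ in range(1000)]
```

With these assumed values, a structure with pLDDT 40 is only seen at times of roughly 0.4 and above, where the added noise is meant to swamp the prediction errors, while a structure with pLDDT 90 is used at all times.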

July 8, 2025 at 4:05 AM

Giannis Daras

@giannisdaras.bsky.social

Obtaining large structure datasets experimentally is impossible.

SOTA protein structure models are trained on subsets of AFDB (214M AlphaFold-predicted structures).

AF accuracy drops with increasing protein length and complexity, making it hard to generate such proteins.

July 8, 2025 at 4:05 AM

Giannis Daras

@giannisdaras.bsky.social

Announcing Ambient Protein Diffusion, a state-of-the-art 17M-params generative model for protein structures.

Diversity improves by 91% and designability by 26% over the previous 200M SOTA model for long proteins.

The trick? Treat low pLDDT AlphaFold predictions as low-quality data.

July 8, 2025 at 4:05 AM
