Jeff Ruffolo
jeffruffolo.bsky.social
Protein Design / ML @ Profluent Bio | Molecular Biophysics PhD @ Johns Hopkins
Not only do we see compelling benchmark performance, but these aligned capabilities also extend to generative settings, which is what really matters for design. In other words, with just a bit of data we can steer the models to generate the high-fitness sequences we want.
April 17, 2025 at 8:29 PM
Coming back to fitness prediction, we wanted to see if this greater understanding of protein sequence space translated to stronger predictive power. We turned to alignment, where we use a bit of experimental data to tilt the model towards properties we care about, like stability.
April 17, 2025 at 8:29 PM
This extends even to proteins that had low (or no) homology to anything in the models’ training data, where we still see comparable rates of protein expression, including for proteins with very low AlphaFold2 pLDDT.
April 17, 2025 at 8:29 PM
To put this to the test, we measured the viability (expression) of hundreds of proteins in the lab and found that this added diversity is real. Generated proteins are as viable as natural proteins, and larger models can come up with more and more of them.
April 17, 2025 at 8:29 PM
So what should we be evaluating? Generative models like ProGen3 are fundamentally trained to generate proteins, so we just let the models generate! We found that as models scale, they not only generate higher-quality sequences but also produce considerably more diversity.
April 17, 2025 at 8:29 PM
But why do all of this? What does scaling get us? ProteinGym is a nice benchmark for measuring zero-shot fitness prediction, but even three years ago (ProGen2) we found that this wasn’t the best proxy for evaluating scaling, and we still find that to be the case.
April 17, 2025 at 8:29 PM
We developed optimal scaling laws that allowed us to scale up to 46B parameters, where we continue to see signs of generalization on diverse proteins far from the training data.
April 17, 2025 at 8:29 PM
ProGen3 is a family of MoE models ranging from 112M to 46B parameters, capable of full sequence generation, as well as infilling. For practical protein design problems, having these new capabilities opens up a lot of new possibilities.
April 17, 2025 at 8:29 PM