Amy Lu
amyxlu.bsky.social
CS PhD Student at UC Berkeley & AI for drug discovery at Prescient Design 🇨🇦
Reposted by Amy Lu
• introduced “zero-shot prediction” as the question of predicting a bioassay’s outcome from the likelihoods of pLMs
• commented on biases in the evolutionary signal from the Tree of Life used to train pLMs (a favorite paper I read in 2024: shorturl.at/fbC7g)
December 16, 2024 at 6:29 AM
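The “zero-shot prediction” framing above is often implemented as a masked-marginal pseudo-log-likelihood under the pLM: mask each position, score the true residue, and sum. A minimal NumPy sketch, where `logits_fn` is a hypothetical stand-in for a trained pLM:

```python
import numpy as np

def pll_score(logits_fn, tokens, mask_id):
    """Masked-marginal pseudo-log-likelihood: mask each position in turn
    and sum the model's log-probability of the true residue there.
    `logits_fn` maps an (L,) int token array to (L, vocab) logits."""
    total = 0.0
    for i in range(len(tokens)):
        masked = tokens.copy()
        masked[i] = mask_id
        logits = logits_fn(masked)
        # numerically stable log-softmax at position i
        z = logits[i] - logits[i].max()
        logp = z - np.log(np.exp(z).sum())
        total += logp[tokens[i]]
    return total
```

Higher scores mean the pLM finds the sequence more plausible, which is the signal used to rank variants in a bioassay without any task-specific training.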
Another straightforward application is generation, either by next-token sampling or MaskGIT-style denoising. We made the tokenized version of CHEAP to do generation, and decided to go with diffusion on continuous embeddings instead, but I think either would've worked
December 10, 2024 at 1:04 AM
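The MaskGIT-style route mentioned above starts from a fully masked sequence and iteratively commits the most confident predictions. A minimal NumPy sketch with a stand-in `model` and an illustrative cosine schedule (not PLAID's or CHEAP's actual sampler):

```python
import numpy as np

def maskgit_sample(model, length, vocab_size, mask_id, steps=8):
    """MaskGIT-style iterative unmasking sketch. `model` is a hypothetical
    bidirectional denoiser mapping an (L,) int token array to (L, vocab)
    logits; masked positions hold `mask_id`."""
    tokens = np.full(length, mask_id, dtype=np.int64)
    for step in range(steps):
        logits = model(tokens).copy()
        logits[:, mask_id] = -np.inf            # never predict the mask token
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        pred = probs.argmax(-1)
        conf = probs.max(-1)
        conf[tokens != mask_id] = -1.0          # only consider masked slots
        # cosine schedule: fraction of positions left masked after this step
        frac = np.cos((step + 1) / steps * np.pi / 2)
        n_unmask = int((tokens == mask_id).sum()) - int(frac * length)
        if n_unmask > 0:
            idx = np.argsort(-conf)[:n_unmask]  # most confident masked slots
            tokens[idx] = pred[idx]
    return tokens
```

At the final step the schedule hits zero, so every remaining position is committed and the output contains no mask tokens.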
immensely grateful for awesome collaborators on this work: Wilson Yan, Sarah Robinson, @kevinkaichuang.bsky.social, Vladimir Gligorijevic, @kyunghyuncho.bsky.social, Rich Bonneau, Pieter Abbeel, @ncfrey.bsky.social 🫶
December 6, 2024 at 5:44 PM
6/ We'll get to share PLAID as an oral presentation at MLSB next week 🥳 In the meantime, check out:

📄Preprint: biorxiv.org/content/10.1...
👩‍💻Code: github.com/amyxlu/plaid
🏋️Weights: huggingface.co/amyxlu/plaid...
🌐Website: amyxlu.github.io/plaid/
🍦Server: coming soon!
December 6, 2024 at 5:44 PM
5/🚀 ...and when prompted by function, PLAID learns sequence motifs at active sites & directly outputs sidechain positions, which backbone-only methods such as RFDiffusion can't do out-of-the-box.

The residues aren't directly adjacent, suggesting that the model isn't simply memorizing training data:
December 6, 2024 at 5:44 PM
4/ On unconditional generation, PLAID generates high quality and diverse structures, especially at longer sequence lengths where previous methods underperform...
December 6, 2024 at 5:44 PM
3/ I was pretty stuck until building out the CHEAP (bit.ly/cheap-proteins) autoencoders that compressed & smoothed out the latent space: interestingly, gradual noise added to the ESMFold latent space doesn't actually corrupt the sequence and structure until the final forward diffusion timesteps 🤔
December 6, 2024 at 5:44 PM
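The probe described above, adding gradual forward-diffusion noise to a latent and checking how much signal survives at each timestep, can be sketched with a standard DDPM linear-beta schedule (illustrative values, not the schedule from the paper):

```python
import numpy as np

def forward_diffuse(x0, t, T=1000, rng=None):
    """DDPM forward noising at timestep t with a linear beta schedule:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
    A sketch of the kind of latent-corruption probe described above."""
    if rng is None:
        rng = np.random.default_rng()
    betas = np.linspace(1e-4, 0.02, T)
    abar = np.cumprod(1.0 - betas)              # cumulative signal fraction
    eps = rng.normal(size=x0.shape)
    return np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps

# probe: how quickly does the noised latent decorrelate from the original?
rng = np.random.default_rng(0)
x0 = rng.normal(size=(256,))
for t in (0, 250, 500, 999):
    xt = forward_diffuse(x0, t, rng=rng)
    cos = x0 @ xt / (np.linalg.norm(x0) * np.linalg.norm(xt))
    print(f"t={t:4d}  cos(x0, x_t)={cos:.3f}")
```

Decoding `x_t` back through frozen decoders at each `t` is the analogous experiment in latent space: the observation above is that sequence and structure survive until surprisingly late timesteps.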
2/💡Co-generating sequence and structure is hard. A key insight is that to get embeddings of the ESMFold latent space during training, we only need sequence inputs.

For inference, we can sample latent embeddings & use frozen sequence/structure decoders to get all-atom structure:
December 6, 2024 at 5:44 PM
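The inference recipe in this post, sample a latent embedding and push it through frozen decoders, reduces to a few lines. All three callables (`denoiser`, `seq_decoder`, `struct_decoder`) are hypothetical stand-ins, not PLAID's actual interfaces:

```python
import numpy as np

def generate(denoiser, seq_decoder, struct_decoder, length, dim,
             n_steps=50, rng=None):
    """Two-stage inference sketch: reverse-diffuse a latent from Gaussian
    noise, then decode it with frozen sequence and structure decoders."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.normal(size=(length, dim))   # start from pure noise in latent space
    for t in reversed(range(n_steps)):
        x = denoiser(x, t)               # one reverse-diffusion step
    sequence = seq_decoder(x)            # latent -> amino-acid string
    structure = struct_decoder(x)        # latent -> all-atom coordinates
    return sequence, structure
```

Because both decoders are frozen, only the denoiser is trained, and training it needs nothing but sequence inputs to produce the target latents.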