Aditi Merchant
adititm.bsky.social
BioE PhD student @ Stanford in the Hie Lab // ML for Synthetic Biology
If you’re interested in learning more or have any questions or feedback, definitely reach out! The preprint and a direct link to the PDF (since bioRxiv seems to be having some server issues) are below! N/N

www.biorxiv.org/content/10.1...

evodesign.org/Semantic_Min...
December 19, 2024 at 6:54 PM
This work was a massive collaborative effort with my amazing fellow graduate students Samuel King and Eric Nguyen! And of course, none of this would have happened without the incredible mentorship of @brianhie.bsky.social! Very fortunate to work with such inspiring scientists daily :) 13/N
December 19, 2024 at 6:54 PM
Ultimately, this study suggests that biological sequence models may be able to nontrivially generalize beyond known evolutionary space and that prompt engineering can be a valuable tool for steering generation towards desired functional outcomes. 12/N
December 19, 2024 at 6:54 PM
SynGenome is publicly available at evodesign.org/syngenome/. You could use SynGenome to find diversified natural proteins, functionally characterize uncharacterized genes, or find highly divergent proteins with potentially conserved functions. We’re excited to see what the community can find! 11/N
SynGenome
100 billion base pairs of AI-generated genomic sequence
evodesign.org
December 19, 2024 at 6:54 PM
To generate SynGenome, we used prompts derived from the genes encoding prokaryotic proteins in UniProt, reasoning that the resultant generations may be enriched for functions related to the proteins the prompts were derived from. 10/N
December 19, 2024 at 6:54 PM
Finally, to apply semantic mining to generate functional genes from across prokaryotic biology, we developed SynGenome, a database containing over 120 billion base pairs of synthetic DNA sequences. 9/N
December 19, 2024 at 6:54 PM
Despite this high diversity, 17% of the Acr designs we tested were functional. Additionally, many of our experimentally validated Acrs had low-confidence AF3 structure predictions, and two eluded significant structural or sequence characterization, making them akin to “de novo” genes (!) 8/N
December 19, 2024 at 6:54 PM
We then applied semantic mining to see if we could design new anti-CRISPR (Acr) proteins, a highly diverse group of proteins with limited sequence or structural conservation thought to sometimes emerge via de novo gene birth. 7/N
December 19, 2024 at 6:54 PM
Half of the Evo-designed antitoxins we experimentally tested were functional (!), with most possessing only remote homology to natural proteins and some appearing to neutralize diverse toxin classes. 6/N
December 19, 2024 at 6:54 PM
We then applied semantic mining to generate a multi-gene bacterial toxin-antitoxin (TA) system. Using context from known TA systems as prompts, we first designed and experimentally validated a toxin gene. This toxin gene then served as a prompt for Evo to generate new conjugate antitoxins. 5/N
December 19, 2024 at 6:54 PM
As an initial test, we demonstrated that Evo 1.5, a new version of Evo with extended pretraining, was able to understand genomic context, showing that it could complete highly conserved genes and operons when prompted with only partial sequences. 4/N
December 19, 2024 at 6:54 PM
Taking inspiration from genome mining techniques using guilt-by-association, we hypothesized that by prompting Evo with a gene encoding a desired function, we could guide the model to generate a new gene with a related function. We term this approach “semantic mining.” 3/N
December 19, 2024 at 6:54 PM
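For readers who think in code, the prompting idea in 3/N can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the pipeline from the preprint: `semantic_mine` and `toy_generate` are hypothetical names, and `toy_generate` is a random-base stand-in that does not call Evo or reflect its actual interface.

```python
import random
from typing import Callable, List

def semantic_mine(prompt: str,
                  generate: Callable[[str, int], str],
                  n_samples: int = 4,
                  max_len: int = 60) -> List[str]:
    """Sample continuations of a DNA prompt and keep the unique ones."""
    seen = set()
    candidates = []
    for _ in range(n_samples):
        continuation = generate(prompt, max_len)
        if continuation not in seen:
            seen.add(continuation)
            candidates.append(continuation)
    return candidates

def toy_generate(prompt: str, max_len: int, _rng=random.Random(0)) -> str:
    # Hypothetical stand-in for a genomic language model: emits random
    # bases and ignores the prompt. A real model would condition on it.
    return "".join(_rng.choice("ACGT") for _ in range(max_len))

# Hypothetical prompt: a short fragment standing in for a gene of known function.
prompt = "ATGACCGGTAAACTG"
candidates = semantic_mine(prompt, toy_generate)
print(len(candidates), "candidate sequences")
```

In practice the candidates would still need filtering and experimental testing, as the later posts describe.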
Just as words derive meaning from their context, DNA gains functional significance within the context of genes, operons, and genomes. In prokaryotes, genes with related functions are often grouped together in close proximity on the DNA sequence. 2/N
December 19, 2024 at 6:54 PM
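The guilt-by-association intuition in 2/N, that nearby genes often share related functions, can be illustrated with a small Python sketch. The gene names, coordinates, and the 5 kb window below are made-up assumptions for illustration, not data or parameters from the preprint.

```python
from typing import List, NamedTuple

class Gene(NamedTuple):
    name: str
    start: int  # start coordinate on the contig (hypothetical values)
    end: int    # end coordinate, exclusive

def neighbors(genes: List[Gene], query_name: str, window: int = 5_000) -> List[str]:
    """Return names of genes overlapping a window around the query gene."""
    query = next(g for g in genes if g.name == query_name)
    lo, hi = query.start - window, query.end + window
    return [g.name for g in genes
            if g.name != query_name and g.end > lo and g.start < hi]

# Toy contig: a toxin gene, an adjacent antitoxin, and a distant gene.
contig = [
    Gene("toxin", 10_000, 10_400),
    Gene("antitoxin", 10_450, 10_750),
    Gene("unrelated", 60_000, 61_000),
]
print(neighbors(contig, "toxin"))
```

Here only the adjacent antitoxin falls inside the window around the toxin, which is the kind of proximity signal genome mining relies on.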