Darin Tsui
darintsui.bsky.social
PhD candidate at Georgia Tech | ML for bioengineering | UC San Diego '23
darintsui.github.io
We have released SHAP zero as a Python package, opening the door to efficient, principled, and scalable interpretability of biological sequence models!
⭐ Paper: arxiv.org/abs/2410.19236
⭐ Code: github.com/amirgroup-co...
SHAP zero Explains Biological Sequence Models with Near-zero Marginal Cost for Future Queries
The growing adoption of machine learning models for biological sequences has intensified the need for interpretable predictions, with Shapley values emerging as a theoretically grounded standard for m...
September 22, 2025 at 3:17 PM
We then applied SHAP zero to extract epistatic interactions from protein language models. Even though the total feature space exceeds a trillion, SHAP zero ran up to 7x faster in amortized time and uncovered interactions associated with structural stability.
September 22, 2025 at 3:16 PM
We demonstrate the power of SHAP zero by applying it to guide RNA and DNA repair models. SHAP zero uncovered high-order interactions at scale that correspond to known biological motifs, a task previously inaccessible because of the combinatorial size of the space of feature interactions.
September 22, 2025 at 3:15 PM
Our secret? We connect the sparse Fourier transform of a model with Shapley explanations. If the model is "compressible" (which many biological sequence models are!), SHAP zero amortizes the cost of computing feature interactions, running up to 1000x faster than current methods.
September 22, 2025 at 3:14 PM
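To make the Fourier-to-Shapley link concrete, here is a minimal toy sketch. It is not the SHAP zero implementation: it assumes a binary alphabet (the paper works with biological, q-ary alphabets), a uniform background distribution for the Shapley value function, and a made-up set of sparse coefficients named `coeffs`. Under those assumptions, each Fourier term on a set of positions S splits its contribution evenly among the positions in S, so the Shapley values can be read off the coefficients directly. The code checks that closed form against a brute-force Shapley computation.

```python
from itertools import combinations, product
from math import factorial

n = 4  # toy sequence length over a binary alphabet (assumption for illustration)

# Hypothetical sparse Fourier (Walsh-Hadamard) coefficients: {positions: coefficient}
coeffs = {(): 0.5, (0,): 1.0, (1, 2): -2.0, (0, 2, 3): 0.75}

def chi(S, x):
    """Parity character: (-1) ** (sum of x[i] for i in S)."""
    return -1.0 if sum(x[i] for i in S) % 2 else 1.0

def model(x):
    """Toy 'compressible' model: a sum of only a few Fourier terms."""
    return sum(c * chi(S, x) for S, c in coeffs.items())

def shap_from_fourier(x):
    """Each term on set S contributes c_S * chi_S(x) / |S| to every position in S."""
    phi = [0.0] * n
    for S, c in coeffs.items():
        for i in S:
            phi[i] += c * chi(S, x) / len(S)
    return phi

def shap_brute_force(x):
    """Exact Shapley values by enumerating all coalitions (exponential in n)."""
    background = list(product([0, 1], repeat=n))

    def value(T):
        # Mean model output with positions in T fixed to x, the rest drawn uniformly.
        total = 0.0
        for z in background:
            mixed = tuple(x[i] if i in T else z[i] for i in range(n))
            total += model(mixed)
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for T in combinations(others, k):
                phi[i] += weight * (value(set(T) | {i}) - value(set(T)))
    return phi

x = (1, 0, 1, 1)
fast = shap_from_fourier(x)
slow = shap_brute_force(x)
```

The closed form touches only the nonzero coefficients, while the brute-force version enumerates every coalition and every background sequence. That gap is why sparsity in the Fourier domain matters.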
Our core idea: instead of recomputing explanations for every new sequence from scratch, we pay a one-time cost to create a global sketch of the model. This enables SHAP zero to explain biological sequences from this sketch with near-zero marginal cost.
September 22, 2025 at 3:13 PM
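The "pay once, then explain cheaply" pattern can be illustrated with a small toy sketch. This is not the SHAP zero package or its API. It assumes a binary alphabet, a uniform background for the Shapley value function, and a hypothetical `black_box` model. It also uses an exhaustive Walsh-Hadamard transform as a stand-in for the sparse Fourier transform, which is designed to recover the same coefficients from far fewer model queries. A counter tracks how often the model is called.

```python
from itertools import product

n = 6       # toy sequence length over a binary alphabet (assumption)
calls = 0   # number of times the model has been queried

def black_box(x):
    """Hypothetical model with three interaction terms; counts its own calls."""
    global calls
    calls += 1
    s = [1 - 2 * v for v in x]  # map {0, 1} to {+1, -1}
    return 2.0 * s[0] - 1.5 * s[1] * s[4] + 0.5 * s[2] * s[3] * s[5]

def parity(mask, x):
    """(-1) ** (number of positions where both mask and x are 1)."""
    return -1.0 if sum(m & v for m, v in zip(mask, x)) % 2 else 1.0

def build_sketch(f, tol=1e-9):
    """One-time cost: query the model everywhere, keep only nonzero coefficients."""
    inputs = list(product([0, 1], repeat=n))
    outputs = [f(x) for x in inputs]
    sketch = {}
    for mask in inputs:  # each 0/1 mask marks a subset of positions
        c = sum(y * parity(mask, x) for x, y in zip(inputs, outputs)) / len(inputs)
        if abs(c) > tol:
            sketch[mask] = c
    return sketch

def explain(sketch, x):
    """Shapley values for x from the sketch alone, with no model calls."""
    phi = [0.0] * n
    for mask, c in sketch.items():
        order = sum(mask)
        if order == 0:
            continue  # the constant term is the baseline, not an attribution
        term = c * parity(mask, x) / order
        for i in range(n):
            if mask[i]:
                phi[i] += term
    return phi

sketch = build_sketch(black_box)
calls_after_sketch = calls

queries = [(0, 1, 0, 1, 1, 0), (1, 1, 1, 0, 0, 0), (0, 0, 0, 0, 0, 1)]
explanations = [explain(sketch, x) for x in queries]
calls_after_queries = calls
```

In this toy, all model calls happen while building the sketch. Each later explanation costs only a pass over the few stored coefficients.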
The success of biological sequence models has created an urgent need to explain their predictions. However, computing Shapley values, often the gold standard of explanations, over thousands of sequences to extract biological insight remains computationally prohibitive.
September 22, 2025 at 3:13 PM