Ameya Godbole
@ameyagodbole.bsky.social
PhD student at USC NLP working on generalization and reasoning; prev UMass Amherst, IITG (he/him)
Our team dedicated so much effort over the last year: @johntzwei.bsky.social, me, @aflah02101.bsky.social, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P Gummadi, @willieneis.bsky.social, @robinjia.bsky.social

We are grateful to everyone who provided support throughout the project! 🥹
October 24, 2025 at 6:21 PM
Hubble Suite
allegro-lab.github.io
October 24, 2025 at 6:21 PM
For this project, NVIDIA AI provided 200K A100 hours on DGX Cloud through the NSF NAIRR pilot, and @hf.co provided 100TB of storage. Training used @eleutherai.bsky.social's GPT-NeoX and LM Evaluation Harness.

Thank you for your commitment to open-source science!
October 24, 2025 at 6:21 PM
Since each perturbation is randomly duplicated zero or more times, you can make a wide range of comparisons and measurements. 🔍💫

We show Hubble is an ideal benchmark for membership inference and unlearning, and we invite the community to further explore and build on Hubble ✨
October 24, 2025 at 6:21 PM
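
As a taste of what this enables, here's a minimal loss-based membership-inference sketch, stratified by duplication count. The repo id and the `dup_count` field are placeholders for illustration, not our released API:

```python
# Sketch: per-sequence loss on perturbations, grouped by duplication count.
# The repo id and the `dup_count` field are illustrative placeholders.
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "allegro-lab/hubble-1b-100b-perturbed"  # placeholder name
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(REPO).eval()

@torch.no_grad()
def seq_loss(text: str) -> float:
    """Mean token-level cross-entropy of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

def loss_by_dup_count(records):
    """records: [{"text": str, "dup_count": int}, ...] -> {dup_count: mean loss}."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["dup_count"]].append(seq_loss(r["text"]))
    # Lower mean loss at higher duplication counts is the memorization
    # signal a loss-threshold membership-inference attack picks up on.
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```

Perturbations with `dup_count == 0` were never trained on, so they act as the non-member control that the loss threshold is calibrated against.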
Hubble enables a wide range of memorization research. Analyzing the inserted biographies 🧑‍💼 alone yields rich insights, e.g. revealing how readily different types of PII are memorized.

And there’s a lot more: book passages 📚, paraphrases 🔁, chat logs 💬, and test sets 🎯
October 24, 2025 at 6:21 PM
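
To make the PII probe concrete, a minimal extraction check might look like this (the repo id and the example record are made-up placeholders, not our released API):

```python
# Sketch: prompt with a biography prefix and test whether the model
# completes it with the true PII value. Repo id is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "allegro-lab/hubble-1b-100b-perturbed"  # placeholder name
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(REPO).eval()

@torch.no_grad()
def pii_extracted(prefix: str, pii_value: str, max_new_tokens: int = 32) -> bool:
    """Greedy-decode a continuation and check for verbatim PII leakage."""
    ids = tok(prefix, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    completion = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    return pii_value in completion

# e.g. pii_extracted("Jane Doe was born on", "March 3, 1984")  # made-up record
```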
Besides our core runs, we release several collections (a sketch of the timing measurement follows below):
• 🔀 Interference runs (confirming that perturbations minimally interfere across domains)
• ⏱️ Timing runs (confirming that perturbations inserted later in pretraining are memorized more strongly)
• ✍️ Paraphrased runs (trained on paraphrased perturbations)
October 24, 2025 at 6:21 PM
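
For the timing runs, the measurement is the same loss comparison, just grouped by insertion step (the repo id and the `insertion_step` field are placeholders):

```python
# Sketch: mean loss on perturbations grouped by when they were inserted
# during pretraining. Repo id and `insertion_step` are placeholders.
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "allegro-lab/hubble-1b-100b-timing"  # placeholder name
tok = AutoTokenizer.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(REPO).eval()

@torch.no_grad()
def mean_loss_by_insertion_step(records):
    """records: [{"text": str, "insertion_step": int}, ...]"""
    buckets = defaultdict(list)
    for r in records:
        ids = tok(r["text"], return_tensors="pt").input_ids
        buckets[r["insertion_step"]].append(model(ids, labels=ids).loss.item())
    # Expected pattern: later insertion steps -> lower loss at the end of
    # pretraining, i.e. stronger memorization of recently seen perturbations.
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```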
🪐 Our core release is 8 runs:
2 data conditions (standard, perturbed) × 2 model sizes (1B, 8B) × 2 pretraining sizes (100B, 500B tokens).

They establish *dilution* as a best practice for broadly addressing memorization risks: sensitive data can be diluted by scaling up the training corpus!
October 24, 2025 at 6:21 PM
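
The grid is small enough to enumerate; here's a sketch with a placeholder repo-id pattern (not the released naming scheme):

```python
# Sketch: enumerate the 2 x 2 x 2 core grid. The repo-id pattern below is
# a placeholder, not the released naming scheme.
from itertools import product

CONDITIONS = ("standard", "perturbed")
MODEL_SIZES = ("1b", "8b")
TOKEN_BUDGETS = ("100b", "500b")

for cond, size, tokens in product(CONDITIONS, MODEL_SIZES, TOKEN_BUDGETS):
    print(f"allegro-lab/hubble-{size}-{tokens}-{cond}")  # 8 runs total

# Dilution, informally: the perturbed 100B and 500B runs contain the same
# inserted texts, so each one is a ~5x smaller fraction of the 500B corpus,
# and memorization metrics should weaken accordingly as the corpus scales.
```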