We are grateful to everyone who provided support throughout the project! 🥹
Paper 🔗: arxiv.org/abs/2510.19811
Models 🤗: huggingface.co/allegrolab
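If you want to try the models, here is a minimal loading sketch with 🤗 transformers. The model id below is a placeholder, not necessarily a real checkpoint name; check the allegrolab org page for the actual ids.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id -- see huggingface.co/allegrolab for the real names.
model_id = "allegrolab/hubble-1b-100b-standard"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```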
Thank you for your commitment to open-source science!
We show that Hubble is an ideal benchmark for membership inference and unlearning, and we invite the community to explore it further and build on it ✨
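To give a taste, here is a minimal loss-based membership inference baseline (a common starting point, not the paper's exact protocol): text seen during pretraining tends to get lower loss than comparable unseen text. The model id is again a hypothetical placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allegrolab/hubble-1b-100b-perturbed"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def sequence_loss(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# Lower loss on a candidate than on comparable non-member text suggests membership.
candidate = "A passage suspected to be in the pretraining data..."
reference = "A comparable passage known to be outside the pretraining data..."
print("candidate loss:", sequence_loss(candidate))
print("reference loss:", sequence_loss(reference))
```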
And there’s a lot more: book passages 📚, paraphrases 🔁, chat logs 💬, and test sets 🎯
• 🔀 Interference runs (confirming that perturbations minimally interfere across domains)
• ⏱️ Timing runs (confirming that perturbations inserted later in pretraining are memorized more strongly)
• ✍️ Paraphrased runs (trained on paraphrased perturbations)
2 data conditions (standard, perturbed) × 2 model sizes (1B, 8B) × 2 pretraining sizes (100B, 500B tokens).
They establish *dilution* as a best practice for broadly addressing memorization risks: sensitive data can be diluted by scaling up the training corpus!
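A back-of-the-envelope illustration of the dilution idea (made-up numbers, not results from the paper): the same amount of sensitive text makes up a smaller fraction of a larger pretraining corpus.

```python
# Toy dilution arithmetic with assumed numbers -- purely illustrative.
copies_of_passage = 10    # duplicates of the sensitive text in the corpus
passage_tokens = 1_000    # tokens per copy

for corpus_tokens in (100e9, 500e9):
    share = copies_of_passage * passage_tokens / corpus_tokens
    print(f"{corpus_tokens / 1e9:.0f}B-token corpus -> sensitive share = {share:.2e}")
```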