And there’s a lot more: book passages 📚, paraphrases 🔁, chat logs 💬, and test sets 🎯
2 data conditions (standard, perturbed) × 2 model sizes (1B, 8B params) × 2 pretraining sizes (100B, 500B tokens).
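For concreteness, that grid works out to eight pretraining runs. A minimal sketch of enumerating it, assuming these condition labels (they paraphrase the axes above rather than quoting the paper's exact names):

```python
from itertools import product

# Hypothetical labels for the three experimental axes described above.
data_conditions = ["standard", "perturbed"]
model_sizes = ["1B", "8B"]
pretraining_sizes = ["100B", "500B"]

# 2 x 2 x 2 = 8 pretraining runs in total.
for data, model, tokens in product(data_conditions, model_sizes, pretraining_sizes):
    print(f"run: data={data}, model={model}, pretraining={tokens} tokens")
```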
They establish *dilution* as a best practice to broadly address memorization risks — sensitive data can be diluted by scaling up the training corpus!
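As a back-of-the-envelope illustration of why dilution helps (my own arithmetic, not a figure from the paper): holding the amount of inserted sensitive text fixed, its share of the corpus shrinks as the corpus grows, so the model sees it proportionally less often.

```python
# Illustrative numbers only, not taken from the paper.
sensitive_tokens = 1_000_000  # fixed amount of inserted sensitive text

for corpus_tokens in (100e9, 500e9):
    fraction = sensitive_tokens / corpus_tokens
    print(f"{corpus_tokens / 1e9:.0f}B-token corpus: sensitive share = {fraction:.2e}")

# Going from 100B to 500B tokens cuts the sensitive share 5x,
# which is the intuition behind dilution as a mitigation.
```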
Pretrained 1B/8B param models, with controlled insertion of texts designed to emulate key memorization risks: copyright (e.g., book passages), privacy (e.g., synthetic biographies), and test set contamination.
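A minimal sketch of what "controlled insertion" could look like in a data pipeline (the function and document names here are hypothetical, not the authors' code): each probe document is placed into the pretraining stream a known number of times, so memorization can later be measured against that exposure count.

```python
import random

def insert_probes(corpus_docs, probe_docs, copies_per_probe, seed=0):
    """Return a shuffled corpus with each probe document repeated a known number of times.

    corpus_docs: ordinary pretraining documents (strings)
    probe_docs: texts emulating memorization risks (book passages, synthetic bios, test items)
    copies_per_probe: how many times each probe appears, so exposure is controlled
    """
    mixed = list(corpus_docs)
    for doc in probe_docs:
        mixed.extend([doc] * copies_per_probe)
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy usage with placeholder documents.
corpus = [f"web document {i}" for i in range(1000)]
probes = ["book passage ...", "synthetic biography ...", "benchmark test item ..."]
pretraining_stream = insert_probes(corpus, probes, copies_per_probe=4)
```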