We are grateful to everyone who provided support throughout the project! 🥹
Paper 🔗: arxiv.org/abs/2510.19811
Models 🤗: huggingface.co/allegrolab
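If you want to try the models, here is a minimal loading sketch with 🤗 transformers. The model id below is a placeholder, not necessarily a real checkpoint name; check the allegrolab org page for the actual ids.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model id -- see huggingface.co/allegrolab for the real names.
model_id = "allegrolab/hubble-1b-100b-standard"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The quick brown fox"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```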
Thank you for your commitment to open-source science!
We show that Hubble is an ideal benchmark for membership inference and unlearning, and we invite the community to explore it further and build on it ✨
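To give a taste, here is a minimal loss-based membership inference baseline (a common starting point, not the paper's exact protocol): text seen during pretraining tends to get lower loss than comparable unseen text. The model id is again a hypothetical placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allegrolab/hubble-1b-100b-perturbed"  # hypothetical id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

def sequence_loss(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

# Lower loss on a candidate than on comparable non-member text suggests membership.
candidate = "A passage suspected to be in the pretraining data..."
reference = "A comparable passage known to be outside the pretraining data..."
print("candidate loss:", sequence_loss(candidate))
print("reference loss:", sequence_loss(reference))
```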
And there’s a lot more: book passages 📚, paraphrases 🔁, chat logs 💬, and test sets 🎯
• 🔀 Interference runs (confirming that perturbations minimally interfere across domains)
• ⏱️ Timing runs (confirming that perturbations inserted later in pretraining are memorized more strongly)
• ✍️ Paraphrased runs (trained on paraphrased perturbations)
2 data conditions (standard, perturbed) × 2 model sizes (1B, 8B) × 2 pretraining sizes (100B, 500B tokens).
They establish *dilution* as a best practice for broadly addressing memorization risks: sensitive data can be diluted by scaling up the training corpus!
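A back-of-the-envelope illustration of the dilution idea (made-up numbers, not results from the paper): the same amount of sensitive text makes up a smaller fraction of a larger pretraining corpus.

```python
# Toy dilution arithmetic with assumed numbers -- purely illustrative.
copies_of_passage = 10    # duplicates of the sensitive text in the corpus
passage_tokens = 1_000    # tokens per copy

for corpus_tokens in (100e9, 500e9):
    share = copies_of_passage * passage_tokens / corpus_tokens
    print(f"{corpus_tokens / 1e9:.0f}B-token corpus -> sensitive share = {share:.2e}")
```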