And there’s a lot more: book passages 📚, paraphrases 🔁, chat logs 💬, and test sets 🎯
2 data conditions (standard, perturbed) × 2 model sizes (1B, 8B params) × 2 pretraining sizes (100B, 500B tokens).
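For concreteness, that grid works out to eight pretraining runs. A minimal sketch of enumerating it, assuming these condition labels (they paraphrase the axes above rather than quoting the paper's exact names):

```python
from itertools import product

# Hypothetical labels for the three experimental axes described above.
data_conditions = ["standard", "perturbed"]
model_sizes = ["1B", "8B"]
pretraining_sizes = ["100B", "500B"]

# 2 x 2 x 2 = 8 pretraining runs in total.
for data, model, tokens in product(data_conditions, model_sizes, pretraining_sizes):
    print(f"run: data={data}, model={model}, pretraining={tokens} tokens")
```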
They establish *dilution* as a best practice to broadly address memorization risks — sensitive data can be diluted by scaling up the training corpus!
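As a back-of-the-envelope illustration of why dilution helps (my own arithmetic, not a figure from the paper): holding the amount of inserted sensitive text fixed, its share of the corpus shrinks as the corpus grows, so the model sees it proportionally less often.

```python
# Illustrative numbers only, not taken from the paper.
sensitive_tokens = 1_000_000  # fixed amount of inserted sensitive text

for corpus_tokens in (100e9, 500e9):
    fraction = sensitive_tokens / corpus_tokens
    print(f"{corpus_tokens / 1e9:.0f}B-token corpus: sensitive share = {fraction:.2e}")

# Going from 100B to 500B tokens cuts the sensitive share 5x,
# which is the intuition behind dilution as a mitigation.
```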
Pretrained 1B/8B param models, with controlled insertion of texts designed to emulate key memorization risks: copyright (e.g., book passages), privacy (e.g., synthetic biographies), and test set contamination.
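A minimal sketch of what "controlled insertion" could look like in a data pipeline (the function and document names here are hypothetical, not the authors' code): each probe document is placed into the pretraining stream a known number of times, so memorization can later be measured against that exposure count.

```python
import random

def insert_probes(corpus_docs, probe_docs, copies_per_probe, seed=0):
    """Return a shuffled corpus with each probe document repeated a known number of times.

    corpus_docs: ordinary pretraining documents (strings)
    probe_docs: texts emulating memorization risks (book passages, synthetic bios, test items)
    copies_per_probe: how many times each probe appears, so exposure is controlled
    """
    mixed = list(corpus_docs)
    for doc in probe_docs:
        mixed.extend([doc] * copies_per_probe)
    random.Random(seed).shuffle(mixed)
    return mixed

# Toy usage with placeholder documents.
corpus = [f"web document {i}" for i in range(1000)]
probes = ["book passage ...", "synthetic biography ...", "benchmark test item ..."]
pretraining_stream = insert_probes(corpus, probes, copies_per_probe=4)
```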