Rachel Hong
rachelhong.bsky.social
PhD student at University of Washington
machine learning fairness, algorithmic bias, dataset audits, data privacy, tech policy. she/her
Super excited and thankful to have Tech Review feature our work!
Millions of images of passports, credit cards, birth certificates, and other documents containing personally identifiable information are likely included in one of the biggest open-source AI training sets, new research has found.
A major AI training data set contains millions of examples of personal data
Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models.
www.technologyreview.com
July 18, 2025 at 3:51 PM
New paper alert! In a collaboration between computer scientists and legal scholars, we find a significant amount of PII in a common AI training dataset and conduct a legal analysis showing that these issues put web-scale datasets in tension with existing privacy law. [🧵1/N] arxiv.org/abs/2506.17185
A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset
We investigate the contents of web-scraped data for training AI systems, at sizes where human dataset curators and compilers no longer manually annotate every sample. Building off of prior privacy con...
arxiv.org
June 30, 2025 at 9:15 PM