Rachel Hong
rachelhong.bsky.social
PhD student at University of Washington
machine learning fairness, algorithmic bias, dataset audits, data privacy, tech policy. she/her
The Wayback Machine tracks the earliest recorded timestamp of a subset of images with non-blurred faces. We find a significant portion existed before 2020, raising the question of how anyone could have consented to this use of their personal data before the rise of large AI systems [9/N]
June 30, 2025 at 9:15 PM
We link resumes to online profiles (like LinkedIn) and estimate that at least 142K samples (out of 12.8B) depict resumes of individuals with a public online presence. We annotate the presence of personal data in resumes (those with online profiles), split by geographic region below [7/N]
June 30, 2025 at 9:15 PM
Several common websites in DataComp no longer have images available to download, though the images did exist at the time of curation. Upon inspection of download errors, we find that some are “Forbidden” errors due to a lack of permissions to access the image [6/N]
June 30, 2025 at 9:15 PM
Some samples reveal names and faces linked to demographic and children’s information (see paraphrased examples below). Many come from news sites, where someone may have disclosed the information for an article, rather than consenting to their data being used to train a model [5/N]
June 30, 2025 at 9:15 PM
🌳 DataComp CommonPool is an image dataset crawled from the web, following LAION-5B (taken down in Dec 2023 for illegal material). DataComp has been downloaded ≥2M times (!), meaning a huge number of downstream dataset users and model users (i.e. the leaves) rely on one source [4/N]
June 30, 2025 at 9:15 PM