Sauvik Das
sauvik.me
I work on human-centered {security|privacy|computing}. Associate Professor (w/o tenure) at @hcii.cmu.edu. Director of the SPUD (Security, Privacy, Usability, and Design) Lab. Non-Resident Fellow @cendemtech.bsky.social
Say hi to both if you're there!

Check out the harm reporting paper here: www.sauvik.me/papers/56/s...

Check out the AI self-disclosure assistance paper here:
www.sauvik.me/papers/58/s...
October 20, 2025 at 1:49 PM
I don't love that they're going to use AntAIfa to process receipts, but I get it
October 19, 2025 at 2:18 AM
So are you telling me that if people are no longer afraid of whether they'll be able to afford housing, food, and other basic necessities, they will actually be willing and able to…buy things?
October 7, 2025 at 1:39 AM
upsell.feedback.gtfo
October 6, 2025 at 8:02 PM
3) Finally, there is very, very little documentation associated with these datasets, which made this audit much harder than it needed to be. To help improve documentation practices, we extended datasheets for datasets w/ audio-specific questions.
September 27, 2025 at 9:05 PM
2) Most datasets pay little attention to representation — with the exception being Mozilla Common Voice. So, unsurprisingly, most audio data is in English and there is little attempt to ensure vocal representation from a broad set of individuals.
September 27, 2025 at 9:05 PM
1) While there is a lot of data that may be copyrighted, to circumvent copyright issues, some datasets just comprise a lot of "old" audio data, e.g., sentences read from old newspapers and books that are now in the public domain.
September 27, 2025 at 9:05 PM
Our audit was broad: we included sound, voice, and music. We explored content, audio quality, language representation, toxicity, bias, and licensing adherence. Lots to unpack but three key findings:
September 27, 2025 at 9:05 PM
ML models are only as good as the data they are trained on, and there is understandably a lot of concern around how the data that powers these models is sourced.

Through a broad review of recent gen audio papers, we identified the most commonly used datasets and audited them.
September 27, 2025 at 9:05 PM
Large audio models power a broad suite of new applications: they can continue unfinished audio, clone voices, provide an expressive range of text-to-speech voices, and can even create entire songs from simple text-based prompts. But what are they trained on?
September 27, 2025 at 9:05 PM
@kyzyl.me will be presenting this at the Privacy session at #UIST2025 next Wednesday!

programs.sigchi.org/uist/2025/pr...

Please check it out if you'll be there :)
September 26, 2025 at 1:58 PM
Importantly, we found that Imago Obscura helps *address* privacy risks without impacting sharing intent.

People believed it *greatly* reduced privacy risks for images they previously wanted to share but did not share for privacy reasons, with no difference in sharing intent.
September 26, 2025 at 1:58 PM
In a summative evaluation, we found that Imago Obscura effectively improved users' awareness of / motivation to address / ability to address key privacy risks in images they wanted to share online.
September 26, 2025 at 1:58 PM