VP and Distinguished Scientist at Microsoft Research NYC. AI evaluation and measurement, responsible AI, computational social science, machine learning. She/her.
1) (Tomorrow!) Wed 7/16, 11am-1:30pm PT: poster for "Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge" (E. Exhibition Hall A-B, E-503)
I also want to note that this paper has been in progress for many, many years, so we're super excited it's finally being published. It's also one of the most genuinely interdisciplinary projects I've ever worked on, which has made it particularly challenging and rewarding!!! ❤️
June 16, 2025 at 9:49 PM
Check out the camera-ready version of our ACL Findings paper ("Taxonomizing Representational Harms using Speech Act Theory") to learn more!!! arxiv.org/pdf/2504.00928
Why does this matter? You can't mitigate what you can't measure, and our framework and taxonomy help researchers and practitioners design better ways to measure and mitigate representational harms caused by generative language systems.
June 16, 2025 at 9:49 PM
Using this theoretical grounding, we provide new definitions for stereotyping, demeaning, and erasure, and break them down into a detailed taxonomy of system behaviors. By doing this, we unify many of the different ways representational harms have been previously defined.
June 16, 2025 at 9:49 PM
We bring some much-needed clarity by turning to speech act theory—a theory of meaning from linguistics that allows us to distinguish between a system output’s purpose and its real-world impacts.
June 16, 2025 at 9:49 PM
These are often called “representational harms,” and while they’re easy for people to recognize when they see them, definitions of these harms are commonly under-specified, leading to conceptual confusion. This makes them hard to measure and even harder to mitigate.
June 16, 2025 at 9:49 PM
Check out the camera-ready version of our ICML position paper ("Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge") to learn more!!! arxiv.org/abs/2502.00561
Real talk: GenAI systems aren't toys. Bad evaluations don't just waste people's time—they can cause real-world harms. It's time to level up, ditch the apples-to-oranges comparisons, and start doing measurement like we mean it.
(5/6)
June 15, 2025 at 12:20 AM
We propose a framework that cuts through the chaos: first, get crystal clear on what you're measuring and why (no more vague hand-waving); then, figure out how to measure it; and, throughout the process, interrogate validity like your reputation depends on it—because, honestly, it should.
(4/6)
June 15, 2025 at 12:20 AM
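To make that three-part flow concrete, here is a minimal sketch in Python. The MeasurementPlan class, its fields, and the concept_is_specific check are illustrative assumptions of mine, not an API or method from the paper:

# Illustrative sketch of the flow described above: state what you're
# measuring and why, say how, and interrogate validity throughout.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class MeasurementPlan:
    concept: str        # what you're measuring (systematized, not vague)
    purpose: str        # why you're measuring it
    instrument: str     # how you'll measure it
    # Validity checks are carried with the plan so they can be run
    # throughout the process, not just once at the end.
    validity_checks: List[Callable[["MeasurementPlan"], bool]] = field(default_factory=list)

    def interrogate_validity(self) -> bool:
        """Every recorded validity check must pass for the plan to stand."""
        return all(check(self) for check in self.validity_checks)

# Toy check: treat one- or two-word concepts as vague hand-waving.
def concept_is_specific(plan: MeasurementPlan) -> bool:
    return len(plan.concept.split()) > 2

plan = MeasurementPlan(
    concept="rate of demeaning outputs about a named social group",
    purpose="compare two system versions before deployment",
    instrument="annotator ratings on a fixed prompt set",
    validity_checks=[concept_is_specific],
)
assert plan.interrogate_validity()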
This program is open to candidates who will have completed their bachelor's degree (or equivalent) by Summer 2025 (including those who graduated previously and have been working or doing a master's degree) and who want to advance their research skills before applying to PhD programs.
May 20, 2025 at 1:47 PM
At the #HEAL workshop, I'll present "Systematizing During Measurement Enables Broader Stakeholder Participation" on the ways we can further structure LLM evaluations and open them for deliberation. A project led by @hannawallach.bsky.social
April 25, 2025 at 10:57 PM