Stephen Pfohl
@stephenpfohl.bsky.social
Research scientist at Google. Previously Stanford Biomedical Informatics. Researching #fairness #equity #robustness #transparency #causality #healthcare
It looks like they updated the article with a correction!
November 3, 2025 at 10:32 PM
This is not true (and I'm surprised by the bad reporting here from 404). arXiv is no longer accepting *review papers* unless they are peer reviewed. This has no effect on the submission of research articles. See the original post: blog.arxiv.org/2025/10/31/a....
Attention Authors: Updated Practice for Review Articles and Position Papers in arXiv CS Category – arXiv blog
November 3, 2025 at 6:32 PM
This work was a collaboration with Natalie Harris, Chirag Nagpal, David Madras, Vishwali Mhasawade, Olawale Salaudeen, @adoubleva.bsky.social, Shannon Sequeira, Santiago Arciniegas, Lillian Sung, Nnamdi Ezeanochie, Heather Cole-Lewis, @kat-heller.bsky.social, Sanmi Koyejo, Alexander D'Amour.
October 28, 2025 at 12:36 AM
2. downstream context (the fairness or equity implications that a model has when used as a component of a policy/intervention in a specific context).
October 28, 2025 at 12:36 AM
1. upstream context (e.g., understanding the role of social and structural determinants of disparities and their impact on selection, measurement, and problem formulation)
October 28, 2025 at 12:36 AM
We advocate for an approach that uses interdisciplinary expertise and domain knowledge to ground the analytic approach to model evaluation in both:
October 28, 2025 at 12:36 AM
Beyond characterization of modeling implications, we argue that fairness (as well as related concepts such as equity or justice) is best understood not as a property of a model, but rather as a property of a policy or intervention that leverages the model in a specific sociotechnical context.
October 28, 2025 at 12:36 AM
3. We provide evaluation methodology, based on controlling for confounding and on conditional independence testing, that complements standard disaggregated evaluation and gives insight into why model performance differs across subgroups.
October 28, 2025 at 12:36 AM
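A minimal sketch of what point 3 can look like in practice (entirely synthetic data and hypothetical variable names, not the exact methodology from our paper): compare raw disaggregated error rates with error rates stratified on a confounder, then check whether subgroup membership still predicts the model's errors once the confounder is conditioned on.

# Sketch: disaggregated evaluation with stratification on a confounder,
# plus a simple regression-based conditional independence check.
# Synthetic data and variable names are illustrative, not from the paper.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20_000
group = rng.integers(0, 2, n)                    # subgroup indicator A
severity = rng.normal(group * 0.5, 1.0, n)       # confounder X: distribution differs by group
y = (severity + rng.normal(0, 1.0, n)) > 0.5     # outcome Y depends on X, not on A directly
score = severity + rng.normal(0, 0.7, n)         # model score uses X only
pred = score > 0.5

df = pd.DataFrame({"group": group, "severity": severity,
                   "y": y.astype(int), "error": (pred != y).astype(int)})

# 1) Standard disaggregated evaluation: error rate per subgroup.
print(df.groupby("group")["error"].mean())

# 2) Stratified (confounder-adjusted) evaluation: error rate per subgroup
#    within coarse severity strata. Differences often shrink once X is held fixed.
df["severity_bin"] = pd.qcut(df["severity"], 4, labels=False)
print(df.groupby(["severity_bin", "group"])["error"].mean().unstack())

# 3) Simple parametric conditional independence check: regress the error
#    indicator on the confounder and the subgroup indicator; if errors are
#    independent of subgroup given the confounder, the "group" coefficient
#    should be near zero. (In practice, adjust for the confounder flexibly,
#    e.g., with splines or bins; the linear term here is only for brevity.)
X = sm.add_constant(df[["severity", "group"]].astype(float))
fit = sm.Logit(df["error"], X).fit(disp=False)
print(fit.summary().tables[1])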
2. Observing model performance differences thus motivates deeper investigation to understand the causes of distributional differences across subgroups and to disambiguate them from observational biases (e.g., selection bias) and from model estimation error.
October 28, 2025 at 12:36 AM
A few concrete practical takeaways:

1. Our results show that if we want to model outcomes well when those outcomes are disparate across subgroups, we should not, in general, expect parity in model performance across subgroups.
October 28, 2025 at 12:36 AM
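As a toy illustration of takeaway 1 (my own synthetic example, not a result from the paper): even the Bayes-optimal predictor E[Y | X] has unequal mean squared error across subgroups whenever the groups differ in how much of the outcome is unexplainable from X.

# Toy illustration (synthetic, not from the paper): the Bayes-optimal
# regressor E[Y | X] still has unequal MSE across subgroups when the
# residual (unexplainable) variance of the outcome differs by subgroup.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
group = rng.integers(0, 2, n)
x = rng.normal(0, 1, n)
noise_sd = np.where(group == 0, 0.5, 1.5)        # group 1 has noisier outcomes
y = 2.0 * x + rng.normal(0, noise_sd)

y_opt = 2.0 * x                                  # the Bayes-optimal prediction E[Y | X]
for g in (0, 1):
    mse = np.mean((y[group == g] - y_opt[group == g]) ** 2)
    print(f"group {g}: MSE of the optimal predictor = {mse:.2f}")
# Roughly 0.25 vs 2.25: parity fails even with a perfect model.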
3. How do model performance and fairness properties change under different assumptions about the data generating process (reflecting different causal processes and structural causes of disparity) and about mechanisms of selection bias (which render the data misrepresentative of the ideal target population)?
October 28, 2025 at 12:36 AM
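To make the selection-bias part of question 3 concrete, here is a small synthetic sketch (my own setup, not the paper's): evaluate one fixed model on the full target population and then on a sample in which one subgroup is mostly observed when the outcome is positive; the disaggregated accuracy for that subgroup shifts even though the model has not changed.

# Synthetic sketch (not the paper's setup): outcome-dependent selection in one
# subgroup makes the evaluation data misrepresentative of the target population,
# shifting the disaggregated metric for that subgroup.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
group = rng.integers(0, 2, n)
x = rng.normal(0, 1, n)
y = (x + rng.normal(0, 1, n) > 0).astype(int)
# One fixed decision rule, identical across groups; the threshold sits away
# from the Bayes-optimal one so that errors concentrate on y = 1 cases.
pred = (x > 0.8).astype(int)

def accuracy_by_group(mask):
    return [np.mean((pred == y)[mask & (group == g)]) for g in (0, 1)]

everyone = np.ones(n, dtype=bool)
print("target population accuracy:", accuracy_by_group(everyone))

# Selection mechanism: group 1 is observed mostly when y = 1 (e.g., only the
# sickest patients in that group reach the clinic where labels are recorded).
keep = (group == 0) | (y == 1) | (rng.random(n) < 0.2)
print("selected-sample accuracy:  ", accuracy_by_group(keep))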
2. When and why do models that explicitly use subgroup membership information for prediction behave differently from those that do not?
October 28, 2025 at 12:36 AM
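A small sketch of the kind of comparison behind question 2 (synthetic data and my own illustrative setup, using scikit-learn rather than anything from the paper): fit one model with the subgroup indicator as a feature and one without, on data where the same feature value maps to different outcome probabilities in the two groups, then compare disaggregated log-loss.

# Synthetic sketch (my own setup, not the paper's): compare a model that uses
# the subgroup indicator as a feature against one that does not, on data where
# the feature-to-outcome relationship differs by subgroup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)
n = 50_000
group = rng.integers(0, 2, n)
x = rng.normal(0, 1, n)
# The same feature value implies a different outcome probability per subgroup.
logit = x + np.where(group == 1, 1.0, -1.0)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_with = np.column_stack([x, group])
X_without = x.reshape(-1, 1)

m_with = LogisticRegression().fit(X_with, y)
m_without = LogisticRegression().fit(X_without, y)

for g in (0, 1):
    idx = group == g
    ll_with = log_loss(y[idx], m_with.predict_proba(X_with[idx])[:, 1])
    ll_without = log_loss(y[idx], m_without.predict_proba(X_without[idx])[:, 1])
    print(f"group {g}: log-loss with group feature = {ll_with:.3f}, without = {ll_without:.3f}")
# In this setup the group-blind model is systematically miscalibrated within
# each subgroup; whether that matters depends on how the scores are used downstream.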
A few of the key questions that we grappled with in this work included:

1. Why do models that predict outcomes well (even optimally) for all subgroups still exhibit systematic differences in performance across subgroups?
October 28, 2025 at 12:36 AM
To summarize, we took a deep dive into some of the more challenging conceptual issues that arise when evaluating machine learning models across subgroups, as is typically done to assess fairness or robustness.
October 28, 2025 at 12:36 AM
This is (unfortunately) the required style for Nature journals
December 20, 2024 at 3:11 PM