Michael Oberst
@moberst.bsky.social
Assistant Prof. of CS at Johns Hopkins
Visiting Scientist at Abridge AI
Causality & Machine Learning in Healthcare
Prev: PhD at MIT, Postdoc at CMU
For more details, see the paper / poster!
And if you're at UAI, check out the talk and poster today! Jacob (not on social media) and I are around at UAI, so reach out if you're interested in chatting more!
Paper: arxiv.org/abs/2502.09467
Poster: www.michaelkoberst.com/assets/paper...
July 23, 2025 at 2:10 PM
These findings are also relevant for the design of new trials!
For instance, deploying *multiple models* in a trial has two benefits: (1) it allows us to construct tighter bounds for new models, and (2) it allows us to test whether these assumptions hold in practice.
July 23, 2025 at 2:10 PM
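As a hypothetical illustration of benefit (2): with two models deployed in the same trial, you can probe the agreement-based assumption described in the next post, by comparing outcomes on patients where both models gave the same output. A rough sketch of such a check (the variable names and the simple two-sample test are illustrative, not the procedure from the paper):

```python
# Hypothetical falsification check enabled by deploying two models in one trial:
# on patients where the two models agreed, outcomes under the higher-performing
# model should be at least as good. A clearly worse observed mean would cast
# doubt on that assumption. Names and the Welch t-test are my own illustration.

import numpy as np
from scipy import stats

def agreement_check(outcomes_better_model, outcomes_worse_model):
    """One-sided p-value for H0: mean outcome under the higher-performing model
    is at least as large as under the other model, on the agreement subset."""
    t, p_two_sided = stats.ttest_ind(outcomes_better_model, outcomes_worse_model,
                                     equal_var=False)
    return p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2

# Example usage with simulated outcomes (larger = better):
rng = np.random.default_rng(0)
p = agreement_check(rng.normal(1.0, 1.0, 200), rng.normal(0.9, 1.0, 200))
print(f"one-sided p-value: {p:.3f}")  # small values would falsify the assumption
```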
We make some other mild assumptions, which can be falsified using existing RCT data. For instance, if two models have the *same* output on a given patient, then we assume outcomes are at least as good under the model with higher performance.
July 23, 2025 at 2:10 PM
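For concreteness, one way that agreement-based assumption might be written (a paraphrase of the post above; the notation is mine, not necessarily the paper's):

```latex
% Write Y(a, m) for the potential outcome when a clinician acts on output a from a
% model with performance characteristics m, with larger Y meaning better outcomes.
% If models f and f' give the same output on a patient x, and f has (weakly) better
% performance characteristics, outcomes under f are assumed to be at least as good:
\[
  f(x) = f'(x) \quad \text{and} \quad m_f \succeq m_{f'}
  \;\;\Longrightarrow\;\;
  \mathbb{E}\bigl[\, Y(f(x), m_f) \mid X = x \,\bigr]
  \;\ge\;
  \mathbb{E}\bigl[\, Y(f'(x), m_{f'}) \mid X = x \,\bigr].
\]
```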
To capture these challenges, we assume that model impact is mediated by both the output of the model (A) and its performance characteristics (M).
This formalism allows us to start reasoning about the impact of new models with different outputs and performance characteristics.
July 23, 2025 at 2:10 PM
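A rough sketch of what that mediation formalism could look like (illustrative notation, not necessarily the paper's):

```latex
% Let Y(a, m) denote the potential outcome when the clinician acts on output a
% from a model with performance characteristics m. The impact of deploying a
% model f, with characteristics m_f, on patients X is then
\[
  V(f) \;=\; \mathbb{E}\bigl[\, Y\bigl(f(X),\, m_f\bigr) \,\bigr],
\]
% so a new model can differ from the trial models through either argument:
% its outputs f(X) or its performance characteristics m_f.
```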
The second challenge is trust: Impact depends on the actions of human decision-makers, and those decision-makers may treat two models differently based on their performance characteristics (e.g., if a model produces a lot of false alarms, clinicians may ignore the outputs).
July 23, 2025 at 2:10 PM
We tackle two non-standard challenges that arise in this setting: *coverage* and *trust*.
The first challenge is coverage: If the new model is very different from previous models, it may produce outputs (for specific types of inputs) that were never observed in the trial.
July 23, 2025 at 2:10 PM
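To picture the coverage issue, here is a toy check (entirely hypothetical names and data, not code from the paper) for whether a new model produces (patient stratum, output) combinations that no trial model ever produced:

```python
# Toy illustration of the coverage issue: a new model may produce
# (patient stratum, output) pairs that no model in the trial ever produced,
# and for those pairs the RCT data alone cannot pin down the model's impact.
# All names and data below are hypothetical.

def observed_pairs(trial_records):
    """Collect the (stratum, output) pairs seen under the models deployed in the RCT."""
    return {(rec["stratum"], rec["model_output"]) for rec in trial_records}

def uncovered_pairs(new_model, patients, seen):
    """(stratum, output) pairs the new model would produce that were never observed."""
    return {(p["stratum"], new_model(p)) for p in patients} - seen

# Example usage with made-up data:
trial_records = [
    {"stratum": "low_risk", "model_output": 0},
    {"stratum": "high_risk", "model_output": 1},
]
patients = [{"stratum": "low_risk"}, {"stratum": "high_risk"}]
always_alert = lambda p: 1  # a new model that alerts on everyone
print(uncovered_pairs(always_alert, patients, observed_pairs(trial_records)))
# -> {('low_risk', 1)}: the new model alerts on low-risk patients, which no trial model did.
```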
We develop a method for placing bounds on the impact of a *new* ML model by reusing data from an RCT that did not include the model.
These bounds require some mild assumptions, but those assumptions can be tested in practice using RCT data that includes multiple models.
July 23, 2025 at 2:10 PM
Hard to have a graded quiz, but still useful as an ungraded “self-assessment” (which I’ve seen) to set expectations for the kind of prereqs assumed. In some courses, you might expect those who would be scared off to drop the course later in any case, esp. if the drop deadline is pretty late.
December 30, 2024 at 2:05 PM
From skimming the paper it seems more like the takeaway is: “if you binarize, you are estimating *something* that has a specific causal interpretation but it’s a weird thing (diff of two very specific treatment policies) you might not actually care about except in some special cases”
December 26, 2024 at 1:39 PM
I’d nominate @monicaagrawal.bsky.social
December 12, 2024 at 1:13 PM
@matt-levine.bsky.social has a great explanation in his Money Stuff newsletter (which I also highly recommend in general)
December 12, 2024 at 7:55 AM
An example of some recent work (my first last-author paper!) on rigorous re-evaluation of popular approaches to adapt LLMs and VLMs to the medical domain
bsky.app/profile/zach...
Medically adapted foundation models (think Med-*) turn out to be more hot air than hot stuff. Correcting for fatal flaws in evaluation, the current crop are no better on balance than generic foundation models, even on the very tasks for which benefits are claimed.
arxiv.org/abs/2411.04118
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?
November 27, 2024 at 4:03 PM
Joining the Group
Computer Science, Statistics, Causality, and Healthcare
www.michaelkoberst.com
November 27, 2024 at 3:58 PM
Would love to be added if possible, and would also nominate @monicaagrawal.bsky.social :)
November 24, 2024 at 10:06 PM
Self-nominating for this one! All things in moderation
November 22, 2024 at 10:32 PM
Would love to be added!
November 22, 2024 at 5:20 PM
Late to this, but would love to be added!
November 20, 2024 at 7:17 PM