Michael Oberst
@moberst.bsky.social
Assistant Prof. of CS at Johns Hopkins
Visiting Scientist at Abridge AI
Causality & Machine Learning in Healthcare
Prev: PhD at MIT, Postdoc at CMU
For more details, see the paper / poster!
And if you're at UAI, check out the talk and poster today! Jacob (not on social media) and I are around at UAI, so reach out if you're interested in chatting more!
Paper: arxiv.org/abs/2502.09467
Poster: www.michaelkoberst.com/assets/paper...
July 23, 2025 at 2:10 PM
These findings are also relevant for the design of new trials!
For instance, deploying *multiple models* in a trial has two benefits: (1) it allows us to construct tighter bounds for new models, and (2) it allows us to test whether these assumptions hold in practice.
July 23, 2025 at 2:10 PM
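As a hypothetical illustration of benefit (2): with two models deployed in the same trial, you can probe the agreement-based assumption described in the next post, by comparing outcomes on patients where both models gave the same output. A rough sketch of such a check (the variable names and the simple two-sample test are illustrative, not the procedure from the paper):

```python
# Hypothetical falsification check enabled by deploying two models in one trial:
# on patients where the two models agreed, outcomes under the higher-performing
# model should be at least as good. A clearly worse observed mean would cast
# doubt on that assumption. Names and the Welch t-test are my own illustration.

import numpy as np
from scipy import stats

def agreement_check(outcomes_better_model, outcomes_worse_model):
    """One-sided p-value for H0: mean outcome under the higher-performing model
    is at least as large as under the other model, on the agreement subset."""
    t, p_two_sided = stats.ttest_ind(outcomes_better_model, outcomes_worse_model,
                                     equal_var=False)
    return p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2

# Example usage with simulated outcomes (larger = better):
rng = np.random.default_rng(0)
p = agreement_check(rng.normal(1.0, 1.0, 200), rng.normal(0.9, 1.0, 200))
print(f"one-sided p-value: {p:.3f}")  # small values would falsify the assumption
```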
We make some other mild assumptions, which can be falsified using existing RCT data. For instance, if two models have the *same* output on a given patient, then we assume outcomes are at least as good under the model with higher performance.
July 23, 2025 at 2:10 PM
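For concreteness, one way that agreement-based assumption might be written (a paraphrase of the post above; the notation is mine, not necessarily the paper's):

```latex
% Write Y(a, m) for the potential outcome when a clinician acts on output a from a
% model with performance characteristics m, with larger Y meaning better outcomes.
% If models f and f' give the same output on a patient x, and f has (weakly) better
% performance characteristics, outcomes under f are assumed to be at least as good:
\[
  f(x) = f'(x) \quad \text{and} \quad m_f \succeq m_{f'}
  \;\;\Longrightarrow\;\;
  \mathbb{E}\bigl[\, Y(f(x), m_f) \mid X = x \,\bigr]
  \;\ge\;
  \mathbb{E}\bigl[\, Y(f'(x), m_{f'}) \mid X = x \,\bigr].
\]
```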
To capture these challenges, we assume that model impact is mediated by both the output of the model (A) and its performance characteristics (M).
This formalism allows us to start reasoning about the impact of new models with different outputs and performance characteristics.
July 23, 2025 at 2:10 PM
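A rough sketch of what that mediation formalism could look like (illustrative notation, not necessarily the paper's):

```latex
% Let Y(a, m) denote the potential outcome when the clinician acts on output a
% from a model with performance characteristics m. The impact of deploying a
% model f, with characteristics m_f, on patients X is then
\[
  V(f) \;=\; \mathbb{E}\bigl[\, Y\bigl(f(X),\, m_f\bigr) \,\bigr],
\]
% so a new model can differ from the trial models through either argument:
% its outputs f(X) or its performance characteristics m_f.
```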
The second challenge is trust: Impact depends on the actions of human decision-makers, and those decision-makers may treat two models differently based on their performance characteristics (e.g., if a model produces a lot of false alarms, clinicians may ignore the outputs).
July 23, 2025 at 2:10 PM
We tackle two non-standard challenges that arise in this setting: *coverage* and *trust*.
The first challenge is coverage: If the new model is very different from previous models, it may produce outputs (for specific types of inputs) that were never observed in the trial.
July 23, 2025 at 2:10 PM
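To picture the coverage issue, here is a toy check (entirely hypothetical names and data, not code from the paper) for whether a new model produces (patient stratum, output) combinations that no trial model ever produced:

```python
# Toy illustration of the coverage issue: a new model may produce
# (patient stratum, output) pairs that no model in the trial ever produced,
# and for those pairs the RCT data alone cannot pin down the model's impact.
# All names and data below are hypothetical.

def observed_pairs(trial_records):
    """Collect the (stratum, output) pairs seen under the models deployed in the RCT."""
    return {(rec["stratum"], rec["model_output"]) for rec in trial_records}

def uncovered_pairs(new_model, patients, seen):
    """(stratum, output) pairs the new model would produce that were never observed."""
    return {(p["stratum"], new_model(p)) for p in patients} - seen

# Example usage with made-up data:
trial_records = [
    {"stratum": "low_risk", "model_output": 0},
    {"stratum": "high_risk", "model_output": 1},
]
patients = [{"stratum": "low_risk"}, {"stratum": "high_risk"}]
always_alert = lambda p: 1  # a new model that alerts on everyone
print(uncovered_pairs(always_alert, patients, observed_pairs(trial_records)))
# -> {('low_risk', 1)}: the new model alerts on low-risk patients, which no trial model did.
```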
We develop a method for placing bounds on the impact of a *new* ML model by reusing data from an RCT that did not include the model.
These bounds require some mild assumptions, but those assumptions can be tested in practice using RCT data that includes multiple models.
July 23, 2025 at 2:10 PM
Hard to have a graded quiz, but still useful as an ungraded “self-assessment” (which I’ve seen) to set expectations for the kind of prereqs assumed. In some courses, you might expect those who would be scared off to drop the course later in any case, esp. if the drop deadline is pretty late.
December 30, 2024 at 2:05 PM
From skimming the paper it seems more like the takeaway is: “if you binarize, you are estimating *something* that has a specific causal interpretation but it’s a weird thing (diff of two very specific treatment policies) you might not actually care about except in some special cases”
December 26, 2024 at 1:39 PM
I’d nominate @monicaagrawal.bsky.social
December 12, 2024 at 1:13 PM
@matt-levine.bsky.social has a great explanation in his Money Stuff newsletter (which I also highly recommend in general)
December 12, 2024 at 7:55 AM
An example of some recent work (my first last-author paper!) on rigorous re-evaluation of popular approaches to adapt LLMs and VLMs to the medical domain
bsky.app/profile/zach...
Medically adapted foundation models (think Med-*) turn out to be more hot air than hot stuff. Correcting for fatal flaws in evaluation, the current crop are no better on balance than generic foundation models, even on the very tasks for which benefits are claimed.
arxiv.org/abs/2411.04118
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?
November 27, 2024 at 4:03 PM
Joining the Group
Computer Science, Statistics, Causality, and Healthcare
www.michaelkoberst.com
November 27, 2024 at 3:58 PM
Would love to be added if possible, and would also nominate @monicaagrawal.bsky.social :)
November 24, 2024 at 10:06 PM
Self-nominating for this one! All things in moderation
November 22, 2024 at 10:32 PM
Would love to be added!
November 22, 2024 at 5:20 PM
Late to this, but would love to be added!
November 20, 2024 at 7:17 PM