Lightnews — Scholar-powered news

Tal Korem

@tkorem.bsky.social

So you look at this figure and your interpretation is "no signal"?

September 18, 2025 at 11:50 PM

Tal Korem

@tkorem.bsky.social

Its multi-task version allows DEBIAS-M to learn models for multiple tasks at the same time, further increasing its performance. This is particularly useful for tasks such as metabolite level predictions, where we want to predict multiple metabolite levels using the same microbiome data. 6/7

Boxplots showing performance on metabolite prediction (each point is a different metabolite). The y-axis is Spearman correlation; the x-axis shows different methods. Prediction using raw data is nearly random (median correlation of ~0). MelonnPan improves substantially to a median of ~0.25. DEBIAS-M and multi-task DEBIAS-M improve this further, with a median Spearman of ~0.3.

March 27, 2025 at 4:01 PM

Tal Korem

@tkorem.bsky.social

Next, the changes DEBIAS-M makes to the data are interpretable and explained by differences in experimental protocols. Analyzing the biases inferred for these 17 gut microbiome studies in HIV, we found that 84% of the variance can be explained by just three experimental factors. 4/7

An analysis of 17 studies of the gut microbiome in the context of HIV. On the top is an Adonis analysis, showing that 43% of the variance in inferred experimental biases is explained by DNA extraction kit, 27% by the 16S gene region, and 14% by the type of swab used for sample collection. On the bottom is a PCA of inferred biases, where every dot is a study. There is apparent clustering by extraction kit type and 16S gene region.

March 27, 2025 at 4:01 PM

Tal Korem

@tkorem.bsky.social

This results in several benefits. First, in diverse benchmarks - using metagenomics and 16S sequencing, vaginal and gut microbiomes, and phenotypic and metabolite predictions - DEBIAS-M outperforms alternative methods. Here is an example for a gut 16S-based HIV classification across 17 studies. 3/7

Boxplots with auROC on the y-axis and different methods on the x-axis. Title reads: "DEBIAS-M improves cross-study HIV prediction". Raw data has a median auROC of ~0.5; ComBat, ConQuR, percnorm, Voom-SNM, MMUPHin and PLSDA-batch have median auROCs between ~0.5 and ~0.6. The median auROC of DEBIAS-M is close to 0.7 and is significantly higher than all the rest.

March 27, 2025 at 4:01 PM

Tal Korem

@tkorem.bsky.social

But CV is used not just for evaluation but also for hyperparameter tuning, and distributional bias impacts HPs that affect regression to the mean. For example, we show that it biases for weaker model regularization, which might affect generalization and downstream deployment.

A comparison of LOOCV and Rebalanced LOOCV evaluation of logistic regression models with varying regularization strength on one of the evaluations analyzed above. LOOCV has the best auROC (of 0.817) with weak regularization (1e-6 - 1e-2) while Rebalanced LOOCV has the best auROC (of 0.845) with strong regularization (100 - 1e5).

June 11, 2024 at 1:51 PM

Tal Korem

@tkorem.bsky.social

With RebalancedCV we could see the "real-life" impact of distributional bias. We reproduced 3 recently published analyses that used LOOCV, and showed that it under-evaluated performance in all of them. While the effect isn't major, it is consistent.

A reanalysis of 4 evaluations from 3 recently published studies comparing leave-one-out cross-validation (LOOCV) to a rebalanced version (RLOOCV), demonstrating the impact of distributional bias. Panel A shows two ROC curves of preterm birth prediction using vaginal microbiome data. LOOCV has an auROC of 0.692 while RLOOCV has auROC=0.697. Panel B is an ROC curve of a model predicting toxicity to immune checkpoint inhibitor blockade using T-cell measurements. LOOCV has auROC=0.817 while RLOOCV has auROC=0.833. Panel C is an ROC curve of a gradient boosted regressor model predicting chronic fatigue syndrome using blood test measurements. LOOCV has auROC=0.818 while RLOOCV has auROC=0.824. Panel D is the same analysis with an XGBoost model. LOOCV has an auROC of 0.796 while RLOOCV has auROC=0.817.
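The posts do not spell out how the rebalanced splits are built, so the following is only a sketch under an assumption: for each held-out sample, one randomly chosen training sample with the opposite label is also dropped, which keeps the training class balance identical in every fold. The function names are hypothetical and are not the RebalancedCV API. To make the effect visible, the "model" scores each test sample with the negative mean of its training labels.

```python
import random

def auroc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def rebalanced_loo_splits(labels, rng):
    """Yield (train_indices, test_index). ASSUMED scheme: besides the held-out
    sample, drop one random training sample with the opposite label."""
    for i, y in enumerate(labels):
        opposite = [j for j, other in enumerate(labels) if other != y]
        dropped = rng.choice(opposite)
        train = [j for j in range(len(labels)) if j not in (i, dropped)]
        yield train, i

labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # arbitrary example labels
rng = random.Random(0)
scores = [0.0] * len(labels)
for train, test in rebalanced_loo_splits(labels, rng):
    # Label-only "model": negative mean of the training labels.
    scores[test] = -sum(labels[j] for j in train) / len(train)

print(auroc(labels, scores))
```

Under this assumed scheme every training set contains the same number of positives, so the label-only scores are all tied and the pooled auROC is 0.5, with no information leaking from the held-out label.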

June 11, 2024 at 1:51 PM

Tal Korem

@tkorem.bsky.social

As the issue is caused by a shift in the class balance of the training set, distributional bias can be addressed with stratified CV - but only if your dataset allows it to happen precisely. The less exact the stratification - the more bias you have (in this plot, closer to 0).

A heatmap showing the average auROC under stratified leave-P-out cross-validation. The x-axis shows P from 1-10, and the y-axis shows class balances ranging from 0.1 to 0.9. The heatmap shows that stratification corrects for distributional bias (i.e., has an auROC of 0.5 for random data) only when exact stratification is possible. For example, with leave-1-out cross-validation, an exact stratification is never possible, and the auROC=0 for all class balances. For leave-10-out CV, exact stratification is always possible for the class balances tested, so the auROC is always close to 0.5. For leave-5-out cross-validation, however, exact stratification is possible only for some class balances. For class balances of 0.2, 0.4, 0.6 and 0.8, the auROC is 0.5; for the rest, it is significantly lower than 0.5.
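The pattern in the heatmap is consistent with a simple arithmetic condition, sketched below as an assumption rather than as the authors' definition: a test fold of size P can mirror the class balance exactly only if P times the balance is a whole number. The helper name is hypothetical, and the check ignores any further constraints from the total dataset size.

```python
from fractions import Fraction

def exact_stratification_possible(p, balance):
    """True if a test fold of size p can hold exactly balance * p positives."""
    return (Fraction(balance) * p).denominator == 1

# Class balances 0.1 ... 0.9, kept as exact fractions to avoid float error.
balances = [Fraction(k, 10) for k in range(1, 10)]

for p in (1, 5, 10):
    ok = [float(b) for b in balances if exact_stratification_possible(p, b)]
    print(f"leave-{p}-out: exact for balances {ok}")
```

For P=1 no balance qualifies, for P=5 only 0.2, 0.4, 0.6 and 0.8 do, and for P=10 all nine do, matching the three cases described in the alt text.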

June 11, 2024 at 1:50 PM

Tal Korem

@tkorem.bsky.social

Distributional bias is a severe information leakage - so severe that we designed a dummy model that can achieve perfect auROC/auPR in ANY binary classification task evaluated via LOOCV (even without features). How? It just outputs the negative mean of the training set labels!

A receiver operating characteristic curve of a dummy predictor that always outputs a score equal to the negative of the average of the training set's labels. The curve goes from 0,0 to 0,1 to 1,1, having an area under the curve of 1.
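A minimal sketch of that dummy predictor, using only the labels and no features. The function names and the example labels are illustrative choices, not taken from the authors' code.

```python
def auroc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def loocv_dummy_scores(labels):
    """For each held-out sample, score = -(mean of the remaining labels)."""
    total, n = sum(labels), len(labels)
    return [-(total - y) / (n - 1) for y in labels]

labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # arbitrary imbalanced labels
scores = loocv_dummy_scores(labels)
print(auroc(labels, scores))
```

Holding out a positive leaves fewer positives in the training set, so its score (the negated training mean) is higher than the score of any held-out negative. The pooled predictions therefore rank every positive above every negative.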

June 11, 2024 at 1:49 PM

Tal Korem

@tkorem.bsky.social

The issue is that every time one holds out a sample as a test set in LOOCV, the mean label of the training set shifts slightly, creating a perfect negative correlation across the folds between that mean and the test labels. We call this phenomenon distributional bias:

An illustration of how leave-one-out cross-validation introduces distributional bias. On the left are N data set labels. In each training iteration, one sample is held out as a test set. When that sample has a positive label, it shifts the average of the training set's labels down. When that sample has a negative label, it shifts the average of the training set's labels up. This creates a perfect negative correlation, across training iterations, between the average of the training set's labels and the held-out labels.
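The correlation can be checked directly. With N labels summing to S, holding out a sample with label y leaves a training mean of (S - y) / (N - 1), which is a decreasing linear function of y. A short sketch with arbitrary example labels:

```python
from statistics import mean

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

labels = [1, 1, 0, 0, 0, 0, 1, 0]  # arbitrary example labels
n, total = len(labels), sum(labels)

# Mean of the training labels in each LOOCV fold (one fold per sample).
train_means = [(total - y) / (n - 1) for y in labels]

r = pearson(train_means, labels)
print(r)
```

Because the training mean is an exact linear function of the held-out label with a negative slope, the coefficient is -1 up to floating-point error, whatever the labels are (as long as both classes are present).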

June 11, 2024 at 1:48 PM

Tal Korem

@tkorem.bsky.social

This story begins with benchmarking we did for some of our machine learning pipelines. We used random data, so we expected to see random classification accuracy (auROC=0.5). Instead, we found a clear negative bias, which got worse with more imbalanced datasets:

Box and swarm plots showing a leave-one-out cross-validation evaluation of a logistic regression model on random data. Y-axis shows auROC, x-axis shows class balances ranging from 0.1 to 0.9. In all cases the auROC is significantly lower than 0.5, with lower medians for more extreme class imbalances.

June 11, 2024 at 1:48 PM

Tal Korem

@tkorem.bsky.social

A bit of background: when training models on small datasets it’s common to use LOOCV, as it maximizes the N of samples for training. It also leaves a single sample for testing, meaning that many performance metrics (e.g., area under ROC curve) require aggregation across folds/iterations.

Illustration of performance evaluation (specifically, for classification) using LOOCV. For N samples there are N training iterations. Each time a model is trained on N-1 samples and evaluated on a held out sample. To calculate common performance metrics (e.g., auROC, auPR), predictions are aggregated across training iterations.

June 11, 2024 at 1:47 PM

Tal Korem

@tkorem.bsky.social

Very apt sequence from the other place
( @baym.lol )

A tweet from Michael Baym saying: "The depth of burnout I feel doing anything remotely grant-related is starting to make me question whether I'm actually cut out to be a PI" Followed by a tweet by Michael Eisen saying "It's the hope that kills you"

May 15, 2024 at 12:09 AM

Tal Korem

@tkorem.bsky.social

This is the same as predicting preterm birth with "Blautia (CLR)" or "empty feature (CLR)" - creating a legitimate microbiome predictor - just not one that’s easy to interpret.

Two panels from above - one showing the association of CLR transformed Blautia with preterm birth and another showing the association of the empty feature with the sample label in the simulated database.

February 22, 2024 at 3:15 PM

Tal Korem

@tkorem.bsky.social

Wait, but didn't Gihawi et al. run Poore et al.'s code on a matrix of zeros and get an accurate classifier?
Many got this impression, but the text is clear on what was done - they took a subset of the processed Voom-SNM matrix.

screenshot from Gihawi et al., with the following text:
This produced a matrix containing 16,567 samples and 170 genera in which all values were zero. No machine-learning classifier can use such data to discriminate among cancer types, because every entry in the matrix is identical.
We then populated each cell in the empty matrix with its corresponding value from the Voom-SNM normalized data.

February 22, 2024 at 3:14 PM

Tal Korem

@tkorem.bsky.social

We can actually see this in three of the four examples that Gihawi et al. highlight: a simple CLR transform (sample-wise - so no leakage) recreates the same observation of values associated with a tumor type. Here it is for a weird virus and adrenocortical carcinoma:

A figure showing a reanalysis of Figure 2 from Gihawi et al. On the left, box plots showing that alpha diversity is associated with adrenocortical carcinoma. In the middle, the relative abundance of Hepandensovirus, which was not detected in almost all samples. On the right, the CLR transformed values of the same virus, which are now strongly associated with adrenocortical carcinoma.

February 22, 2024 at 3:13 PM

Tal Korem

@tkorem.bsky.social

Once more - this Blautia OTU is not really there, and it is definitely not related to preterm birth - but it is a real microbiome signature: it represents the (inverse) alpha diversity of the samples.

A scatter plot showing a strong association (Pearson's R = -0.62) of the CLR transformed Blautia sp. with Shannon alpha diversity.

February 22, 2024 at 3:12 PM

Tal Korem

@tkorem.bsky.social

But what's biologically "real" about the geometric mean? Well, for example, it's related to alpha diversity.
To show this, we analyze a real vaginal microbiome dataset. We take the sparsest feature - probably not really there - and once again, after CLR, it's associated with preterm birth.

A figure depicting an analysis of a vaginal microbiome dataset. On the left is a boxplot showing that alpha diversity is associated with preterm birth. In the middle is a swarm plot showing that a Blautia sp. was detected in only one sample. On the right is the same Blautia sp. after CLR transformation, which is strongly correlated with preterm birth.

February 22, 2024 at 3:12 PM

Tal Korem

@tkorem.bsky.social

First, we simulate a 50:50 case:control study in which case samples have a higher geometric mean. We then add an all-zero feature. After CLR? That feature has values and they are perfectly associated with the phenotype.
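A toy version of this simulation can be sketched as follows. The count ranges, the number of features, and the pseudocount of 1 used to handle zeros are all assumptions made for illustration; the post does not say how zeros were treated or how the data were generated.

```python
import math
import random

def clr(counts, pseudocount=1.0):
    """Centered log-ratio: log of each value minus the sample's mean log.
    The pseudocount (an assumption here) makes zeros loggable."""
    logs = [math.log(c + pseudocount) for c in counts]
    center = sum(logs) / len(logs)  # log of the sample's geometric mean
    return [v - center for v in logs]

rng = random.Random(0)
n_features = 5

# 50:50 design: cases get larger counts, hence a higher geometric mean.
# The trailing 0 in every sample is the added all-zero ("empty") feature.
cases = [[rng.randint(50, 100) for _ in range(n_features)] + [0]
         for _ in range(10)]
controls = [[rng.randint(1, 10) for _ in range(n_features)] + [0]
            for _ in range(10)]

case_vals = [clr(s)[-1] for s in cases]
control_vals = [clr(s)[-1] for s in controls]

# Perfect separation: every case value is below every control value.
print(max(case_vals) < min(control_vals))
```

The empty feature is identical (zero) in every sample before the transform. After CLR its value is just the negated log geometric mean of the sample, so it inherits the case/control difference and separates the two groups completely.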