Luke Zappia
@lazappi.bsky.social
Bioinformatician, data scientist, software developer

Also @_lazappi_ and @lazappi@mastodon.au
We focused on feature selection methods, but we also compared scANVI and Harmony/Symphony to our scVI baseline. Feature selection methods performed similarly, but scANVI scored higher overall and Symphony worse, particularly at unseen population detection. More work is needed to understand why.

12/16
March 18, 2025 at 3:40 PM
What about lineage-specific integration? Using subsets of the Human Lung Cell Atlas, we saw poorer performance on individual lineages than on the full dataset, particularly for unseen population detection, but a full study is needed to answer this properly.

11/16
March 18, 2025 at 3:40 PM
Highly variable features performed consistently well, especially the Seurat VST method. Supervised marker genes also scored highly but were more variable and require cell labels. Check out triku for an alternative approach that performs similarly.

9/16
March 18, 2025 at 3:40 PM
Most methods require setting a number of features. We tried different numbers for some common methods and used 2000 for the rest of the benchmark. Slightly more features improves the query metrics while slightly fewer improves the integration, but this should be tuned to your dataset and use case.

8/16
March 18, 2025 at 3:40 PM
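To illustrate the number-of-features knob, here is a minimal Python sketch (not the benchmark's actual code) that ranks genes by raw variance and keeps the top n, a simplified stand-in for HVG methods such as Seurat's VST; the `select_top_variable` name and the synthetic counts are my own:

```python
import numpy as np

def select_top_variable(matrix, n_features=2000):
    """Rank genes (columns) by variance across cells (rows) and
    return the indices of the top n_features."""
    variances = matrix.var(axis=0)
    order = np.argsort(variances)[::-1]  # highest variance first
    return order[:n_features]

# Synthetic cells-by-genes count matrix for demonstration.
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 50)).astype(float)
print(len(select_top_variable(counts, n_features=10)))  # 10
```

In practice you would vary `n_features` (e.g. 1000, 2000, 5000) and compare the integration and query metrics, as the thread suggests.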
Even well-designed metrics have different effective ranges. We used a set of positive and negative baseline methods to scale each metric to a range that was meaningful for this task, providing extra context. Scaled scores were combined to summarise each metric category.

7/16
March 18, 2025 at 3:40 PM
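The baseline scaling described above can be sketched as a simple min-max rescaling against the two controls; the numbers here are hypothetical, not scores from the paper:

```python
def scale_metric(score, negative_baseline, positive_baseline):
    """Rescale a raw metric score so the negative-control baseline
    maps to 0 and the positive-control baseline maps to 1."""
    return (score - negative_baseline) / (positive_baseline - negative_baseline)

# Hypothetical raw scores: a random-features control scored 0.2,
# a positive control 0.9, and a method under test 0.55.
print(round(scale_metric(0.55, negative_baseline=0.2, positive_baseline=0.9), 2))  # 0.5
```

Scaled scores within a category can then be averaged to give one summary score per category.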
We spent a lot of time selecting a final set of effective, non-redundant metrics that are independent of technical factors. We did this by simulating methods that select random features. This was an important part of the study and something I think more benchmarks should show.

6/16
March 18, 2025 at 3:40 PM
The benchmark was implemented as a @Nextflow workflow with each step a separate R or Python script. Having this set up before starting the project allowed everyone to start contributing right away. Check out the code on GitHub github.com/theislab/atl....

5/16
March 18, 2025 at 3:40 PM
We used a standard benchmark design. Test datasets were split into query and reference sets, with features selected on the reference. The reference was then integrated and the query samples mapped onto it. Metrics then measured different aspects of integration and reference usage.

3/16
March 18, 2025 at 3:40 PM
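The reference/query split described above can be sketched like this; it is a toy illustration with a made-up `split_reference_query` helper, not the workflow's actual splitting logic:

```python
import random

def split_reference_query(sample_ids, query_fraction=0.3, seed=0):
    """Randomly hold out a fraction of samples as the query;
    the remainder form the reference used for feature selection
    and integration."""
    rng = random.Random(seed)
    samples = list(sample_ids)
    rng.shuffle(samples)
    n_query = int(len(samples) * query_fraction)
    return samples[n_query:], samples[:n_query]  # (reference, query)

reference, query = split_reference_query([f"sample{i}" for i in range(10)])
print(len(reference), len(query))  # 7 3
```

Splitting by whole samples (rather than individual cells) keeps batch structure intact, which matters for evaluating query mapping.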
Final #scverse conference keynote Fabian Theis "From scanpy to the virtual cell: the coming-of-age of single cell analysis" #sketchnotes
September 12, 2024 at 10:18 AM
#scverse conference keynote Maria Brbic "Towards AI-driven discoveries in Single-Cell genomics" #sketchnotes
September 12, 2024 at 9:27 AM
#scverse conference keynote Alex Wolf "Many anecdotes make a novel? Study-centered analysis & training models" #sketchnotes
September 11, 2024 at 12:24 PM
#scverse conference keynote Christina Leslie "Machine learning for regulatory genomics at single-cell resolution" #sketchnotes
September 11, 2024 at 9:47 AM
Angela Oliveira Pisco #scverse conference keynote "Multimodal Atlas for Biological Data Analysis and Drug Discovery" #sketchnotes
September 10, 2024 at 12:24 PM
#scverse conference first keynote @robp.bsky.social "Upstream of the #singlecell data deluge" #sketchnotes
September 10, 2024 at 9:15 AM
Using alpha to represent variability, with size showing the mean and colour the direction. I don't think this adds much over colour = mean, size = SD, and it is harder to understand.
April 11, 2024 at 8:48 AM
Some ideas:

- Scale size by variability (circle/square)
- Donuts (or squares with holes) where outside size is Mean+SD and inside size is Mean-SD
April 11, 2024 at 8:41 AM
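A rough matplotlib sketch of the donut idea, using hypothetical means and SDs: the outer disc is scaled by mean + SD and an inner "hole" by mean - SD, so the visible ring width reflects variability.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripting
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical per-group means and standard deviations.
means = np.array([1.0, 2.5, 4.0])
sds = np.array([0.5, 1.5, 0.2])
x = np.arange(len(means))

fig, ax = plt.subplots()
# Outer disc: area tracks mean + SD.
ax.scatter(x, np.zeros_like(x), s=200 * (means + sds), color="steelblue")
# Inner hole: area tracks mean - SD (clipped so it never goes negative).
ax.scatter(x, np.zeros_like(x), s=200 * np.clip(means - sds, 0, None), color="white")
fig.savefig("donuts.png")
```

Drawing the hole as a white overlay is a quick approximation; true donuts would need unfilled markers or wedge patches.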