Karel Břinda
brinda.eu
‖ Permanent Researcher / INRIA Start. Faculty @ INRIA Rennes 🇫🇷 ‖
BioInfo/CompBio: algorithms, genomics, pathogens & rapid diagnostics of antibiotic resistance | https://brinda.eu | https://github.com/karel-brinda
Does this scale?
Yes. Fast, parallel, and batchable.

A compute cluster can phylogenetically compress millions of genomes overnight (and much more speedup is still possible).

Compression gains?
1–3 orders of magnitude, depending on the structure of data.

Software: github.com/karel-brinda... 11/
April 11, 2025 at 3:16 PM
Here's the protocol:
1️⃣ Cluster genomes by species (e.g. Kraken)
2️⃣ Split into size/diversity-balanced batches
3️⃣ Build phylogenies (e.g. Mash+NJ)
4️⃣ Reorder+compress (e.g. XZ)

Works for:
🧬 Assemblies
🧬 de Bruijn graphs
🧬 Bloom filters
🧬 k-mer indexes
🧬 sketches

General principle, many use cases. 10/
April 11, 2025 at 3:14 PM
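The reorder+compress steps of the protocol above can be illustrated with a toy sketch. This is not the authors' pipeline: the simulated genomes, the helper names (`simulate`, `sketch`, `rough_order`, `xz_size`), the subsampled k-mer sets (a crude stand-in for Mash), and the greedy nearest-neighbour chain (a crude stand-in for a Mash+NJ tree order) are all assumptions made for illustration. The XZ dictionary is deliberately shrunk to 64 KiB so that this small dataset behaves like a collection far larger than the compressor's window.

```python
import lzma
import random
import zlib

K = 15  # k-mer length (assumed value for this toy example)

def simulate(n_lineages=8, per_lineage=5, length=20_000, rate=0.01, seed=0):
    """Toy collection: each lineage is a random ancestor plus mutated copies."""
    rng = random.Random(seed)
    lineages = []
    for _ in range(n_lineages):
        ancestor = bytes(rng.choices(b"ACGT", k=length))
        members = []
        for _ in range(per_lineage):
            g = bytearray(ancestor)
            for pos in rng.sample(range(length), int(length * rate)):
                g[pos] = rng.choice(b"ACGT")
            members.append(bytes(g))
        lineages.append(members)
    # Interleave lineages so relatives sit far apart, like an unsorted database.
    return [lineages[i][j] for j in range(per_lineage) for i in range(n_lineages)]

def sketch(genome):
    """Subsampled set of k-mer hashes (keeps roughly 1 in 16 k-mers)."""
    hashes = (zlib.crc32(genome[i:i + K]) for i in range(len(genome) - K + 1))
    return {h for h in hashes if h % 16 == 0}

def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def rough_order(genomes):
    """Greedy nearest-neighbour chain: a cheap proxy for a tree's leaf order."""
    sketches = [sketch(g) for g in genomes]
    order, left = [0], set(range(1, len(genomes)))
    while left:
        last = sketches[order[-1]]
        nxt = min(left, key=lambda i: (jaccard_distance(last, sketches[i]), i))
        order.append(nxt)
        left.remove(nxt)
    return order

def xz_size(genomes, dict_size=1 << 16):
    """Size in bytes of the XZ-compressed concatenation (small dictionary)."""
    filters = [{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": dict_size}]
    data = b"".join(genomes)
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))

genomes = simulate()
order = rough_order(genomes)
before = xz_size(genomes)
after = xz_size([genomes[i] for i in order])
print(f"input order: {before} B, reordered: {after} B")
```

In this toy setting, the reordered stream should compress noticeably better because each genome's closest relative now falls inside the compressor's window. The exact numbers depend on the simulation parameters and say nothing about real collections.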
But doesn't this require phylogenetic inference? Isn’t that hard?

Yes and no.

No need for perfect trees – just rough evolutionary orderings – cheap to compute.

We split genomes into phylogenetically related batches, then reorder each batch via a tree.

Enough for big compression gains. 9/
April 11, 2025 at 3:12 PM
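To make "rough evolutionary ordering" concrete, here is a minimal sketch that builds a crude tree by average-linkage clustering and reads the genomes off in leaf order. It is an illustration under assumptions, not the method from the paper: the thread mentions Mash+NJ, whereas this uses plain Hamming distances on made-up, equal-length toy sequences and a UPGMA-style merge; `leaf_order` and `hamming` are hypothetical helper names.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def leaf_order(names, dist):
    """Average-linkage agglomeration; returns leaves in left-to-right tree order."""
    clusters = [[n] for n in names]

    def avg(a, b):
        total = sum(dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two closest clusters; concatenating their leaf lists
        # keeps every subtree's leaves contiguous in the final order.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: avg(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]

# Toy batch: two made-up lineages (A*, B*) given in mixed order.
seqs = {
    "A1": "ACGTACGTAC",
    "B1": "TTGCATGCAA",
    "A2": "ACGTACGTAT",
    "B2": "TTGCATGCAT",
    "A3": "ACGAACGTAT",
}
dist = {frozenset((a, b)): hamming(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)}
order = leaf_order(list(seqs), dist)
print(order)
```

The tree does not need to be accurate for this to help: any ordering that keeps close relatives adjacent serves the downstream compressor.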
In our bookshelf analogy, redundancies once separated by tens of kilometers now sit just decimeters apart.

So even basic compressors can easily pick them up.

Here's what the resulting structure looks like in Bloom filters (columns = genomes, rows = k-mer hashes). Highly locally compressible! 8/
April 11, 2025 at 3:11 PM
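A small simulation can show why a phylogenetically ordered Bloom filter matrix is locally compressible. Everything here is an assumption for illustration: the toy lineages, the single hash function (`h % m`), the parameter values, and the use of 0/1 flips along rows as a crude proxy for compressibility. The lineage labels are known in advance, so grouping columns by lineage stands in for a real tree-based ordering.

```python
import random

def toy_bloom_columns(n_lineages=4, per_lineage=8, core=300, m=2048, seed=1):
    """One Bloom filter column (set of 1-bit rows) per simulated genome."""
    rng = random.Random(seed)
    columns = []  # list of (lineage id, set of row indices that are 1)
    for lineage in range(n_lineages):
        core_hashes = [rng.getrandbits(32) for _ in range(core)]
        for _ in range(per_lineage):
            # Each genome keeps ~95% of its lineage's k-mers
            # and adds a few private ones.
            kept = [h for h in core_hashes if rng.random() < 0.95]
            private = [rng.getrandbits(32) for _ in range(core // 20)]
            columns.append((lineage, {h % m for h in kept + private}))
    return columns, m

def row_transitions(columns, m):
    """Total number of 0/1 flips between adjacent columns, summed over rows."""
    flips = 0
    for row in range(m):
        bits = [row in col for _, col in columns]
        flips += sum(a != b for a, b in zip(bits, bits[1:]))
    return flips

columns, m = toy_bloom_columns()
shuffled = columns[:]
random.Random(2).shuffle(shuffled)
grouped = sorted(shuffled, key=lambda c: c[0])  # relatives side by side

print("flips, shuffled columns:", row_transitions(shuffled, m))
print("flips, grouped columns: ", row_transitions(grouped, m))
```

Fewer flips mean longer runs of identical bits along each row, which is the kind of local redundancy that even simple compressors can exploit.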
The number of sequenced microbes is growing exponentially, but computational power grows at a slower rate.

The result?

BLAST – the "Google of biology" – can search only a fraction of sequenced microbes, and that fraction is shrinking exponentially over time. 2/
April 11, 2025 at 3:03 PM
A decade ago, we had thousands of bacterial genomes. Now, we have millions. How do we scale computational methods?

Our paper in @naturemethods.bsky.social answers this: use evolutionary history to guide compression and search.

rdcu.be/eg4OA

w/ @baym.lol, @zaminiqbal.bsky.social et al. 🧵1/
April 11, 2025 at 3:01 PM