BioInfo/CompBio: algorithms, genomics, pathogens & rapid diagnostics of antibiotic resistance《 https://brinda.eu | https://github.com/karel-brinda 》
Yes. Fast, parallel, and batchable.
A compute cluster can phylogenetically compress millions of genomes overnight (and further speedups are still possible).
Compression gains?
1–3 orders of magnitude, depending on the structure of data.
SW: github.com/karel-brinda... 11/
1️⃣ Cluster genomes via species (e.g. Kraken)
2️⃣ Split into size/diversity-balanced batches
3️⃣ Build phylogenies (e.g. Mash+NJ)
4️⃣ Reorder+compress (e.g. XZ)
Works for:
🧬 Assemblies
🧬 de Bruijn graphs
🧬 Bloom filters
🧬 k-mer indexes
🧬 sketches
General principle, many use cases. 10/
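A minimal sketch of steps 3️⃣ and 4️⃣ on a single batch, to make the idea concrete. It is not the paper's pipeline: exact k-mer Jaccard distances stand in for Mash, a crude single-linkage tree stands in for NJ, and Python's `lzma` module stands in for the `xz` tool. The genomes are made up, and all function names are hypothetical.

```python
import lzma
import random
from itertools import combinations


def kmers(seq, k=8):
    """Set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b):
    """1 - Jaccard similarity of two k-mer sets (exact, not a MinHash estimate)."""
    return 1.0 - len(a & b) / len(a | b)


def tree_order(genomes, k=8):
    """Leaf order of a crude single-linkage tree over {name: sequence}."""
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    dist = {frozenset(p): jaccard_distance(sets[p[0]], sets[p[1]])
            for p in combinations(genomes, 2)}
    clusters = [[name] for name in genomes]  # each cluster keeps its leaf order
    while len(clusters) > 1:
        # Merge the two clusters whose closest members are most similar.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(dist[frozenset((a, b))]
                               for a in clusters[ij[0]] for b in clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0]


def compress_in_order(genomes, order):
    """Concatenate genomes in the given order and XZ-compress the result."""
    data = "\n".join(genomes[name] for name in order).encode()
    return lzma.compress(data, format=lzma.FORMAT_XZ)


def toy_genomes(n_per_clade=4, length=2000, n_mutations=10, seed=0):
    """Two made-up clades, A and B, returned in interleaved (unsorted) order."""
    rng = random.Random(seed)
    ancestors = {c: [rng.choice("ACGT") for _ in range(length)] for c in "AB"}
    genomes = {}
    for i in range(n_per_clade):
        for clade, ancestor in ancestors.items():
            seq = list(ancestor)
            for pos in rng.sample(range(length), n_mutations):
                seq[pos] = rng.choice("ACGT")
            genomes[f"{clade}{i}"] = "".join(seq)
    return genomes


if __name__ == "__main__":
    genomes = toy_genomes()
    order = tree_order(genomes)
    print("input order:", list(genomes))
    print("tree order: ", order)
    print("XZ bytes, input order:", len(compress_in_order(genomes, list(genomes))))
    print("XZ bytes, tree order: ", len(compress_in_order(genomes, order)))
```

The tree order puts each clade's genomes next to each other. On a toy this small the whole input fits inside XZ's match window, so the size difference may be modest; the ordering should matter more once a batch is much larger than that window.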
Yes and no.
No need for perfect trees – just rough evolutionary orderings – cheap to compute.
We split genomes into phylogenetically related batches, then reorder each batch via a tree.
Enough for big compression gains. 9/
So even basic compressors can easily pick them up.
Here's what the resulting structure looks like in Bloom filters (columns = genomes, rows = k-mer hashes). Highly locally compressible! 8/
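A toy illustration of that matrix, assuming a single-hash Bloom filter per genome (CRC32 modulo the filter size) and two made-up clades; the names and parameters are hypothetical and not taken from the paper. It counts how often a bit flips between neighbouring columns along each row. Fewer flips means longer runs of identical bits for a compressor to exploit.

```python
import random
import zlib


def bloom_bits(seq, m=512, k=8):
    """Single-hash Bloom filter of a sequence's k-mers, as a list of m bits."""
    bits = [0] * m
    for i in range(len(seq) - k + 1):
        bits[zlib.crc32(seq[i:i + k].encode()) % m] = 1
    return bits


def row_transitions(columns):
    """Count 0/1 flips between adjacent columns, summed over all rows."""
    return sum(
        left[r] != right[r]
        for left, right in zip(columns, columns[1:])
        for r in range(len(left))
    )


def toy_clade(rng, length=300, n=4, n_mutations=3):
    """n near-identical made-up genomes derived from one random ancestor."""
    ancestor = [rng.choice("ACGT") for _ in range(length)]
    clade = []
    for _ in range(n):
        seq = list(ancestor)
        for pos in rng.sample(range(length), n_mutations):
            seq[pos] = rng.choice("ACGT")
        clade.append("".join(seq))
    return clade


if __name__ == "__main__":
    rng = random.Random(0)
    a, b = toy_clade(rng), toy_clade(rng)
    # Columns = genomes. Grouped: AAAA BBBB. Interleaved: A B A B A B A B.
    grouped = [bloom_bits(s) for s in a + b]
    interleaved = [bloom_bits(s) for pair in zip(a, b) for s in pair]
    print("bit flips along rows, grouped:    ", row_transitions(grouped))
    print("bit flips along rows, interleaved:", row_transitions(interleaved))
```

With related genomes in adjacent columns, most rows change value only at the boundary between the two clades.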
The result?
BLAST – the "Google of biology" – can search only a fraction of sequenced microbes, and that fraction is shrinking exponentially over time. 2/
Our paper in @naturemethods.bsky.social answers this: use evolutionary history to guide compression and search.
rdcu.be/eg4OA
w/ @baym.lol, @zaminiqbal.bsky.social et al. 🧵1/