Andre Kahles
@akkah21.bsky.social
While MetaGraph provides a lossless representation of the input k-mer set, it is not a lossless compression of the raw reads. To reach petabase scale, we remove noisy k-mers prior to indexing — a step that we show has only minimal impact on search sensitivity.
October 8, 2025 at 8:56 PM
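The post above does not say how noisy k-mers are identified. As a rough illustration only, one common cleaning heuristic is to drop k-mers seen fewer than some number of times, on the assumption that sequencing errors produce rare k-mers. The function names and the `min_count` threshold below are hypothetical and are not MetaGraph's actual cleaning procedure.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def clean_kmers(reads, k, min_count=2):
    """Keep only k-mers observed at least min_count times (assumed heuristic)."""
    counts = count_kmers(reads, k)
    return {kmer for kmer, n in counts.items() if n >= min_count}

# Toy input: the third read differs by one base and yields two singleton k-mers.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
kept = clean_kmers(reads, k=4, min_count=2)
```

In this toy input the k-mers unique to the third read are discarded, which is why the result is a lossless index of the cleaned k-mer set rather than of the raw reads.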
We show that MetaGraph indexes are both scalable and cost-efficient to query. Searching 1 Mbp of sequence against the entire SRA costs less than $1 on standard cloud infrastructure, making petabase-scale biological data truly searchable and accessible.
October 8, 2025 at 8:56 PM
Our indexes support fast exact matching as well as alignment with edits. Labels can represent sample metadata, coordinates, or quantification values. We can store 10,000 human transcriptome samples in under 160 GB and return position-wise expression for any queried sequence.
October 8, 2025 at 8:56 PM
We have already processed more than 10 petabases of raw sequence data from the SRA and have made the compressed indexes publicly available for search (metagraph.ethz.ch), download, and cloud-based access.
October 8, 2025 at 8:56 PM
At its core, MetaGraph represents all input sequences as labeled, succinct de Bruijn graphs — a highly compressed yet fully searchable structure. Each k-mer carries metadata labels that remain interactively queryable through a flexible API.
October 8, 2025 at 8:56 PM
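To make the idea of a labeled k-mer index concrete, here is a minimal sketch that maps each k-mer to the set of sample labels containing it and answers exact-match queries. It uses a plain Python dictionary, so it is neither succinct nor a real de Bruijn graph implementation, and none of the names below come from the MetaGraph API; they are assumptions for illustration.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(samples, k):
    """Map each k-mer to the set of sample labels that contain it."""
    index = defaultdict(set)
    for label, seq in samples.items():
        for kmer in kmers(seq, k):
            index[kmer].add(label)
    return index

def query(index, seq, k):
    """Return, per label, the fraction of the query's k-mers found under it."""
    query_kmers = list(kmers(seq, k))
    hits = defaultdict(int)
    for kmer in query_kmers:
        # .get avoids creating empty entries for unseen k-mers
        for label in index.get(kmer, ()):
            hits[label] += 1
    return {label: n / len(query_kmers) for label, n in hits.items()}

# Hypothetical toy samples; real inputs would be sequencing experiments.
samples = {"sample_A": "ACGTACGTGA", "sample_B": "TTACGTGACC"}
index = build_index(samples, k=4)
result = query(index, "ACGTAC", k=4)
```

Here every k-mer of the query occurs in `sample_A`, while only one of the three occurs in `sample_B`, so the per-label match fractions differ. A succinct representation would replace the dictionary with compressed graph and annotation structures while keeping the same query semantics.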