Lightnews — Scholar-powered news

Jouni Sirén

@jltsiren.bsky.social

Exact matches are the easy cases. Either you have a single hit with a high mapping quality, or you take a random hit with a low mapping quality. The hard case is when you get multiple overlapping seeds for a read, with many hits each, and you need to choose the hits you try to align.

November 11, 2025 at 11:58 PM
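
A minimal sketch of the two easy cases in the post above, assuming a common simplification: if a read has n equally good exact hits and one is chosen uniformly at random, the chance of choosing the wrong one is 1 - 1/n, and the mapping quality is that probability on the Phred scale. The function name, the cap of 60, and the uniform-choice model are assumptions for illustration, not how any particular aligner computes mapping quality.

```python
import math


def exact_match_mapq(num_hits: int, cap: int = 60) -> int:
    """Phred-scaled mapping quality for a read with num_hits equally good exact hits.

    Toy model (an assumption): one hit is reported, chosen uniformly at random.
    """
    if num_hits < 1:
        raise ValueError("num_hits must be at least 1")
    if num_hits == 1:
        # A single hit: the error probability is 0 in this model, so use the cap.
        return cap
    p_wrong = 1.0 - 1.0 / num_hits
    return min(cap, round(-10.0 * math.log10(p_wrong)))


unique_hit = exact_match_mapq(1)   # high mapping quality
two_hits = exact_match_mapq(2)     # low mapping quality
ten_hits = exact_match_mapq(10)    # effectively zero
```

The hard case from the post (overlapping seeds with many hits each) has no such closed form, which is why it dominates the work.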

Jouni Sirén

@jltsiren.bsky.social

My impression is that the throughput of a fast read aligner is usually ~1 Mbp / CPU-second. Most of the time is spent on reads with many potential mappings, as the aligner needs to explore them until it's confident it has found the best one and can estimate the mapping quality.

November 11, 2025 at 11:11 PM
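
To put the ~1 Mbp / CPU-second figure from the post above in perspective, here is a back-of-the-envelope calculation. The genome size (3.1 Gbp) and coverage (30x) are assumed example values, not numbers from the post, and the result is only as good as the rough throughput estimate.

```python
def cpu_hours(total_bases: float, throughput_bp_per_cpu_second: float = 1e6) -> float:
    """CPU-hours needed to align total_bases at the given throughput."""
    return total_bases / throughput_bp_per_cpu_second / 3600.0


genome_size_bp = 3.1e9   # assumed: roughly a human genome
coverage = 30            # assumed: a typical short-read sequencing depth
estimate = cpu_hours(genome_size_bp * coverage)
print(f"{estimate:.1f} CPU-hours")
```

Under these assumptions the estimate is about 26 CPU-hours per sample.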

Jouni Sirén

@jltsiren.bsky.social

One intended use for the header lines is specifying which graphs can be used as a reference for the alignments. This will use stable graph names based on hashing a canonical GFA representation of the graph. The idea is similar to refget, but for graphs instead of sequences.

GitHub - jltsiren/pggname: Pangenome graph naming based on hashing in a canonical order

github.com

October 31, 2025 at 3:27 AM
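
A toy illustration of the idea in the post above: derive a name from a hash of the graph's content in a fixed order, so that the same graph always gets the same name regardless of how the GFA file happens to be laid out. This is not the pggname algorithm. The function name, the choice of SHA-256, and the canonical order used here (sorted S and L records, everything else ignored) are assumptions, and the sketch does not handle equivalent encodings of the same link.

```python
import hashlib


def toy_graph_name(gfa_text: str) -> str:
    """Hash the segments and links of a GFA file in sorted order (toy canonical form)."""
    records = []
    for line in gfa_text.splitlines():
        fields = line.rstrip().split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            # Segment: keep the name and the sequence, drop optional tags.
            records.append(("S", fields[1], fields[2]))
        elif fields[0] == "L" and len(fields) >= 5:
            # Link: keep both endpoints and their orientations.
            records.append(("L", fields[1], fields[2], fields[3], fields[4]))
    hasher = hashlib.sha256()
    for record in sorted(records):
        hasher.update("\t".join(record).encode("utf-8"))
        hasher.update(b"\n")
    return hasher.hexdigest()


graph_a = "H\tVN:Z:1.1\nS\t1\tGAT\nS\t2\tTACA\nL\t1\t+\t2\t+\t0M\n"
graph_b = "L\t1\t+\t2\t+\t0M\nS\t2\tTACA\nS\t1\tGAT\n"   # same graph, different line order
graph_c = "S\t1\tGAT\nS\t2\tTACC\nL\t1\t+\t2\t+\t0M\n"   # one base changed

name_a = toy_graph_name(graph_a)
name_b = toy_graph_name(graph_b)
name_c = toy_graph_name(graph_c)
```

Here `name_a` and `name_b` agree while `name_c` differs, which is the property a header line would need in order to say which graphs are valid references for a set of alignments.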

Jouni Sirén

@jltsiren.bsky.social

So maybe we need some kind of stable identifiers (hashes?) for pangenome graphs. And then we need a way of storing graph / parent identifiers in GFA and alignment files. 7/7

August 28, 2025 at 12:49 AM

Jouni Sirén

@jltsiren.bsky.social

We also need a way of specifying the correct reference for reconstructing the reads. That's not as easy with graphs as with linear sequences. For example, if you have aligned the reads to a subgraph (e.g. a personalized graph), the supergraph (e.g. a clipped graph) is also a valid reference. 6/n

August 28, 2025 at 12:49 AM

Jouni Sirén

@jltsiren.bsky.social

While working on GAF-base, I realized that GAF is not the file format I want to use. GAF prioritizes numerical statistics, while the information needed for reconstructing the read and the alignment is optional. In archival and variant calling, it should be the opposite. 5/n

August 28, 2025 at 12:49 AM

Jouni Sirén

@jltsiren.bsky.social

When used with GBZ-base, GAF-base allows extracting all reads overlapping with / contained in the subgraph. Queries with 10 kbp subgraphs are effectively instantaneous with short reads, while taking a second or two with long reads. 4/n

August 28, 2025 at 12:49 AM

Jouni Sirén

@jltsiren.bsky.social

Recently, I started working on another database: GAF-base. It could be described as "hacky pangenome CRAM in SQLite". GAF-base works, at least with reads aligned with Giraffe, and file sizes are typically somewhere between BAM and CRAM. 3/n

August 28, 2025 at 12:49 AM

Jouni Sirén

@jltsiren.bsky.social

It has been useful for investigating various vg issues, and there are also some external users. Sequence Tube Map would be a nice application, but we are not there yet. 2/n

GitHub - vgteam/sequenceTubeMap: displays multiple genomic sequences in the form of a tube map

github.com

August 28, 2025 at 12:49 AM

Jouni Sirén

@jltsiren.bsky.social

Reasoning about maximal paths is difficult, as they are global objects. Giovanni and Travis came up with an equivalent local property related to stable sorting. They called it Wheeler graphs, and that's when theoretical developments took off. 6/6

August 8, 2025 at 9:49 AM

Jouni Sirén

@jltsiren.bsky.social

BOSS was a parallel development for de Bruijn graphs. It inspired me to look into extending the GCSA from DAGs to more general graphs. We ended up with what is now known as Wheeler DFAs, characterized by non-overlapping lexicographic ranges of maximal path labels. 5/6

August 8, 2025 at 9:49 AM

Jouni Sirén

@jltsiren.bsky.social

That graph represents recombinations at aligned positions with a unique context after the position. By using a prefix-doubling algorithm, we can instead get a graph that represents recombinations at any aligned position. And that graph was GCSA. 4/6

August 8, 2025 at 9:49 AM

Jouni Sirén

@jltsiren.bsky.social

The analysis starts with a multiple sequence alignment and counts the number of runs of aligned suffixes in lexicographic order. If you collapse the runs into nodes, you get a graph that can be indexed with a slight extension of the XBWT. 3/6

August 8, 2025 at 9:49 AM
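
One possible reading of the post above, as a toy sketch: take a gap-free alignment (equal-length rows), sort all suffixes lexicographically, and group consecutive suffixes that start in the same alignment column. Each group is a run that could be collapsed into one node. The function name and the restriction to gap-free alignments are assumptions, and the sketch only counts runs; it does not build or index the graph.

```python
from itertools import groupby


def aligned_suffix_runs(rows: list[str]) -> list[tuple[int, int]]:
    """Return (column, run length) for each run of same-column suffixes in lexicographic order."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("toy model: rows must have equal length (gap-free alignment)")
    # With equal-length rows, identical suffixes always start in the same column,
    # so sorting by (suffix, column) keeps them together.
    suffixes = sorted((row[col:], col) for row in rows for col in range(len(row)))
    return [(col, len(list(group))) for col, group in groupby(suffixes, key=lambda s: s[1])]


identical = aligned_suffix_runs(["GATTACA"] * 3)
one_edit = aligned_suffix_runs(["GATTACA", "GATTACA", "GACTACA"])
print(len(identical), len(one_edit))
```

Three identical rows give 7 runs, one per column. A single substitution in one row raises the count only to 8, even though the number of suffixes is 21 in both cases.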

Jouni Sirén

@jltsiren.bsky.social

I contributed to a chapter on graph indexing before/beyond Wheeler graphs in Manzini's Festschrift. I wrote a bit on how the analysis of RLBWT under duplication and edits became GCSA, and how GCSA is related to Wheeler graphs. 2/6

Graph Indexing Beyond Wheeler Graphs

drops.dagstuhl.de

August 8, 2025 at 9:49 AM

Jouni Sirén

@jltsiren.bsky.social

Maybe you can write useful specs after an extensible file format has become popular and evolved over time. But then it will be difficult to convince people to switch to the new format.

August 5, 2025 at 11:05 PM

Jouni Sirén

@jltsiren.bsky.social

Well-specified formats would be nice. But it's too likely that the specification is obsolete. Or it lacks necessary features. Or it just doesn't exist, because key people can't agree on the details. (Or all three, as with most pangenome file formats.)

August 5, 2025 at 11:05 PM

Jouni Sirén

@jltsiren.bsky.social

As long as we are talking about academic code, relying on someone else's libraries is risky. They are often abandoned as labs lose funding, people move on or leave academia, and so on. A serious binary format should have independent implementations maintained by different labs.

August 5, 2025 at 8:41 PM
