Lightnews — Scholar-powered news

Reposted by Sergey Nurk

Sina Majidian

@sinamajidian.bsky.social

Fantastic talk by @vikramshivakumar.bsky.social Mumemto—Scalable multi-MUM finding for pangenomes
Papers biorxiv.org/content/10.1101/2025.05.20.654611 & doi.org/10.1186/s13059-025-03644-0
Code: github.com/vikshiv/mume...
Very efficient pangenome visualization tool, revealing synteny and variations!

Figure 1: (A) Anchor-based merging requires a common sequence (red) present in each partition. Multi-MUMs are merged by identifying overlaps between partition-specific matches in the anchor coordinate space, and a uniqueness threshold determines if a MUM is still unique in each partition after truncation. (B) String-based merging enables computation of multi-MUMs between partitions without a common sequence. An example tree (left) is shown, highlighting the use case where partial multi-MUMs specific to internal nodes (starred) can be computed by merging subclade-based partitions up a tree. (right) MUM overlaps are computed by running Mumemto on the MUM sequences, and the uniqueness threshold array ensures overlaps remain unique across the merged dataset. (C) An example Burrows-Wheeler Transform (BWT), matrix (BWM), and Longest Common Prefix (LCP) array, with sequence IDs for each suffix shown (ID). A non-maximal unique match (UM) is shown, and the uniqueness threshold for this match is found using the flanking LCP values. (D) A partial multi-MUM (in blue) is found in all-but-one sequence (excluded in red). Using two anchor sequences (red and orange), all-but-one partial MUMs can be computed using an augmented anchor-based merging method (section 2.6).
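
To make the term concrete: a multi-MUM is a substring that occurs exactly once in every sequence and cannot be extended on either side while still matching in all of them. The sketch below is a brute-force illustration of that definition only. It is not Mumemto's method (the caption describes BWT/LCP-based computation and merging), it ignores reverse complements, and the function names and the `min_len` parameter are hypothetical.

```python
def find_once(seq, s):
    """Return the position of s in seq if it occurs exactly once, else -1."""
    p = seq.find(s)
    if p == -1 or seq.find(s, p + 1) != -1:
        return -1
    return p


def naive_multi_mums(seqs, min_len=3):
    """Brute-force multi-MUMs: unique in every sequence and not extendable."""
    first = seqs[0]
    mums = []
    for i in range(len(first)):
        for length in range(min_len, len(first) - i + 1):
            s = first[i:i + length]
            pos = [find_once(seq, s) for seq in seqs]
            if -1 in pos:
                continue  # absent or repeated in at least one sequence
            # Extendable to the left if every copy is preceded by the same base.
            left = (all(p > 0 for p in pos)
                    and len({q[p - 1] for q, p in zip(seqs, pos)}) == 1)
            # Extendable to the right if every copy is followed by the same base.
            right = (all(p + length < len(q) for q, p in zip(seqs, pos))
                     and len({q[p + length] for q, p in zip(seqs, pos)}) == 1)
            if not left and not right:
                mums.append((s, tuple(pos)))
    return mums


seqs = ["AAACGTACGGTT", "CCACGTACGGAA", "GGACGTACGGCC"]
print(naive_multi_mums(seqs))
```

This runs in time far beyond what pangenome-scale data allows, which is the gap that index-based tools are designed to close.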

November 6, 2025 at 1:13 AM

Reposted by Sergey Nurk

Camille Marchet ⚡

@camillemrcht.bsky.social

Thread on #GI2025 's second day! 👇🏻

Sina Majidian @sinamajidian.bsky.social · 12d

Second day of Genome Informatics #GI2025 began with the session “Genome Assembly and Sequence Algorithms”. Yun William Yu presented “Average-case Analysis of Seed-Chain-Extend under Random Mutations”
genome.cshlp.org/content/33/7/1175
providing theoretical guarantees for the popular seed-chain-extend

Abstract: Seed-chain-extend with k-mer seeds is a powerful heuristic technique for sequence alignment used by modern sequence aligners. Although effective in practice for both runtime and accuracy, theoretical guarantees on the resulting alignment do not exist for seed-chain-extend. In this work, we give the first rigorous bounds for the efficacy of seed-chain-extend with k-mers in expectation. Assume we are given a random nucleotide sequence of length ∼n that is indexed (or seeded) and a mutated substring of length ∼m ≤ n with mutation rate θ < 0.206. We prove that we can find a k = Θ(log n) for the k-mer size such that the expected runtime of seed-chain-extend under optimal linear-gap cost chaining and quadratic time gap extension is O(m n^(f(θ)) log n), where f(θ) < 2.43 · θ holds as a loose bound. The alignment also turns out to be good; we prove that more than 1-o(sqrt(1/m)) fraction of the homologous bases is recoverable under an optimal chain. We also show that our bounds work when k-mers are sketched, that is, only a subset of all k-mers is selected, and that sketching reduces chaining time without increasing alignment time or decreasing accuracy too much, justifying the effectiveness of sketching as a practical speedup in sequence alignment. We verify our results in simulation and on real noisy long-read data and show that our theoretical runtimes can predict real runtimes accurately. We conjecture that our bounds can be improved further, and in particular, f(θ) can be further reduced.
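
For readers new to the heuristic, here is a minimal sketch of the seed and chain steps the abstract refers to. It assumes exact k-mer anchors, a simple quadratic-time chaining DP, and a linear gap penalty with a hypothetical `gap_cost` weight. The scoring is illustrative rather than the paper's exact formulation, and the extend step (aligning the gaps between chained anchors) is left out.

```python
from collections import defaultdict


def find_anchors(ref, query, k):
    """Seed: return sorted (ref_pos, query_pos) pairs of exact k-mer matches."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    anchors = []
    for j in range(len(query) - k + 1):
        for i in index.get(query[j:j + k], ()):
            anchors.append((i, j))
    anchors.sort()
    return anchors


def chain(anchors, k, gap_cost=1.0):
    """Chain: pick a highest-scoring colinear subset of anchors (O(N^2) DP)."""
    n = len(anchors)
    if n == 0:
        return []
    score = [float(k)] * n   # best chain score ending at each anchor
    prev = [-1] * n          # back-pointer to the previous anchor in the chain
    for b in range(n):
        ib, jb = anchors[b]
        for a in range(b):
            ia, ja = anchors[a]
            if ia < ib and ja < jb:
                gap = abs((ib - ia) - (jb - ja))   # diagonal shift
                new = min(k, ib - ia, jb - ja)     # newly covered bases
                cand = score[a] + new - gap_cost * gap
                if cand > score[b]:
                    score[b], prev[b] = cand, a
    best = max(range(n), key=score.__getitem__)
    out = []
    while best != -1:
        out.append(anchors[best])
        best = prev[best]
    return out[::-1]


ref = "ACGTACGGTCAGTTGACCATGCAATCG"
query = ref[4:16]
print(chain(find_anchors(ref, query, k=4), k=4))
```

The result bounds in the abstract concern how k should grow with n so that spurious anchors stay rare while enough true anchors survive mutation; the sketch only shows the mechanics.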

November 6, 2025 at 5:53 PM

Reposted by Sergey Nurk

Mile Sikic

@msikic.bsky.social

🚀 Looking for talented PhD students!
Join us in 🇸🇬 Singapore for 1-2 years to push the frontiers of AI for Genomics.
Work on:
🧬 Cancer genome reconstruction
🧫 Cancer genome & cell foundation models
💊 RNA drug & mRNA therapeutic design

#AI #Genomics #PhD
1/5

November 4, 2025 at 7:32 AM

Reposted by Sergey Nurk

Ragnar {Groot Koerkamp}

@curiouscoding.nl

Following ish's `filter` and bqtools' `grep`, Sassy now also has initial support for grep and filter!

Grep mode shows all matches, grouped per record, and is meant for human consumption.
Filter mode prints full matching (or non-matching) records to stdout or output files.

Output of `sassy grep -p ACTGGCATGAGAACTGAG -k1 human-genome.fa`. Shows fuzzy matches with up to one error, grouped by file, then by record. The matching part is highlighted with colours (green=match, orange=mismatch, red=delete, blue=insert), and the strandedness of each match, the location, and the cost are shown.
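
As a rough picture of what "matches with up to k errors" means, the sketch below uses the textbook semi-global edit-distance DP (Sellers' algorithm). It is not Sassy's implementation, it searches the forward strand only, and the function name `fuzzy_find` is hypothetical.

```python
def fuzzy_find(pattern, text, k):
    """Return (end_position, cost) for every text position where the pattern
    matches a substring ending there with edit distance <= k."""
    m = len(pattern)
    prev = list(range(m + 1))  # aligning pattern[:i] to empty text costs i
    hits = []
    for pos, ch in enumerate(text):
        cur = [0]  # a match may start anywhere in the text at no cost
        for i in range(1, m + 1):
            cur.append(min(
                prev[i - 1] + (pattern[i - 1] != ch),  # match or mismatch
                prev[i] + 1,                           # extra base in text
                cur[i - 1] + 1,                        # pattern base skipped
            ))
        if cur[m] <= k:
            hits.append((pos + 1, cur[m]))
        prev = cur
    return hits


print(fuzzy_find("ACGT", "TTACGTTT", 0))
print(fuzzy_find("ACGT", "TTACCTTT", 1))
```

A real tool would also report start positions and the alignment itself, which requires a traceback over the DP columns.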

October 30, 2025 at 11:46 PM

Reposted by Sergey Nurk

Ben Langmead

@benlangmead.bsky.social

Very excited about Movi 2! Excellent work by Mohsen here. FYI, I have a series of 5 videos on the move structure starting with this one: youtu.be/REniD2dKf6A?...

October 21, 2025 at 9:39 PM

Reposted by Sergey Nurk

Claudia Gonzaga-Jauregui

@cgonzagaj.bsky.social

ASHG Plenary Session starting with the awards ceremony honoring Eric Green with the Leadership Award of @geneticssociety.bsky.social reflecting on his career in human genetics & genomics leading the Human Genome Project & the NHGRI and the leadership principles he has learned throughout #ASHG25

October 16, 2025 at 8:47 PM

Reposted by Sergey Nurk

Adam Phillippy

@aphillippy.bsky.social

The T2T zebra finch genome has hatched! 🐣 🧬 @vertebrategenomes.bsky.social

bioRxiv Genomics @biorxiv-genomic.bsky.social · Oct 15

The complete genome of a songbird https://www.biorxiv.org/content/10.1101/2025.10.14.682431v1

October 15, 2025 at 1:08 PM

Reposted by Sergey Nurk

Ewan Birney

@ewanbirney.bsky.social

I am hiring! - looking for a Staff Scientist to co-run my research group with me. Staff Scientist is a senior professional scientist role at EMBL. Please forward to people you might know who could be interested! embl.wd103.myworkdayjobs.com/en-US/EMBL/j...

Staff Scientist

About EMBL-EBI EMBL’s European Bioinformatics Institute is a data powerhouse, utilised on a global scale to advance scientific discovery through bioinformatics and solutions to some of the world’s mos...

embl.wd103.myworkdayjobs.com

October 10, 2025 at 7:30 AM

Reposted by Sergey Nurk

Rob Patro

@robp.bsky.social

The Metagraph paper is out in Nature; it showed up in my feeds today! Congratulations to Mikhail Karasikov, @gxxxr.bsky.social, @akkah21.bsky.social and all of the other authors (whom I'd love to follow on Bluesky if I can find you ;P) www.nature.com/articles/s41...

Efficient and accurate search in petabase-scale sequence repositories - Nature

MetaGraph enables scalable indexing of large sets of DNA, RNA or protein sequences using annotated de Bruijn graphs.

www.nature.com

October 9, 2025 at 2:40 PM

Reposted by Sergey Nurk

Ben Langmead

@benlangmead.bsky.social

I've added 7 videos to my Burrows-Wheeler indexing playlist (www.youtube.com/playlist?lis...), rounding out the r-index series and adding a 5-part series on the move structure. Now 27 videos in that playlist. I aim to add videos on prefix-free parsing, PBWT, Wheeler languages/automata in the future.

Burrows-Wheeler Indexing - YouTube

Videos on : (a) the Burrows-Wheeler Transform (BWT), (b) the FM Index, which uses the BWT to construct a full-text index, (c) Wheeler graphs, (d) r-index, an...

www.youtube.com

October 7, 2025 at 2:17 PM

Reposted by Sergey Nurk

Xian Chang

@xian-chang.bsky.social

🦒Long read giraffe is out!🦒
Mapping long reads to pangenome graphs is ~10x faster than with GraphAligner, with veeery slightly better mapping accuracy, short variant calling, and SV genotyping than GraphAligner or Minimap2

bioRxiv Bioinfo @biorxiv-bioinfo.bsky.social · Oct 1

Rapid, accurate long- and short-read mapping to large pangenome graphs with vg Giraffe https://www.biorxiv.org/content/10.1101/2025.09.29.678807v1

October 2, 2025 at 6:28 AM

Reposted by Sergey Nurk

Heng Li

@lh3lh3.bsky.social

Do you know ~60% of human SVs fall in ~1% of GRCh38? See our new preprint: arxiv.org/abs/2509.23057 and the companion blog post on how we started this project and longdust: lh3.github.io/2025/09/29/o.... Work with Alvin Qin

September 30, 2025 at 2:19 AM

Reposted by Sergey Nurk

Zamin Iqbal

@zaminiqbal.bsky.social

Delighted to see our paper studying the evolution of plasmids over the last 100 years, now out! Years of work by Adrian Cazares, also Nick Thomson @sangerinstitute.bsky.social - this version much improved over the preprint. Final version should be open access, apols.
Thread 1/n

September 25, 2025 at 9:29 PM

Reposted by Sergey Nurk

Adam Phillippy

@aphillippy.bsky.social

Delighted to finally announce a preprint describing the Q100 project! “A complete diploid human genome benchmark for personalized genomics” For which we finished HG002 to near-perfect accuracy: www.biorxiv.org/content/10.1... 🧵[1/14]

A complete diploid human genome benchmark for personalized genomics

Human genome resequencing typically involves mapping reads to a reference genome to call variants; however, this approach suffers from both technical and reference biases, leaving many duplicated and ...

www.biorxiv.org

September 22, 2025 at 5:01 PM

Reposted by Sergey Nurk

Sina Majidian

@sinamajidian.bsky.social

Excited to share our EvANI benchmarking workflow, published in Briefings in Bioinformatics doi.org/10.1093/bib/...
Computing average nucleotide identity (ANI) is neither conceptually nor computationally trivial. Its definition has evolved over years, with different meanings and assumptions (1/5)

Figure 1(A) ANI quantifies the similarity between two genomes. ANI can be defined as the number of aligned positions where the two aligned bases are identical, divided by the total number of aligned bases. Historically, ANI was calculated using a single gene family for multiple sequence alignment. Another approach finds orthologous genes between two genomes and reports the average similarity between their CDSs. This method was later extended to whole-genome alignment by identifying local alignments and excluding supplementary alignments with lower similarity. (B) Different ANI tools employ various approaches in calculating ANI values. ANIm, OrthoANI, and FastANI use aligners to identify homologous regions, whereas Mash uses k-mer hashing to estimate similarities. Only alignments with higher similarity represented by green arrows are included in ANI calculations, while red arrows, corresponding to paralogs, are excluded. (C) The proposed benchmarking method evaluates the performance of different tools using both real and simulated data. It assumes that more distantly related species on the phylogenetic tree should have lower ANI similarities. This is measured by calculating the statistics of Spearman rank correlation. We expect a negative correlation between ANI and the tree distance (scatter plot on the right).
https://academic.oup.com/bib/article/doi/10.1093/bib/bbaf267/8160681
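
The alignment-based definition in panel (A) and the rank-correlation check in panel (C) can be sketched as follows. This is a toy illustration, not the EvANI workflow: it assumes a pairwise alignment is already given as two equal-length gapped strings, it excludes gap columns from the denominator (one of several possible conventions), and the function names are hypothetical.

```python
def ani(aligned_a, aligned_b):
    """Identical aligned bases divided by all aligned (non-gap) columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    cols = [(x, y) for x, y in zip(aligned_a, aligned_b)
            if x != "-" and y != "-"]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)


def ranks(values):
    """1-based ranks, with ties given their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            r[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


print(ani("ACGT-A", "ACCTTA"))  # 4 identical out of 5 aligned columns

# Made-up tree distances and ANI values for three genome pairs.
tree_distance = [0.1, 0.5, 0.9]
ani_percent = [99.0, 90.0, 80.0]
print(spearman(tree_distance, ani_percent))
```

A strongly negative correlation is the behaviour the benchmark expects from a well-behaved ANI tool, since more distant pairs should score lower.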

September 21, 2025 at 3:26 PM

Reposted by Sergey Nurk

Martin Steinegger 🇺🇦

@martinsteinegger.bsky.social

MMseqs2-GPU sets new standards in single query search speed, allows near instant search of big databases, scales to multiple GPUs and is fast beyond VRAM. It enables ColabFold MSA generation in seconds and sub-second Foldseek search against AFDB50. 1/n
📄 www.nature.com/articles/s41...
💿 mmseqs.com

GPU-accelerated homology search with MMseqs2 - Nature Methods

Graphics processing unit-accelerated MMseqs2 offers tremendous speedups for homology retrieval from metagenomic databases, query-centered multiple sequence alignment generation for structure predictio...

www.nature.com

September 21, 2025 at 8:06 AM

Reposted by Sergey Nurk

Seth Stadick

@ducktapeprogrammer.bsky.social

Some very interesting preliminary benchmarks between Stringzilla and some Rust libs from Ash: x.com/ashvardanian...

Including some alignment comparisons.

Ash Vardanian on X: "Let there be https://t.co/KbIf9uMiVC 🧵 So many great libraries in Rust - the perfect soil for a thematic benchmark - across hashing, similarity scoring, search, and sketching Write-up coming, but there are already some indicative intermediate results https://t.co/B4rGLllnXM" / X

Let there be https://t.co/KbIf9uMiVC 🧵 So many great libraries in Rust - the perfect soil for a thematic benchmark - across hashing, similarity scoring, search, and sketching Write-up coming, but there are already some indicative intermediate results https://t.co/B4rGLllnXM

x.com

September 12, 2025 at 12:21 AM

Reposted by Sergey Nurk

Heng Li

@lh3lh3.bsky.social

Now preprinted at arxiv.org/abs/2509.07357

September 10, 2025 at 2:10 AM

Reposted by Sergey Nurk

Jim Shaw

@jimshaw.bsky.social

Preprint out for myloasm, our new nanopore / HiFi metagenome assembler!

Nanopore's getting accurate, but

1. Can this lead to better metagenome assemblies?
2. How, algorithmically, to leverage them?

with co-author Max Marin @mgmarin.bsky.social, supervised by Heng Li @lh3lh3.bsky.social

1 / N

bioRxiv Bioinfo @biorxiv-bioinfo.bsky.social · Sep 7

High-resolution metagenome assembly for modern long reads with myloasm https://www.biorxiv.org/content/10.1101/2025.09.05.674543v1

September 7, 2025 at 11:35 PM

Reposted by Sergey Nurk

Zamin Iqbal

@zaminiqbal.bsky.social

For anyone who has used pling for comparing plasmids using rearrangement distances ("how many structural events apart are these plasmids"), here's how to tweak parameters, and integrate it with typing info, and the host phylogeny
www.biorxiv.org/content/10.1...
github.com/iqbal-lab-or...

Clustering of plasmid genomes for genomic epidemiology by using rearrangement distances, with pling

Integration of plasmids into genomic epidemiology is challenging, because there are no clearly defined evolving-units (equivalent to species), and because plasmids appear to evolve as much by structur...

www.biorxiv.org

September 7, 2025 at 2:56 PM

Reposted by Sergey Nurk

bioRxiv Bioinfo

@biorxiv-bioinfo.bsky.social

High-resolution metagenome assembly for modern long reads with myloasm https://www.biorxiv.org/content/10.1101/2025.09.05.674543v1

September 7, 2025 at 4:47 AM

Reposted by Sergey Nurk

Rayan Chikhi

@rayanchikhi.bsky.social

🌎👩‍🔬 For 15+ years biology has accumulated petabytes (million gigabytes) of🧬DNA sequencing data🧬 from the far reaches of our planet.🦠🍄🌵

Logan now democratizes efficient access to the world’s most comprehensive genetics dataset. Free and open.

doi.org/10.1101/2024...

September 3, 2025 at 8:39 AM

Reposted by Sergey Nurk

Giulio Ermanno Pibiri

@jermp.bsky.social

We are glad to announce that the next workshop “Data Structures in Bioinformatics” (DSB 2026) will take place in Venice, Italy, on *February 18-19*, 2026. dsb-meeting.github.io/DSB2026/ Book the dates! #DSB26

DSB 2026 Venice - February 18-19

Workshop Data Structures in Bioinformatics

dsb-meeting.github.io

September 1, 2025 at 6:10 PM

Reposted by Sergey Nurk

Shawn Burgess

@burgesslab.bsky.social

#zebrafish genome update, our T2T assembly of the inbred strain of AB (M-AB) generated by my buddy Nori Sakai has now been released at NCBI and will be a second reference genome for zebrafish (GRCz12ab):

JBQAYU000000000.1 Danio rerio :: NCBI

www.ncbi.nlm.nih.gov

August 15, 2025 at 4:17 PM

Reposted by Sergey Nurk

Yo Akiyama

@yoakiyama.bsky.social

Excited to share work with
Zhidian Zhang, @milot.bsky.social, @martinsteinegger.bsky.social, and @sokrypton.org
biorxiv.org/content/10.1...
TLDR: We introduce MSA Pairformer, a 111M parameter protein language model that challenges the scaling paradigm in self-supervised protein language modeling🧵

Scaling down protein language modeling with MSA Pairformer

Recent efforts in protein language modeling have focused on scaling single-sequence models and their training data, requiring vast compute resources that limit accessibility. Although models that use ...

biorxiv.org

August 5, 2025 at 6:31 AM
