Reducing haystacks to needles - ViralClust: A Nextflow pipeline to cluster viral sequences
The rapid accumulation of viral genome sequences presents major challenges for downstream analysis tools, including tools for multiple sequence alignment, phylogeny, and genome/alignment visualization, due to computational constraints and sampling biases caused by outbreak-driven over-representation. Selecting representative genomes through clustering offers a principled alternative to random subsampling, yet choosing an appropriate clustering strategy remains non-trivial and context dependent. Here, we present ViralClust, a modular Nextflow pipeline for bias-aware representative selection from large viral genome datasets. ViralClust integrates five distinct clustering algorithms (CD-HIT-EST, SUMACLUST, VSEARCH, MMseqs2, and HDBSCAN) within a unified workflow, enabling direct comparison of clustering outcomes and flexible adaptation to diverse biological questions while aiming for a balanced phylogenetic distribution of the selected sequences. We evaluated ViralClust on six RNA and DNA virus datasets ranging from 632 to 156,586 sequences and spanning genome lengths from 890 to 197,185 nucleotides. Across all datasets, clustering reduced dataset size by 95% or more while preserving genetic diversity across species, genera, and families and effectively mitigating biases introduced by outbreaks, partial genomes, and sequence orientation artifacts. By supporting whole-genome clustering and scalable representative selection, ViralClust enables efficient and reproducible downstream analyses that would otherwise be computationally infeasible. Our framework provides a flexible foundation for large-scale viral genomics and supports future applications in comparative analysis and virus classification.

### Competing Interest Statement

The authors have declared no competing interest.
Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy, EXC 2051 - Project-ID 390713860

BMBF-funded project ADAPTI-M, Project-ID 031L0322H

Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under NFDI4Microbiota, NFDI 28/1 - Project-ID 460129525