Matt McGuffie
@mcg.bio

bioinformatics @Twist

formerly: bioinformatics lead at plasmidsaurus, PhD at University of Texas at Austin

bioinformatics, bacteriology, phages, engineered plasmids, synthetic biology, insects
Reposted by Matt McGuffie
The average nucleotide identity (ANI) underpins how we map microbial diversity, compare species, and connect genomes to ecology.
I wrote a short piece reflecting on the discovery and significance of this metric (and really enjoyed digging into the context and story behind it!) #microsky 🧬
Average nucleotide identity — the backbone of modern ecological genomics - Nature Reviews Genetics
In this Journal Club, Luis Orellana recalls a 2005 publication by Konstantinidis and Tiedje that introduced average nucleotide identity as a sequence-based metric to determine the relatedness between ...
www.nature.com
October 30, 2025 at 9:11 PM
Reposted by Matt McGuffie
Unannotated translation products are widespread in model E. coli | bioRxiv https://www.biorxiv.org/content/10.1101/2025.09.25.678689v1?rss=1
Genomes contain orders of magnitude more open reading frames (ORFs) than known protein coding genes, and recent work suggests there may be unannotated proteins present in even the best studied organisms. To address this gap, we used a high throughput reverse genetic toolkit to construct precise C-terminal fusions of a reporter (and control) to >120,000 ORFs in model E. coli. We found hundreds of unannotated significant hits, and individually detected >50 novel polypeptides by western blot, including ORFs within tRNA loci. Many ORFs overlap annotated genes in the sense orientation, and we found these are likely chimeric polypeptides produced by ribosomal frameshifting. Using degron based knockdowns, we identified unannotated proteins that have putative fitness effects, and we found a novel small protein that displays phenotypes consistent with a role in the mRNA degradosome. The observation of a range of unannotated translation products should lead to better annotation and understanding of the bacterial domain of life and motivates the continued exploration of genomes broadly. Competing Interest Statement: The authors have declared no competing interest.
www.biorxiv.org
September 27, 2025 at 4:14 AM
Reposted by Matt McGuffie
C. elegans is a real animal and we set out to understand how it comes to have its distinctive biogeography. Its ancestral center of diversity is in the higher elevation forests of Hawaii. Its closest relatives are spread across east Asia. Did they travel from Asia? [Preprint 🧵]
September 24, 2025 at 8:33 PM
Reposted by Matt McGuffie
Heads up: ignore samtools dot org, similarly minimap2 dot com and likely others. It's owned by a known phishing site and while the binaries they offer look valid currently (but note they may be serving us different binaries to others), that could change.

I.e., it's not us (the Samtools team)! Be warned.
September 15, 2025 at 8:40 AM
Reposted by Matt McGuffie
tgv 0.1.0 release: github.com/zeqianli/tgv
- Rich CIGAR and base visualization
- Allele frequency visualization
- VCF and BED file support
- Mouse dragging and hovering
- Filter alignment

Now 90% of what I need from IGV can be done in the terminal.

Some interesting behind-the-scenes:
September 7, 2025 at 11:47 PM
Reposted by Matt McGuffie
Excited to share our new preprint for the tskit_arg_visualizer Python package! ARGs can sometimes feel like a black box, so
@yanwong.bsky.social and I have been developing a method to programmatically draw these graphs.

🔗 arxiv.org/abs/2508.03958

1/6
tskit_arg_visualizer: interactive plotting of ancestral recombination graphs
Summary: Ancestral recombination graphs (ARGs) are a complete representation of the genetic relationships between recombining lineages and are of central importance in population genetics. Recent brea...
arxiv.org
August 19, 2025 at 2:12 PM
Reposted by Matt McGuffie
August 11, 2025 at 10:30 AM
Reposted by Matt McGuffie
#NatMicroPicks

Hidden microbial world in trees🌳

Living wood hosts trillions of bacteria, making trees complex ecosystems with major roles in forest health and function.

#PlantMicro #MicroSky

www.nature.com/articles/s41...
A diverse and distinct microbiome inside living trees - Nature
Microbiome analyses of living trees show that a single tree can host approximately one trillion bacteria, with microbial communities distinctly partitioned between heartwood and sapwood and with minim...
www.nature.com
August 8, 2025 at 2:28 PM
Reposted by Matt McGuffie
StrainR2 accurately deconvolutes strain-level abundances in synthetic microbial communities. #Metagenomics #StrainLevelAbundance #Bioinformatics
academic.oup.com/bioinformati...
August 9, 2025 at 6:05 PM
Reposted by Matt McGuffie
Protein language models reveal evolutionary constraints on synonymous codon choice
#rnasky #microsky "cotranslational localization and translational accuracy, more than cotranslational protein folding, are major drivers of selective pressure on codon choice" in yeast here 💫
doi.org/10.1101/2025...
Evolution has shaped the genetic code, with subtle pressures leading to preferences for some synonymous codons over others. Codons are translated at different speeds by the ribosome, imposing constrai...
doi.org
August 9, 2025 at 6:31 PM
Reposted by Matt McGuffie
taxburst v0.3.0 is now released - this is an update of the Krona visualization system for microbiome/metagenome taxonomy analyses. Enjoy!
Announcing taxburst, an update of the Krona software for taxonomy exploration
Announcing taxburst for metagenome taxonomy!
ivory.idyll.org
August 8, 2025 at 2:19 PM
Reposted by Matt McGuffie
Excited to share work with
Zhidian Zhang, @milot.bsky.social, @martinsteinegger.bsky.social, and @sokrypton.org
biorxiv.org/content/10.1...
TLDR: We introduce MSA Pairformer, a 111M parameter protein language model that challenges the scaling paradigm in self-supervised protein language modeling🧵
Scaling down protein language modeling with MSA Pairformer
Recent efforts in protein language modeling have focused on scaling single-sequence models and their training data, requiring vast compute resources that limit accessibility. Although models that use ...
biorxiv.org
August 5, 2025 at 6:31 AM
Reposted by Matt McGuffie
Fun new tool from Heng Li. Thinking maybe I can use this to help find plasmid replication gene correlated repeat regions - though he specifically mentions it's not for tandem repeat regions. Hmm. 🖥️🧬

github.com/lh3/longdust
GitHub - lh3/longdust: Identify long STRs, VNTRs, satellite DNA and other low-complexity regions in a genome
github.com
August 1, 2025 at 11:34 AM
Reposted by Matt McGuffie
🚨 New preprint 🚨

My phage annotation tool, Phynteny, finally has a preprint and a brand new version powered by a cool AI transformer architecture and protein language models! #phagesky

www.biorxiv.org/content/10.1...
Synteny-aware functional annotation of bacteriophage genomes with Phynteny
Accurate genome annotation is fundamental to decoding viral diversity and understanding bacteriophage biology; yet, the majority of bacteriophage genes remain functionally uncharacterised. Bacteriopha...
www.biorxiv.org
July 30, 2025 at 6:01 AM
Reposted by Matt McGuffie
A bit late to joining the Bluesky party, but it's great to see all the amazing scientists who are on this platform! Looking forward to connecting with all of you here (on twitter as @niranjantw ... so keeping the handle consistent).
April 28, 2025 at 1:13 AM
Reposted by Matt McGuffie
AFESM: a metagenomic guide through the protein structure universe! We clustered 821M structures (AFDB&ESMatlas) into 5.12M groups; revealing biome-specific groups, only 1 new fold even after AlphaFold2 re-prediction & many novel domain combos. 🧵
🌐 afesm.foldseek.com
📄 www.biorxiv.org/content/10.1...
April 27, 2025 at 12:13 AM
Reposted by Matt McGuffie
I'll be presenting our work on hyper-k-mers at #RECOMB today at 10:40 KST!

You can get a sneak peek at the slides here: igor.martayan.org/slides-recom...

Come say hi if you'd like to chat, or just get one of these cute stickers!
April 26, 2025 at 10:23 PM
Reposted by Matt McGuffie
Assemblies of long-read metagenomes suffer from diverse errors https://www.biorxiv.org/content/10.1101/2025.04.22.649783v1
April 25, 2025 at 12:46 AM
Reposted by Matt McGuffie
Uncalled4: a toolkit for nanopore signal alignment, analysis and visualization of DNA and RNA modifications.

www.nature.com/articles/s41...
March 28, 2025 at 5:23 PM
Reposted by Matt McGuffie
Wow, telomeric transposons in bacteria with linear chromosomes! (Of course, this was first figured out in flies, including by Bob Levis, who I was happy to see a few days ago at the fly meeting.) 🪰

www.science.org/doi/10.1126/...

www.sciencedirect.com/science/arti...

www.sciencedirect.com/science/arti...
Telomeric transposons are pervasive in linear bacterial genomes
Eukaryotes have linear DNA, and their telomeres are hotspots for transposons, which in some cases took over telomere maintenance. We identified several families of independently evolved telomeric tran...
www.science.org
March 27, 2025 at 8:55 PM
Reposted by Matt McGuffie
Do you (like me) create a bunch of conda environments, then later forget what they're for, when they were last updated, or which tools are in them?

If so, you might like this little project: github.com/rrwick/conda...
GitHub - rrwick/condaenvlist: a simple tool for listing conda environments with descriptions
github.com
March 27, 2025 at 4:34 AM
Reposted by Matt McGuffie
longcallD is a new variant caller for genomic long reads. It jointly calls phased small and structural variants. Single binary, one command line for the whole process. Comparable accuracy to mainstream callers. Great work by Yan Gao. github.com/yangao07/lon...
GitHub - yangao07/longcallD: A local-haplotagging-based small and structural variant caller
github.com
March 24, 2025 at 4:53 PM