Lightnews — Scholar-powered news

@dgautheret.bsky.social

Trying to make sense of #RNA information.

Stay tuned: We are now running Metapuccino on SRA’s 1 million human transcriptomes.

November 2, 2025 at 10:14 AM

This manuscript covers the full methodology and discusses the limits of NLP and LLMs for NGS metadata completion.

November 2, 2025 at 10:14 AM

Usability was a top priority: Metapuccino runs on regular computers with open-source LLMs, but can also scale up on GPUs for large datasets. All it needs is a list of SRA IDs — no pre-processed tables required.

November 2, 2025 at 10:14 AM

Fiona Hak developed a clever LLM training strategy using the hardest SRA cases — the fine-tuned model is available on Hugging Face.

November 2, 2025 at 10:14 AM

Metapuccino fills and standardizes 19 key SRA metadata fields in human transcriptomics, using rule-based NLP and a large language model (LLM).

November 2, 2025 at 10:14 AM

Even simple tasks, like selecting tumor vs. normal samples for a cancer type, require expert curation across multiple tables, protocols, and abstracts.

November 2, 2025 at 10:14 AM

NCBI’s SRA is a fantastic resource for studying the human transcriptome. But its metadata is messy — over 70% of fields are empty, and information is often inconsistent.

November 2, 2025 at 10:14 AM
