@dgautheret.bsky.social
Trying to make sense of #RNA information.
Stay tuned: We are now running Metapuccino on SRA’s 1 million human transcriptomes.
November 2, 2025 at 10:14 AM
The manuscript covers the full methodology and discusses the limits of NLP and LLMs for NGS metadata completion.
Usability was a top priority: Metapuccino runs on regular computers with open-source LLMs, but can also scale up on GPUs for large datasets. All it needs is a list of SRA IDs — no pre-processed tables required.
Fiona Hak developed a clever LLM training strategy using the hardest SRA cases — the fine-tuned model is available on Hugging Face.
Metapuccino fills and standardizes 19 key SRA metadata fields in human transcriptomics, using rule-based NLP and a large language model (LLM).
Even simple tasks, like selecting tumor vs. normal samples for a cancer type, require expert curation across multiple tables, protocols, and abstracts.
NCBI’s SRA is a fantastic resource for studying the human transcriptome. But its metadata is messy — over 70% of fields are empty, and information is often inconsistent.