Lightnews — Scholar-powered news

Chris Mungall

@cmungall.bsky.social

To fine-tune the reasoner model, the authors used three kinds of soft verifiers in the RL loop: experimental (e.g. CRISPRi knockdown), "simulation" (e.g. Transcriptformer), and knowledge-based. For knowledge-based, they used GO @geneontology.bsky.social!

Table 1. Verifiers used during RL training and their descriptions, as well as example prompts.
Verifiers: "Exp" is experimental; "MLP" is multi-layer perceptron; "TF" is Transcriptformer; "GO" is Gene Ontology.

August 25, 2025 at 12:41 AM

Chris Mungall

@cmungall.bsky.social

The applications of this are very interesting, allowing for interrogation in natural language, as well as background reasoning over the wealth of biology in the literature. So you can ask what happens to other genes if you knock down a gene in a cell type, and get a biological explanation.

Fig 7 from paper - The figure shows three example responses to the query “Is a knockdown of ISCA2 in RPE1 cells likely to result in differential expression of CEP295?”, each demonstrating different reasoning strategies.

Basic answer: ISCA2 is linked to cell cycle progression and DNA repair, so its knockdown could affect CEP295 expression, though experimental data would be needed to confirm directionality.

Chain-of-Thought: Provides background—ISCA2 is involved in cell cycle regulation; CEP295 in cilia formation. Knockdown of ISCA2 may influence cell cycle–related genes but there’s no direct evidence connecting it to CEP295 regulation.

Self-aware Chain-of-Thought: Notes ISCA2’s role in autophagy and related processes, but emphasizes that its relationship to CEP295 is indirect. Suggests that literature review would be required for confirmation, and stresses the absence of direct experimental evidence, while acknowledging possible indirect effects.

Overall, all answers converge on the idea that ISCA2 knockdown could plausibly influence CEP295 but highlight the uncertainty and need for direct experimental validation.

August 25, 2025 at 12:41 AM

Chris Mungall

@cmungall.bsky.social

@severaltimes.bsky.social talking about the BioPortal MCP at #BOSC2025 / #BOKR2025 #ISMBECCB2025

July 22, 2025 at 1:31 PM

Chris Mungall

@cmungall.bsky.social

And thank you to the ENCODE team who laid the groundwork for this AI work ten years ago, and took such care with annotating using standard ontologies pmc.ncbi.nlm.nih.gov/articles/PMC...

June 27, 2025 at 6:04 PM

Chris Mungall

@cmungall.bsky.social

And the example Colab notebook shows how you can use UBERON terms in API calls to explore tissue specificity. Nice! colab.research.google.com/github/googl...

code segment from Colab notebook:

output = dna_model.predict_sequence(
    sequence='GATTACA'.center(2048, 'N'),  # Pad to valid sequence length.
    requested_outputs=[dna_client.OutputType.DNASE],
    ontology_terms=['UBERON:0002048'],  # Lung.
)

June 27, 2025 at 6:04 PM

Chris Mungall

@cmungall.bsky.social

De Crécy-Lagard showed that when the DL approach was used on proteins that differed from those in the training set (the "unknome"), many of the predicted functions were biologically implausible or impossible, based on prior "deep knowledge" of microbial gene function, and hence likely wrong.

Table from https://www.biorxiv.org/content/10.1101/2024.07.01.601547v2.full showing classes of error types

June 6, 2025 at 4:49 AM

Chris Mungall

@cmungall.bsky.social

Here are the results with the quotes. It’s reasonable to assume something like these snippets is included in the prompt, which will confound a simple LLM (even though to us it’s obvious they are unrelated).

Search results including quotation of “the crow”. Result snippets include mention of “the crow”.

January 4, 2025 at 5:40 AM

Chris Mungall

@cmungall.bsky.social

Hey @Sunbasketmeals your delivery company forgot to configure the RAG on their rubbish AI chatbot

December 7, 2024 at 10:34 PM

Chris Mungall

@cmungall.bsky.social

Glad to see our OntoGPT/SPIRES paper finally out in Bioinformatics!
academic.oup.com/bioinformati.... Great work from Harry Caufield who led the study, and all the authors. SPIRES uses a schema and ontology driven approach to extract complex knowledge nuggets from text.

Depiction of schema and ontology-driven workflow of SPIRES. Inputs are a LinkML schema and Text. OntoGPT will recurse through the schema iteratively parsing using an LLM and finally grounding using ontologies specified by LinkML value sets.

February 22, 2024 at 10:41 PM

Chris Mungall

@cmungall.bsky.social

Hey @Docker, what's with the sudden revoking of our sponsored open source subscription? We are getting this for @linkml_data, and colleagues are getting the same thing for @OBOFoundry. 🙏 for the O/S subscription, but more advance warning of its cancelation would have been nice☹️

December 7, 2024 at 10:34 PM

Chris Mungall

@cmungall.bsky.social

The idea here is to extract structured information from free text, e.g. a description of a person into a LinkML schema such as the tutorial PersonInfo schema https://github.com/linkml/linkml/blob/main/examples/PersonSchema/personinfo.yaml

December 7, 2024 at 10:44 PM
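To make the idea in the post above concrete, here is a minimal sketch of schema-driven extraction. It is not the OntoGPT/SPIRES implementation: the `Person` dataclass stands in for a LinkML class, its three fields are an assumed subset of the tutorial PersonInfo schema, and `build_prompt` and `parse_response` are hypothetical helpers. No LLM is called; the response is a hand-written string.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Person:
    # Assumed subset of fields, for illustration only.
    name: Optional[str] = None
    age_in_years: Optional[int] = None
    primary_email: Optional[str] = None


def build_prompt(text: str) -> str:
    """Turn the schema fields into a fill-in template followed by the text."""
    template = "\n".join(
        f"{f.name}: <{f.name.replace('_', ' ')}>" for f in fields(Person)
    )
    return (
        "From the text below, extract the following fields in this format:\n\n"
        f"{template}\n\nText:\n{text}\n\n===\n"
    )


def parse_response(response: str) -> Person:
    """Parse 'field: value' lines, ignoring fields the schema does not define."""
    known = {f.name for f in fields(Person)}
    values = {}
    for line in response.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if sep and key in known and value:
            values[key] = value
    if "age_in_years" in values:
        try:
            values["age_in_years"] = int(values["age_in_years"])
        except ValueError:
            # Drop values that do not fit the declared integer type.
            del values["age_in_years"]
    return Person(**values)


# Hand-written stand-in for a model response.
person = parse_response("name: Ada Lovelace\nage_in_years: 36\nhobby: maths")
```

Constraining the response to field names taken from the schema is what lets the parser discard anything the schema does not define, such as the `hobby` line here.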

Chris Mungall

@cmungall.bsky.social

It also turns out the latent GPT KB ("no synopsis") method has an unfair advantage. For larger gene sets, synopses get truncated due to prompt length constraints. When we control for this and look at only gene sets <75 genes in size, ontological descriptions emerge as the winner!

December 7, 2024 at 11:56 PM

Chris Mungall

@cmungall.bsky.social

Our first results show that, on the one hand, GPT does pretty well: when using gpt-3.5-turbo, most of the terms returned are actually statistically significant (0.65 when using RefSeq summaries). However, we rarely saw the most informative term included.

December 7, 2024 at 11:46 PM
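One way to read the two observations in the post above is as two separate measures: the proportion of returned terms that are significant, and whether the top-ranked significant term was returned at all. The sketch below assumes that reading; the function names are hypothetical, the term IDs are made-up placeholders, and it is not claimed to be the exact evaluation used in the study.

```python
def proportion_significant(predicted, significant):
    """Fraction of distinct predicted terms found in the significant set."""
    predicted = list(dict.fromkeys(predicted))  # de-duplicate, keep order
    if not predicted:
        return 0.0
    hits = sum(1 for term in predicted if term in significant)
    return hits / len(predicted)


def includes_top_term(predicted, ranked_significant):
    """True if the most significant term (first in the ranking) was predicted."""
    return bool(ranked_significant) and ranked_significant[0] in set(predicted)


# Placeholder IDs, not real ontology terms.
predicted = ["T1", "T2", "T3", "T1"]
ranked = ["T9", "T1", "T2"]  # most significant first
share = proportion_significant(predicted, set(ranked))
has_top = includes_top_term(predicted, ranked)
```

A method can score well on the first measure and still miss the single most informative term, which is the pattern the post describes.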

Chris Mungall

@cmungall.bsky.social

Here is an example of SPINDOCTOR results overlaid on top of bona fide GO enrichment results for sensory ataxia genes (significant terms indicated with Bonferroni-adjusted p-values). Gene description sources are shown as boxes (ONT=ontological description, NAR=narrative RefSeq, NS=no summary).

December 7, 2024 at 11:31 PM
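For readers unfamiliar with the baseline mentioned in the post above, this is a minimal sketch of a standard over-representation test with Bonferroni adjustment, using only the Python standard library. It assumes a one-sided hypergeometric test; the tool and settings used for the actual GO enrichment are not stated in the post, and the gene and term names below are toy placeholders.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def enrich(gene_set, annotations, background, alpha=0.05):
    """Return (term, overlap, adjusted_p, significant) sorted by adjusted p."""
    background = set(background)
    gene_set = set(gene_set) & background
    n_tests = len(annotations)  # Bonferroni: multiply by number of terms tested
    results = []
    for term, annotated in annotations.items():
        annotated = set(annotated) & background
        overlap = len(gene_set & annotated)
        p = hypergeom_sf(overlap, len(background), len(annotated), len(gene_set))
        adjusted = min(1.0, p * n_tests)
        results.append((term, overlap, adjusted, adjusted < alpha))
    return sorted(results, key=lambda r: r[2])


# Toy data: 20 background genes, 5 query genes, 2 candidate terms.
background = [f"g{i}" for i in range(20)]
query = background[:5]
annotations = {"termA": background[:5], "termB": background[10:15]}
results = enrich(query, annotations, background)
```

Bonferroni is the most conservative common correction, which is why only strongly over-represented terms survive it.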

Chris Mungall

@cmungall.bsky.social

We created a tool called SPINDOCTOR that performs summarization of gene sets. The idea is simple: after normalizing the input gene sets, it retrieves external gene descriptions and generates a prompt, which is fed to @OpenAI. The results are then parsed to ontology terms.

December 7, 2024 at 11:05 PM
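The steps named in the post above (normalize, retrieve descriptions, build a prompt, parse the result to ontology terms) can be sketched as follows. This is an illustration, not the SPINDOCTOR code: the lookup tables, prompt wording, and function names are all assumptions, the model call is replaced by a hand-written response string, and grounding is reduced to an exact label match.

```python
import re

# Hypothetical lookup tables; a real tool would query external resources.
SYNONYMS = {"P53": "TP53"}
DESCRIPTIONS = {
    "TP53": "Tumor suppressor involved in the response to DNA damage.",
    "BRCA1": "Involved in DNA repair by homologous recombination.",
}
LABEL_TO_ID = {"dna repair": "GO:0006281", "apoptotic process": "GO:0006915"}


def normalize(genes):
    """Upper-case symbols, map synonyms, and drop duplicates in order."""
    seen, result = set(), []
    for gene in genes:
        symbol = gene.strip().upper()
        symbol = SYNONYMS.get(symbol, symbol)
        if symbol and symbol not in seen:
            seen.add(symbol)
            result.append(symbol)
    return result


def build_prompt(genes):
    """One description line per gene, then an instruction to list terms."""
    lines = ["Summarize the shared functions of these genes "
             "as a semicolon-separated list of terms.", ""]
    for gene in genes:
        description = DESCRIPTIONS.get(gene, "no description available")
        lines.append(f"{gene}: {description}")
    lines += ["", "Terms:"]
    return "\n".join(lines)


def parse_terms(response):
    """Split the response on ';' or newlines and lower-case each term."""
    return [p.strip().lower() for p in re.split(r"[;\n]", response) if p.strip()]


def ground(terms):
    """Pair each term with an ontology ID, or None if no label matches."""
    return [(term, LABEL_TO_ID.get(term)) for term in terms]


genes = normalize([" p53", "BRCA1", "tp53"])
prompt = build_prompt(genes)
# Stand-in for the model's reply.
grounded = ground(parse_terms("DNA repair; Apoptotic process; cell stuff\n"))
```

Terms that come back with `None` are the ones the model produced but the ontology lookup could not match.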

Chris Mungall

@cmungall.bsky.social

Now all of a sudden everyone is asking their biological questions of ChatGPT and other AI agents. You can even feed it a list of gene symbols, and say "what's going on with all these genes?". And it will give you a plausible answer!

December 7, 2024 at 10:54 PM

Chris Mungall

@cmungall.bsky.social

Are #LLMs capable of interpreting the results of high-throughput genomics experiments? Given a list of genes (e.g. all genes overexpressed under a certain condition), can an LLM tell us what those genes have in common, suggesting underlying biological mechanisms? 🧵

December 7, 2024 at 10:34 PM

Chris Mungall

@cmungall.bsky.social

Qualitatively, the results are variable but usually interesting. We learned a bit about controlling hallucinations. Extraction tasks are in general less prone to this than querying the GPT "knowledge base" directly, but this all feels more art than science at the moment...

December 7, 2024 at 11:20 PM

Chris Mungall

@cmungall.bsky.social

As an optional next step, this can be further transformed into an OWL TBox and reasoned over, allowing results to be auto-classified and validated for logical inconsistencies.

December 7, 2024 at 11:15 PM

Chris Mungall

@cmungall.bsky.social

This results in a structured nested document (YAML or RDF) conforming to the @linkml_data schema. Results are highly variable but usually informative about gaps in ontologies. Here we can see #FoodOn has good coverage but still some gaps...

December 7, 2024 at 11:10 PM

Chris Mungall

@cmungall.bsky.social

SPIRES stands for Structured Prompt Interrogation and Recursive Extraction of Semantics. It's geared toward rich schemas (> just Relation Extraction). For example, a recipe or a biological pathway doesn't really fit into a flat TSV structure, so instead we break it into nested classes.

December 7, 2024 at 10:44 PM

Chris Mungall

@cmungall.bsky.social

Here's our pre-print describing our GPT-3 based knowledge extraction tool SPIRES: https://arxiv.org/abs/2304.02711. Great work from @harry_caufield et al! SPIRES allows you to specify a knowledge schema (in @linkml_data) and then populate instances of that schema from unstructured...

December 7, 2024 at 10:34 PM

Chris Mungall

@cmungall.bsky.social

And special thanks to @figgyjam for the awesome logo!

December 7, 2024 at 11:10 PM

Chris Mungall

@cmungall.bsky.social

We also have some preliminary support for import and export from the @cytoscape CX format, and for retrieving networks from the awesome @NDExProject (thanks to help from @benjamingyori). See https://github.com/INCATools/ontology-access-kit/pull/479 for more details

December 7, 2024 at 10:49 PM

Chris Mungall

@cmungall.bsky.social

Now you can explore RO (https://oborel.github.io/) like any other ontology, with inverses, domains, and ranges treated as edges. See https://github.com/INCATools/ontology-access-kit/pull/466

December 7, 2024 at 10:44 PM
