Charles Tapley Hoyt
cthoyt.scholar.social.ap.brid.gy
Charles Tapley Hoyt
@cthoyt.scholar.social.ap.brid.gy
Bio/cheminformatician, software developer, open scientist. 🇺🇸 living in Bonn 🇩🇪🇪🇺 (he/him)

Current projects: @NFDI4Chem and @dalia

🌉 bridged from ⁂ https://scholar.social/@cthoyt, follow @ap.brid.gy to interact
Challenges with Semantic Mappings
There are many challenges associated with the curation, publication, acquisition, and usage of semantic mappings. This post examines their philosophical, technical, and practical implications, highlights existing solutions, and describes opportunities for next steps for the community of curators, semantic engineers, software developers, and data scientists who make and use semantic mappings.

### Proliferation of Formats

The first major challenge with semantic mappings is the variety of forms they can take. This includes both different data models and different serializations of those models. Let’s start with a lightning review (please let me know if I missed something):

* The Simple Knowledge Organization System (SKOS) is a data model for RDF to represent controlled vocabularies, taxonomies, dictionaries, thesauri, and other semantic artifacts. It defines several semantic mapping predicates, including for broad matches, narrow matches, close matches, related matches, and exact matches.
* JSKOS (JSON for Knowledge Organization Systems) is a JSON-based extension of the SKOS data model. I recently wrote a post about converting between SSSOM and JSKOS.
* The Web Ontology Language (OWL) is primarily used for ontologies. It has first-class language support for encoding equivalences between classes, properties, or individuals. Other semantic mappings can be encoded as annotation properties on classes, properties, or individuals, e.g., using SKOS predicates.
* The OBO Flat File Format is a simplified version of OWL with macros most useful for curating biomedical ontologies. It has the same abilities as OWL, plus the `xref` macro, which corresponds to `oboInOwl:hasDbXref` relations; these are by nature imprecise and therefore used in a variety of ways.
* The Simple Standard for Sharing Ontological Mappings (SSSOM) is a fit-for-purpose format for semantic mappings between classes, properties, or individuals. SSSOM guides curators towards inputting key metadata that are typically missing from other formalisms and is gaining wider community adoption. Importantly, SSSOM integrates into ontology curation workflows, especially for Ontology Development Kit (ODK) users.
* The Expressive and Declarative Ontology Alignment Language (EDOAL) lives in a similar space to SSSOM, but IMO was much less approachable (cf. XML + Java) and has not seen a lot of traction in the biomedical space.
* OntoPortal has its own data model for semantic mappings that has low metadata precision. I recently wrote a post on converting OntoPortal to SSSOM. OntoPortal would also like to invest more in SSSOM infrastructure if it can organize funding and human resources.
* Wikidata has its own data model for semantic mappings that includes higher precision metadata. I recently wrote a post on mapping between the data models from SSSOM and Wikidata.
* Finally, there’s a long tail of mappings that live in poorly annotated CSV, TSV, Excel, and other formats. Similarly, mappings can live in plain RDF files, e.g., encoded with SKOS predicates, but without high precision metadata.

### Scattered, Partially Overlapping, and Incomplete

Semantic mappings are not centralized, meaning that multiple sources of semantic mappings often need to be integrated to map between two semantic spaces. Even then, these integrated mappings are often incomplete. Using Medical Subject Headings (MeSH) and the Human Phenotype Ontology (HPO) as an example, we can see the following:

1. MeSH doesn’t maintain any mappings to HPO.
2. HPO maintains some mappings as primary mappings.
3. The Unified Medical Language System (UMLS) maintains some mappings as secondary mappings. HPO suggests using UMLS as a supplementary mapping resource.
4. Biomappings maintains some community-curated mappings as secondary mappings.

This actually might not be the best example - it would have been better to show a pair of resources that both partially map to the other. When I first made this chart, I had to engineer the UMLS inference by hand. Eventually, the need to generalize this workflow led to the development of the Semantic Mapping Reasoner and Assembler (SeMRA) Python package, which does this automatically and at scale. The fact that there were missing mappings that even UMLS inference couldn’t retrieve led to establishing the Biomappings project for prediction and semi-automated curation of semantic mappings. The underlying technology stack from Biomappings eventually got spun out to SSSOM Curator and is now fully domain-agnostic.

### Different Precision or Conflicts

Another challenge with semantic mappings arises when different resources have different levels of precision. In the example below, Orphanet uses low-precision mapping predicates (i.e., `oboInOwl:hasDbXref`) while MONDO uses high-precision mapping predicates (i.e., `skos:exactMatch`). It makes sense to take the highest quality mapping in this situation, but having a coherent software stack to do this at scale was the big challenge (solved by SeMRA). This can get a bit dicier when there is conflicting information, for example, if one resource says exact match and another says broader match. In SeMRA, I devised a confidence assessment scheme (which should get its own post later).

### Common Conflations

There are three flavors of conflations that make curating and reviewing mappings difficult that I want to highlight.

#### Different Ontology Encodings

Classes, instances, and properties are mutually exclusive by design. This means that any semantic mappings between them are nonsense, but there are many situations where such mappings might get produced by an automated system or by a curator who is less knowledgeable about the ontological aspect of semantic mappings. There’s also a much more subtle discussion about classes, instances, and metaclasses (see this discussion) that I would set aside. As a concrete example, the Information Artifact Ontology (IAO) has a class that represents the section of a document that contains its abstract: abstract (IAO:0000315). Schema.org has an annotation property whose domain is a creative work and whose range is the text of the abstract itself: schema:abstract. These both have the same label `abstract`, which means it’s possible to conflate them (i.e., accidentally map them).

#### Different Entity Types

The second kind of conflation is even more subtle: when two classes, instances, or properties come from similar but distinct hierarchies. For example, there’s a subtle difference between what is a phenotype and what is a disease. Ontologies are well suited to encoding this subtlety with _axioms_ that can then be used by reasoners. This can become a problem for curating and reviewing semantic mappings because some diseases are named after the phenotype that they present or that causes them. Using MeSH’s disease hierarchy and HPO’s phenotype hierarchy as an example, we can see that Staghorn Calculi (mesh:D000069856) and Staghorn calculus (HP:0033591) should not get mapped.
Many more examples can be produced (which also show there are even more subtleties here) using SSSOM Curator with the command `sssom_curator predict lexical doid hp`. See the SSSOM Curator documentation for more information on the lexical matching workflow.

#### Different Senses

The Basic Formal Ontology (BFO) is an upper-level ontology that is used by many ontologies, including almost the entire Open Biomedical Ontologies (OBO) Foundry. However, as Chris Mungall described in his blog post, Shadow Concepts Considered Harmful, there are many different senses in which an entity can be described, each falling under a different, mutually exclusive branch of BFO. The figure below, from Chris’s post, represents different senses in which a human heart can be described:

This problem is particularly bad in disease modeling. Here are only a few examples (of many more) that illustrate this:

* the Ontology for General Medical Science (OGMS) term for disease (OGMS:0000031), the Experimental Factor Ontology (EFO) term for disease (EFO:0000408), and the Monarch Disease Ontology (MONDO) term for disease (MONDO:0000001) are each a disposition (BFO:0000016)
* the Gender, Sex, and Sexual Orientation Ontology (GSSO) term for disease (GSSO:000486) is a process (BFO:0000015)
* the Human Disease Ontology (DOID) informally mentions that a disease is a disposition, but doesn’t make an ontological commitment to BFO
* many more controlled vocabularies, including NCIT, SNOMED-CT, and MI, have their own terms for diseases but don’t use BFO as an upper-level ontology, nor are they constructed in a way conducive to integration with other ontologies

Schultz _et al._ (2011) proposed a way to formalize the connections between the various senses for diseases in Scalable representations of diseases in biomedical ontologies. However, the OBO community has yet to resolve the long and taxing discussion on how to standardize disease modeling practices.

For semantic mappings, this becomes a problem because a reasoner will explode (i.e., derive an inconsistency) if diseases under two different BFO branches get marked as equivalent, because the BFO upper-level terms are marked as disjoint - this is a feature, not a bug. However, while useful for creating carefully constructed, logically (self-)consistent descriptions of diseases, these modeling choices can be confusing when curating or reviewing mappings. These modeling choices might not be so important in downstream applications, such as assembling a knowledge graph to support graph machine learning, where many different knowledge sources with lower levels of accuracy and precision must be merged. In practice, I have merged triples using conflicting senses for diseases in a useful way, without issue.

### Interpretation is Important

While the last few examples were cautionary tales for when things (probably) shouldn’t be mapped, the next examples are about when things (probably) should be mapped.

#### Definitions

Here are three vocabularies’ terms for proteins and their textual definitions (though many more resources contain their own term for protein):

Entity | Label | Description
---|---|---
wikidata:Q8054 | protein | biomolecule or biomolecule complex largely consisting of chains of amino acid residues
SIO:010043 | protein | A protein is an organic polymer that is composed of one or more linear polymers of amino acids.
PR:000000001 | protein | An amino acid chain that is canonically produced _de novo_ by ribosome-mediated translation of a genetically-encoded mRNA, and any derivatives thereof.
As semantic mapping curators, we have two options:

1. We can reasonably assume that the intent of all three resources was to represent the same thing, despite the definitions being quite different. This assumption can be built on our prior knowledge about what a protein is and why Wikidata, SIO, and PR exist, from which we infer the intent of each definition’s author.
2. We can make a very literal reading of the definitions and conclude that these three terms represent very different things.

I think that the latter is really unconstructive for several reasons, but I have worked with colleagues, especially those with a linguistics background, who take this approach. First, this is unconstructive because it means you’ll probably never map anything. Second, if you want to be rigorous, use an ontology formalism with proper logical definitions. For example, the Cell Ontology (CL) exhaustively defines its cells using appropriate logical axioms. However, this also has a caveat: to make mappings based on logical definitions, the different modelers have to agree on the same axioms and the same modeling paradigm. As far as I know, there aren’t any groups out there that use the same modeling paradigm that haven’t simply combined forces to work on the same resource. So we’re stuck back at option 1 either way :)

#### Context Sometimes Matters

In contrast to the discussion about mapping phenotypes and diseases, there are context-dependent reasons to make semantic mappings, which can be illustrated in biomedicine using genes and proteins. Let’s start with some definitions:

1. SO:0000704 - a gene is a region of a chromosome that encodes a transcript
2. PR:000000001 - a protein is a chain of amino acids

The biomedical literature often uses gene symbols to discuss the proteins they encode. While this isn’t precise, it’s still useful in many cases. Therefore, when reading the COVID-19 literature, you will likely see discussion of the IL6-STAT cascade, where IL6 is the HGNC gene symbol for the Interleukin 6 protein. Most of the time, the HGNC-approved gene symbol is an initialism or other abbreviation of the protein, but this isn’t always the case. Similar to the literature, many pathway databases that accumulate knowledge about the processes and reactions in which proteins take part actually use gene symbols (or other gene identifiers) to curate proteins.

The take-home message here is that genes and proteins are indeed not the same thing, but in some contexts, it’s useful to map between them. There’s also a compromise - the Relation Ontology (RO) has a predicate has gene product (RO:0002205) that explicitly models the relationship between IL6 and Interleukin 6, which can then be automatically inferred to mean a less precise mapping for certain scenarios (SeMRA implements this). Outside of biomedicine, I have also heard that context-specific mappings are very important in the digital humanities. As I better understand the use cases of colleagues in other NFDI consortia that focus on the digital humanities, I will try to update this section with alternate perspectives.

### Evidence

A key challenge that motivated the development of SSSOM as a standard was to associate high-quality metadata with semantic mappings, such as the reason the mapping was produced (e.g., manual curation, lexical matching, structural matching), who produced it (e.g., a person, algorithm, agent), when, how, and more.
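To illustrate, here is a minimal sketch (mine, not from any published mapping set) of a single SSSOM record carrying such evidence metadata, written as TSV from Python; the field names are standard SSSOM slots, and the values are illustrative.

```python
# Minimal sketch: one SSSOM mapping record with evidence metadata.
# Field names are standard SSSOM slots; the values are illustrative.
import csv
import sys

record = {
    "subject_id": "wikidata:Q47512",
    "subject_label": "acetic acid",
    "predicate_id": "skos:exactMatch",
    "object_id": "CHEBI:15366",
    "object_label": "acetic acid",
    "mapping_justification": "semapv:ManualMappingCuration",
    "author_id": "orcid:0000-0003-4423-4370",
    "mapping_date": "2026-01-20",
}

# Write a header row and the record as tab-separated values to stdout
writer = csv.DictWriter(sys.stdout, fieldnames=list(record), delimiter="\t")
writer.writeheader()
writer.writerow(record)
```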
We developed the Semantic Mapping Vocabulary (semapv) to encode different kinds of evidence, such as manual curation of mappings, lexical matching, structural matching, and others. SSSOM is well-suited towards capturing simple evidence (blue).

#### Provenance for Inferences

The purple evidence from the figure in the last section requires a more detailed data model to represent provenance for inferred semantic mappings that simply doesn’t fit in the SSSOM paradigm (and it shouldn’t be hacked in, either). I proposed a more detailed data model for capturing how inference is done in Assembly and reasoning over semantic mappings at scale for biomedical data integration and provided a reference implementation in the Semantic Mapping Reasoner and Assembler (SeMRA) Python software package. Here’s what that data model looks like, which also has a Neo4j counterpart:

### Negative Semantic Mappings

SSSOM also has first-class support for encoding _negative_ relationships, meaning that the following can be represented:

This means that SSSOM curators can keep track of non-trivial negative mappings, e.g., when curating the results of semantic mapping prediction or automated inference. In a semi-automated curation loop, this allows us to avoid re-reviewing zombie mappings over and over again. High quality, non-trivial negative mappings also enable more accurate machine learning, as opposed to using negative sampling. For example, we have been working on developing graph machine learning-based ontology matching and merging using PyKEEN (a graph machine learning package I helped develop and maintain).

An open challenge is that we have support neither from data modeling formalisms (e.g., ontologies in OWL, knowledge graphs in RDF or Neo4j) for encoding negative knowledge (in this case, negative mappings) nor from tooling. This means that when we output SSSOM to RDF, we use our own formalism, which won’t be correctly recognized by any other tooling that wasn’t developed with SSSOM in mind. I’m keeping notes about this in a separate post about negative knowledge that I update periodically.

* * *

Despite the challenges, I think that the mapping world is actually getting quite mature. I am currently working with NFDI and RDA colleagues to further unify the SSSOM and JSKOS worlds, especially given that the Cocoda mapping curation tool solved many of these problems (from the digital humanities perspective) many years ago, and we simply were unaware of it. I hope this post can continue as a living document - if I missed something, please let me know and I will update the post to include it!
cthoyt.com
January 20, 2026 at 1:25 PM
i've written a few blog posts lately on semantic mappings, SSSOM, JSKOS, and automated assembly of data and knowledge. i'm also always very proud to do this by hand, without AI

1. SSSOM and Wikidata: https://cthoyt.com/2026/01/08/sssom-to-wikidata.html
2. SSSOM and JSKOS […]
Original post on scholar.social
scholar.social
January 16, 2026 at 4:26 PM
a data modeling language that errors when _real_ examples aren't given for all fields, all structures, all everythings
January 14, 2026 at 4:03 PM
ever wanted to put semantic mappings in SSSOM into @wikidata

now you can

https://cthoyt.com/2026/01/08/sssom-to-wikidata.html
Mapping from SSSOM to Wikidata
At the 4th Ontologies4Chem Workshop in Limburg an der Lahn, I proposed an initial crosswalk between the Simple Standard for Sharing Ontological Mappings (SSSOM) and the Wikidata semantic mapping data model. This post describes the motivation for this proposal and the concrete implementation I’ve developed in `sssom-pydantic`. This work is part of the NFDI’s Ontology Harmonization and Mapping Working Group, which is interested in enabling interoperability between SSSOM and related data standards that encode semantic mappings.

The TL;DR for this post is that I implemented a mapping from SSSOM to Wikidata in `sssom-pydantic` in cthoyt/sssom-pydantic#32. One high-level entrypoint is a function that reads an SSSOM file and prepares QuickStatements, which can be reviewed in the web browser and then uploaded to Wikidata. This script can be run from Gist with `uv run https://gist.github.com/cthoyt/f38d37426a288989158a9804f74e731a#file-sssom-wikidata-demo-py`

## Semantic Mappings in SSSOM

The Simple Standard for Sharing Ontological Mappings (SSSOM) is a community-driven data standard for semantic mappings, which are necessary to support (semi-)automated data integration and knowledge integration, such as in the construction of knowledge graphs. While SSSOM is primarily a tabular data format that is best serialized as TSV, it uses LinkML to formalize the semantics of each field such that SSSOM can be serialized to and read from OWL, RDF, and JSON-LD. Here’s a brief example:

subject_id | subject_label | predicate_id | object_id | object_label | mapping_justification
---|---|---|---|---|---
wikidata:Q128700 | cell wall | skos:exactMatch | GO:0005618 | cell wall | semapv:ManualMappingCuration
wikidata:Q47512 | acetic acid | skos:exactMatch | CHEBI:15366 | acetic acid | semapv:ManualMappingCuration

## Semantic Mappings in Wikidata

Wikidata has two complementary formalisms for representing semantic mappings. The first uses the exact match (P2888) property with a URI as the object. For example, cell wall (Q128700) maps to the Gene Ontology (GO) term for cell wall by its URI `http://purl.obolibrary.org/obo/GO_0005618`. The second formalism uses semantic space-specific properties (e.g., P683 for ChEBI) with local unique identifiers as the object. For example, acetic acid (Q47512) maps to the ChEBI term for acetic acid using the P683 property for ChEBI and the local unique identifier for acetic acid (within ChEBI), `15366`.

Wikidata has a data structure that enables annotating qualifiers onto triples. Therefore, other parts of semantic mappings modeled in SSSOM can be ported:

1. Authors and reviewers can be mapped from ORCID identifiers to Wikidata identifiers, then encoded using the S50 and S4032 properties, respectively
2. A SKOS-flavored mapping predicate (i.e., exact, narrow, broad, close, related) can be encoded using the S4390 property
3. The publication date can be encoded using the S577 property
4. The license can be mapped from text to a Wikidata identifier, then encoded using the S275 property

Note that properties that normally start with a `P` when used in triples are changed to start with an `S` when used as qualifiers. Other fields in SSSOM could potentially be mapped to Wikidata later.

### Finding Wikidata Properties using the Semantic Farm

The Semantic Farm (previously called the Bioregistry) maintains mappings between prefixes that appear in compact URIs (CURIEs) and their corresponding Wikidata properties. For example, the prefix `CHEBI` maps to the Wikidata property P683.
These mappings can be accessed in several ways:

1. via the Semantic Farm’s SSSOM export (note: this requires subsetting to mappings where Wikidata properties are the object),
2. via the Semantic Farm’s live API, or
3. via the Bioregistry Python package (this will get renamed to match Semantic Farm, eventually) using the following code:

```python
import bioregistry

# get the bulk prefix -> Wikidata property map
prefix_to_property = bioregistry.get_registry_map("wikidata")

# get the property for a single resource
resource = bioregistry.get_resource("chebi")
chebi_wikidata_property_id = resource.get_mapped_prefix("wikidata")
```

## Notable Implementation Details

I’ve previously built two packages that were key to making this work:

1. `wikidata-client`, which interacts with the Wikidata SPARQL endpoint and has high-level wrappers around lookup functionality. I’m also aware of WikidataIntegrator - I’ve contributed several improvements, but working with its codebase doesn’t spark joy, and the last time I tried to use it, it was fully broken due to some of its dependencies not working on modern Python.
2. `quickstatements-client`, which implements an object model for QuickStatements v2 and an API client.

Along the way to this PR, I made improvements to `wikidata-client` in cthoyt/wikidata-client#2 to add high-level functionality for looking up multiple Wikidata records based on values for a property (e.g., to support ORCID lookup in bulk). All other changes were made in `sssom-pydantic` in cthoyt/sssom-pydantic#32.

The other key challenge was to avoid adding duplicate information to Wikidata - unlike with a simple triple store, we could accidentally end up with duplicate statements. Therefore, the `sssom-pydantic` implementation looks up all existing semantic mappings in Wikidata for entities appearing in an SSSOM file, then filters appropriately to avoid uploading duplicate mappings to Wikidata.

## Pulling it All Together

This new module in `sssom-pydantic` implements the following interactive workflows:

1. Read an SSSOM file, convert mappings to the Wikidata schema, then open a QuickStatements tab in the web browser using `read_and_open_quickstatements()`
2. Convert in-memory semantic mappings to the Wikidata schema, then open a QuickStatements tab in the web browser using `open_quickstatements()`

Here’s what the QuickStatements web interface looks like after preparing some demo mappings:

It also implements the following non-interactive workflows, which should be used with caution since they write directly to Wikidata:

1. Read an SSSOM file, convert mappings to the Wikidata schema, then post non-interactively to Wikidata via QuickStatements using `read_and_post()`
2. Convert in-memory semantic mappings to the Wikidata schema, then post non-interactively to Wikidata via QuickStatements using `post()`

* * *

I’m a bit hesitant to start uploading SSSOM content to Wikidata in bulk, because I don’t yet have a plan for how to maintain mappings that might change over time in their upstream single source of truth, e.g., mappings curated in Biomappings. Otherwise, I think this is a good proof of concept and would like to get feedback about additional qualifiers that could be added, and whether the ones I chose so far are the best.
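For readers who want to try the interactive workflow, here is a minimal sketch; only the function name `read_and_open_quickstatements()` comes from this post, while the module path and the file name are assumptions.

```python
# Hypothetical sketch of the interactive workflow described above; the module
# path and the input file name are assumptions -- only the function name
# read_and_open_quickstatements() is named in the post.
from sssom_pydantic.wikidata import read_and_open_quickstatements

# Read an SSSOM TSV, convert the mappings to the Wikidata schema, and open a
# QuickStatements batch in the browser for review before uploading.
read_and_open_quickstatements("mappings.sssom.tsv")
```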
cthoyt.com
January 8, 2026 at 5:54 PM
the Variants and Us (VUS) Podcast did an episode on biocuration last summer, with a focus on genomics and analysis of variants:

https://open.spotify.com/episode/3WphkfnZQMbS0q97wYZ3Ze?si=HyskwSw3RGig9pXMvhp2JA

def relevant for the @biocurator community
Biocuration: from Evidence to Classification
Variants and Us (VUS) Podcast · Episode
open.spotify.com
January 5, 2026 at 4:49 PM
xoxo.zone
December 20, 2025 at 11:16 AM
i've written about my experience at the @deNBI BioHackathon Germany 2025 with the @dalia, TeSS, and Bioschemas teams #bhg2025

https://cthoyt.com/2025/12/09/biohackathon-de-2025.html
Machine-Actionable Training Materials at BioHackathon Germany 2025
I recently attended the 4th BioHackathon Germany hosted by the German Network for Bioinformatics Infrastructure (de.NBI). I participated in the project _On the Path to Machine-actionable Training Materials_ in order to improve the interoperability between DALIA, TeSS, mTeSS-X, and Schema.org. This post gives a summary of the activities leading up to the hackathon and the results of our happy hacking.

## Team

Our project, On the Path to Machine-actionable Training Materials, had the following active participants throughout the week:

* Nick Juty & Phil Reed (University of Manchester)
* Leyla Jael Castro & Roman Baum (Deutsche Zentralbibliothek für Medizin; ZB Med)
* Petra Steiner (University of Darmstadt)
* Oliver Knodel & Martin Voigt (Helmholtz-Zentrum Dresden-Rossendorf; HZDR)
* Dilfuza Djamalova (Forschungszentrum Jülich; FZJ)
* Jacobo Miranda (European Molecular Biology Laboratory; EMBL)

Nick and Petra were our team leaders and Phil acted as the project’s _de facto_ secretary. On the first day of the hackathon, we were briefly joined by Alban Gaignard (Nantes University), Dimitris Panouris (SciLifeLab), and Harshita Gupta (SciLifeLab) to present their current related work. Similarly, Dominik Brilhaus (Heinrich-Heine-Universität Düsseldorf) joined on the first day to share his perspective from DataPLANT (the NFDI consortium for plants) as a training materials creator. Finally, Helena Schnitzer (FZJ) participated in some Schema.org discussions through the week.

## Goals

We categorized our work plan into three streams:

1. **Training Material Interoperability** - survey the landscape of relevant ontologies and schemas for annotating learning materials, curate mappings/crosswalks between existing data models, develop a programmatic toolbox, and begin federating between training material platforms
2. **Training Material Analysis** - analyze training materials at scale to group similar training materials, reduce redundancy, and semi-automatically construct learning paths
3. **Modeling Learning Paths** - collect use cases and develop a (meta)data model for learning paths

## Training Material Interoperability

Interoperability is the third pillar of the FAIR data principles. Metadata describing training materials may be captured and stored in one of several data models, including the DALIA Interchange Format (DIF) v1.3, the format implicitly defined by the TeSS API, and the Schema.org learning material profile. Further, metadata records conforming to these data models are filled with references to terms in other ontologies, controlled vocabularies, databases, and other resources that mint (persistent) identifiers. Our overarching goal at the hackathon was to improve interoperability on both levels.

### Indexing Ontologies and Schemas

Our first concrete goal for training material interoperability at the hackathon was to survey ontologies, controlled vocabularies, databases, and other resources that mint (persistent) identifiers that might appear in the metadata describing a learning material. For example, TeSS uses the EDAM Ontology to annotate topics onto training materials. For the same purpose, DALIA uses the Hochschulcampus Ressourcentypen (I’ll say more on how we deal with the conflicting resources in the section below on mappings). Our second concrete goal was to survey schemas that are used in modeling open educational resources and training materials, for example, Schema.org, OERSchema, and MoDALIA, which encodes the DALIA Interchange Format (DIF) v1.3.
The Semantic Farm (https://semantic.farm) is a comprehensive database of metadata about resources that mint (persistent) identifiers (e.g., ontologies, controlled vocabularies, databases, schemas), such as their preferred CURIE prefix for usage in SPARQL queries and other semantic web applications. It imports and aligns with other databases like Identifiers.org (for the life sciences) and BARTOC (for the digital humanities) to support interoperability and sustainability. It follows the open data, open code, and open infrastructure (O3) guidelines and has well-defined governance to enable community maintenance and support longevity. It’s the perfect place to index all the learning material and open educational resource-related ontologies, controlled vocabularies, databases, and schemas.

I gave a tutorial on how to search the Semantic Farm for ontologies, controlled vocabularies, and other resources that mint (persistent) identifiers, and how to contribute any that are missing. In short, they can be contributed by filling out the new prefix request template on GitHub. If you’re interested in adding a new entry, you can directly use the form, read the contribution guidelines, or watch a short YouTube tutorial.

While I had done some significant preparatory work before the hackathon by creating many new entries in the Semantic Farm, the team found and added several new and important entries to the Semantic Farm during the hackathon too. Here are two highlights:

* Martin Voigt contributed the prefix `amb` for the Allgemeines Metadatenprofil für Bildungsressourcen (General Metadata Profile for Educational Resources) in biopragmatics/bioregistry#1781. This is a metadata schema for learning materials produced by the Kompetenzzentrum Interoperable Metadaten (KIM) within the Deutsche Initiative für Netzwerkinformation e.V. that was heavily inspired by Schema.org and the Dublin Core Learning Resource Metadata Initiative (LRMI).
* Dilfuza Djamalova and Jacobo Miranda contributed the prefix `gtn` for Galaxy Training Network training materials in biopragmatics/bioregistry#1779. This resource contains multi- and cross-disciplinary training materials for using the Galaxy workflow management system. Below, I describe how we ingested and transformed the training materials from GTN into a common format such that they can be represented according to the DALIA Interchange Format (DIF) v1.3, the implicit data model expected by TeSS, and in Schema.org-compliant RDF.

Ultimately, we collated relevant ontologies, controlled vocabularies, schemas, and other resources that mint (persistent) identifiers in a collection such that they can be easily found and shared.

### Semantic Mappings and Crosswalks

I alluded to the different resources used by TeSS and DALIA to annotate disciplines. The issue of partially overlapping ontologies, controlled vocabularies, and databases is quite widespread, and can manifest in a few different ways. The figure above shows that redundancy can arise because of different focus within a domain (i.e., the chemistry example), different hierarchical specificity (i.e., the disease example), and due to massive generic resources having overlap across many domains (e.g., UMLS, MeSH, NCIT). This is problematic when integrating learning materials from different sources, e.g., TeSS and DALIA, because two learning materials may be annotated with different terms describing the same discipline. Therefore, the solution is to create semantic mappings between these terms.
I’ve worked for several years on the Simple Standard for Sharing Ontological Mappings (SSSOM) standard for storing semantic mappings, so this was naturally the target for our work. Further, I have been working on a domain-agnostic workflow for predicting semantic mappings with lexical matching and deploying a curation interface called SSSOM Curator. I gave a tutorial for using SSSOM Curator to the team based on a previous tutorial I made (that can be found on YouTube here). We prepared predicted semantic mappings between several learning material-related ontologies in biopragmatics/biomappings#204, but we didn’t prioritize semantic mapping curation during the hackathon. Here’s what they look like in the SSSOM Curator interface for Biomappings:

Whereas curating correspondences between concepts in ontologies, controlled vocabularies, and databases is often called semantic mapping, curating correspondences between schemas and the properties therein is often called crosswalking. We put a bigger emphasis on producing crosswalks between Schema.org and MoDALIA. This is actually a more complex problem because correspondences between elements in schemas can be more sophisticated (e.g., mapping two fields for first and last names to a single name field), but there are at least a few places where properties can be mapped with SSSOM.

An interesting lesson learned is that some curators find using SKOS relationships challenging because the narrow and broad match relations point in the opposite direction from what they would expect. For example, `X skos:narrowMatch Y` reads as "X has narrower match Y", meaning that Y is the narrower concept, not that X is a narrow match for Y. Many vocabularies use a verb as part of the predicate to reduce this confusion - I’m sure if it were spelled `X skos:hasNarrowerMatch Y`, then this would not have been a problem. Deep down, the real issue is that transparent identifiers (i.e., human-readable ones) are bad, since they can’t be changed over time. See the excellent article, Identifiers for the 21st century, by McMurry _et al._ (2017) for a more detailed discussion on what makes a good identifier.

### Operationalizing Crosswalks

The next step was to translate the abstract crosswalks between DALIA, TeSS, and Schema.org into a concrete implementation using a general-purpose programming language (i.e., Python).

#### The Scaling Problem

Given that we only focused on these three data models, it’s not unrealistic to produce a DALIA-TeSS crosswalk, a TeSS-Schema.org crosswalk, and a DALIA-Schema.org crosswalk. However, this approach does not scale well - in general, it requires curating and implementing $\binom{N}{2}$ crosswalks, with $N$ being the number of schemas. An alternative is to use a hub-and-spoke model, in which one data model is targeted as the intermediary used for interchange and storage. This reduces the burden on curators of crosswalks, as they only have to curate a single crosswalk from any given data model into the intermediary. Similarly, it reduces the burden on code maintainers, as only a single crosswalk has to be implemented per data model. The challenge with open educational resources and learning materials is that no existing data model is sufficient to cover the (most important) aspects of all other data models. This motivated us to implement a unified, generic data model for learning materials to serve as the interoperability hub between DALIA, TeSS, Schema.org, and other data models.
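To make the scaling argument concrete (my arithmetic, not from the post): for the $N = 7$ data models in the figure below, the all-to-all approach requires $\binom{7}{2} = 21$ pairwise crosswalks, while the hub-and-spoke approach requires only $7$, one per spoke.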
```mermaid
graph TD
    subgraph alltoall ["All-to-All (complex, burdensome)"]
        dalia[DALIA] <--> tess[TeSS]
        dalia <--> schema[Schema.org]
        dalia <--> oerschema[OERschema]
        dalia <--> amb["Allgemeines Metadatenprofil für Bildungsressourcen (AMB)"]
        dalia <--> lrmi["Learning Resource Metadata Initiative (LRMI)"]
        dalia <--> erudite[ERuDIte]
        tess <--> schema
        tess <--> oerschema
        tess <--> amb
        tess <--> lrmi
        tess <--> erudite
        schema <--> oerschema
        schema <--> amb
        schema <--> lrmi
        schema <--> erudite
        oerschema <--> amb
        oerschema <--> lrmi
        oerschema <--> erudite
        amb <--> lrmi
        amb <--> erudite
        lrmi <--> erudite
    end
    subgraph hub ["Hub-and-Spoke (maintainable, extensible)"]
        direction TB
        hubn[Unified OER Data Model] <--> daliaspoke[DALIA]
        hubn[Unified OER Data Model] <--> tessspoke[TeSS]
        hubn[Unified OER Data Model] <--> schemaspoke[Schema.org]
        hubn[Unified OER Data Model] <--> oerschemaspoke[OERschema]
        hubn[Unified OER Data Model] <--> ambspoke["Allgemeines Metadatenprofil für Bildungsressourcen (AMB)"]
        hubn[Unified OER Data Model] <--> lrmispoke["Learning Resource Metadata Initiative (LRMI)"]
        hubn[Unified OER Data Model] <--> eruditespoke[ERuDIte]
    end
    alltoall --> hub
```

The famous XKCD comic, Standards (https://xkcd.com/927), proselytizes that any proposal of a unified standard that covers everyone’s use cases is doomed to become the $(N+1)$th competing standard. While I’m doing my best to present the work done in preparation for the hackathon and at the hackathon in a linear way, the truth is that most steps also included discussion, hacking, trying, failing, and repeating. Therefore, I can confidently say that for practical reasons, implementing a new _de facto_ standard was the only realistic choice.

#### The OERbservatory Data Model

During the hackathon, we implemented the open source OERbservatory Python package. I first want to talk about three major features that it includes:

1. a unified, generic object model for open educational resources that’s effectively the union of the best parts of DALIA, TeSS, Schema.org, and a few other data models we found
2. import and export for two open educational resource and learning materials data models - DALIA and TeSS (we didn’t have time during the hackathon to implement import and export for Schema.org)
3. import from three external learning material repositories - OERhub, OERSI, and the Galaxy Training Network (GTN)

Here’s an excerpt of the object model, implemented using Pydantic. Note that Pydantic uses a combination of Python’s type system and type annotations to express constraints and rules, similarly to how SHACL does. However, we get the benefit of Python type checking and the Python runtime to check that we’ve encoded this all correctly. Finally, all Pydantic models can be serialized to and deserialized from JSON.

```python
class EducationalResource(BaseModel):
    """Represents an educational resource."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    reference: Reference | None = Field(
        None,
        description="The primary reference for this learning material",
        examples=[Reference(prefix="dalia", identifier="")],
    )
    title: InternationalizedStr = Field(..., description="The title of the learning material")
    authors: list[Author | Organization] = Field(
        default_factory=list,
        description="An ordered list of authors (i.e., persons or organizations) of the learning material",
        examples=[
            Author(name="Charles Tapley Hoyt", orcid="0000-0003-4423-4370"),
            Organization(name="NFDI", ror="05qj6w324"),
        ],
        min_len=1,
    )
    ...
```
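As a usage sketch (not from the post; the import path and the assumption that `InternationalizedStr` accepts a plain string are mine), constructing and serializing a record might look like this:

```python
# Hypothetical usage sketch; the import path and the coercion of a plain
# string into InternationalizedStr are assumptions.
from oerbservatory import Author, EducationalResource

resource = EducationalResource(
    title="Introduction to Research Data Management",
    authors=[Author(name="Charles Tapley Hoyt", orcid="0000-0003-4423-4370")],
)

# Pydantic v2 models serialize to JSON out of the box
print(resource.model_dump_json(indent=2, exclude_none=True))
```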
#### Technology Comparison (content warning: programming culture wars)

DALIA and Schema.org are built on top of semantic web principles. Records about learning materials encoded in these data models are stored in RDF and queryable via SPARQL. However, while powerful, SPARQL is a querying language that is inherently limited in its expressibility and utility. A general-purpose programming language is more suited for building data science workflows, search engines, APIs, web interfaces, and other tools on top of open educational resource and learning material data. That's why we emphasized concretizing the crosswalks between DALIA, TeSS, and Schema.org in a software implementation. We chose Python as the target language because of its ubiquity and ease of use.

When the TeSS platform was initially developed in the early 2010s, the Ruby programming language and the Ruby on Rails framework were a popular choice for developing web applications. Unfortunately for TeSS, the scientific Python stack and machine learning ecosystem have since made Python the clear winner for academics and scientists. This creates an issue: only a small number of academics are skilled in Ruby and can participate in the development of TeSS.

It was also crucial that we used Python such that our implementation was reusable. For example, the DALIA 1.0 platform was implemented using Django, which made it effectively impossible to reuse any of the underlying code outside the platform, e.g., in a data science workflow. The same issue is also true for the TeSS implementation using Ruby on Rails. While these batteries-included frameworks can get a minimal web application running quickly, they generally lead developers towards writing code that isn't reusable.

#### OERbservatory as an Interoperability Hub between DALIA and TeSS

Before we even started working on the OERbservatory, we had implemented two packages for working with data in DALIA and TeSS:

1. data-literacy-alliance/dalia-dif implements a parser for the DALIA DIF v1.3 tabular format, an internal representation of the content (also using Pydantic), and an RDF serializer (using pydantic-metamodel)
2. cthoyt/tess-downloader implements an API client for TeSS and an internal representation of the learning resource data model (using Pydantic)

Because each of these packages already implemented an internal (lossless) representation of the data models for DALIA and TeSS, respectively, we only had to write code in OERbservatory that maps the fields between them and OERbservatory’s data model. This was a **big** milestone towards interoperability. We demonstrated its potential by programmatically downloading all learning materials from the ELIXIR TeSS instance’s API and exporting them as DALIA RDF. Similarly, we converted all learning materials curated for DALIA into the TeSS JSON format. Later, I’ll describe how we took this workflow one step further to implement syncing between DALIA and TeSS.

Note that this mapping can’t simply be expressed using SSSOM, SHACL, or other declarative languages, because it relies on more sophisticated logic. For example, topics annotated with ontology terms in the DALIA data model only store the URI reference, whereas topics annotated with ontology terms in the TeSS data model require both the URI reference and the term’s label. Since we’re encoding our crosswalks using a general-purpose programming language, we have a larger toolkit available. Here, we could use PyOBO, a generic package I’ve written for working with ontologies, for looking up labels.
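For example, such a label lookup might look like the following minimal sketch; the exact PyOBO function and signature (`pyobo.get_name(prefix, identifier)`) are an assumption based on its high-level API and may differ across versions.

```python
# Minimal sketch of looking up a term's label with PyOBO.
# Assumes pyobo.get_name(prefix, identifier) is available; the exact
# signature may differ across PyOBO versions.
import pyobo

label = pyobo.get_name("go", "0005618")
print(label)  # expected: "cell wall"
```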
Unfortunately, we did not have time to implement an importer/exporter for Schema.org. We deprioritized this because Schema.org felt the least approachable due to the way its documentation is written, the complexity of its models, and its prolific use of mixins. We considered whether we could automatically generate Pydantic classes from Schema.org - and it turns out that pydantic-schemaorg has already done it! Unfortunately, the code is not compatible with modern versions of Pydantic, and the project appears abandoned. We only had so much time at the hackathon, so forking/reviving/rewriting `pydantic-schemaorg` was left as a task for later.

#### The OERbservatory as an Aggregator

Besides open educational resources and learning materials that are encoded in the DALIA, TeSS, and Schema.org formats, there are many repositories of learning materials that do not conform to a well-defined schema. Prior to the hackathon, I had already explored the Austrian OERhub and the Open Educational Resources Search Index (OERSI) and written importers into `dalia-dif`. At the hackathon, I reimplemented those importers using the newly formed OERbservatory unified, generic data model.

On the Thursday morning of the BioHackathon, I had an excellent mob programming session with Dilfuza Djamalova and Jacobo Miranda to import training materials from the Galaxy Training Network (GTN). It turns out that there are already several open educational resources and learning materials that are automatically scraped and imported by TeSS. However, those importers are limited by TeSS’s relatively rigid data model, which is bound to its database and can therefore not easily be evolved. Dilfuza and Jacobo had a few goals for our hacking:

* There are fields in GTN that aren’t yet captured by TeSS. They wanted to implement those fields in OERbservatory, demonstrate their usage, then gently nudge TeSS to evolve its data model to support their use cases.
* They wanted to index their content in DALIA, which becomes much easier if they only have to maintain one importer in OERbservatory, which can already export to DALIA.
* GTN is part of the deKCD consortium, which wants to deduplicate training material. Adding an importer here gives access to the workflows we’re building for reconciling metadata curated in different places about the same materials, identifying similar materials to reduce duplicate effort, and connecting people working on the same kinds of materials.

We implemented the GTN importer in data-literacy-alliance/oerbservatory#8, which covers tutorials in GTN and later could be extended to slide decks. Along the way, we updated the main educational resource model in OERbservatory to include a few new fields, including the status (which is also shared by TeSS - that now needs to be incorporated), the publication date, and the modified date. We did not make a complete mapping for all fields in GTN due to time constraints, so we implemented logging that summarizes fields that haven’t yet been mapped (see the PR for examples of each). For example, the way that contributor information is incorporated into the API from the frontmatter in the source is interesting - it resolves the keys in the frontmatter to entries in this YAML file in the GTN GitHub repository. We will want to think about the best way to map the authors into OERbservatory, and this also might be a time to extend the author list to include contributor role annotations.
I was very excited that Dilfuza and Jacobo were motivated to work on this and to contribute following the hackathon. We’ll see if the OERbservatory is approachable enough for future external contributions! For example, Robert Hasse of NFDI4BIOIMAGE already proactively prepared a script that exports their consortium’s training materials into the DALIA DIF v1.3 tabular format. I don’t consider this a very approachable format, and I’m sure efforts like his could have been eased by using OERbservatory as a target. The next steps are to incorporate the Swiss Digital Research Infrastructure for Arts and Humanities (DARIAH-CH) and Physical Sciences Data Infrastructure (PSDI) learning materials, which appeared on the schematic diagram for OERbservatory earlier. There are also a lot of other potential learning material repositories to scrape, like Glittr.com. If you have a suggestion, you can drop it in the OERbservatory issue tracker. Further, given that Martin Voigt was in the room during this hacking and discussion, and he is the maintainer of TeSS’s scraper code, we already started formulating plans on how we might be able to deduplicate efforts.

### Federation of Open Educational Resources and Learning Materials

The next step towards interoperability, beyond the demonstration of converting between the formats used by DALIA and TeSS, was to demonstrate actually posting the content to the live services. While we are currently in the process of implementing submission of open educational resources and learning materials in DALIA, TeSS already has a web-based interface for registering new learning materials. TeSS doesn’t have a documented API endpoint for posting learning materials, but luckily, Martin knew where it was and helped figure out the correct way to pass credentials to use it. We managed this by a combination of reading the Ruby implementation of TeSS and good ol’ trial and error. In the end, we implemented posting learning materials in the TeSS-specific Python package in cthoyt/tess-downloader#2. Then, it was only a matter of stringing together code that converts DALIA to OERbservatory, OERbservatory to TeSS, and then uploads to TeSS.

In parallel, Martin worked on improving the devops behind the PaNOSC TeSSHub to enable quickly spinning up new TeSS instances that each have their own subdomain. He created a different subdomain for each of DALIA, OERSI, GTN/deKCD, and OERhub. Finally, we wrote a script that uploaded all open educational resources and learning materials from each source to the appropriate TeSS instance in data-literacy-alliance/oerbservatory#3. The results in each space can be explored here:

Source | Domain
---|---
DALIA | https://dalia.tesshub.hzdr.de
OERhub | https://oerhub.tesshub.hzdr.de
OERSI | https://oersi.tesshub.hzdr.de
GTN/deKCD | https://kcd.tesshub.hzdr.de
PaNOSC | https://panosc.tesshub.hzdr.de

A full list of spaces can be found here.

#### European Open Science Cloud

The great specter looming over most NFDI-related projects is how to interface with the European Open Science Cloud (EOSC). At the surface, EOSC is a massive undertaking to democratize access to research infrastructure on the European level. However, having just entered the NFDI bubble at the end of the summer, I have been overwhelmed by the high pressure to participate in EOSC combined with the lack of funding and direction on how best to go about doing that.
All of that being said, Oliver Knodel spent the hackathon preparing the concept for how we could connect TeSSHub to the EOSC open educational resource and training materials registry using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Once TeSSHub can demonstrate federating its content through this mechanism, we can use it as inspiration for a generic implementation in OERbservatory.

#### Governance and Provenance

Now that it’s possible to copy training materials from one platform to another, we have started to consider governance and provenance issues like:

* If a training material originally curated in DALIA is displayed in TeSS, how is that attributed? We will have to carefully consider how metadata records about learning resources are identified, and how those identifiers are passed around during interchange/syncing.
* If a training material originally from TeSS is enriched in the DALIA platform, should that information flow back to TeSS, and how? We will have to carefully consider how information is deduplicated and reconciled.
* How do we implement technical systems that can keep many federated platforms up-to-date with each other?

I’m sure there will be many more questions along these lines. Luckily, the mTeSS-X group has already begun discussions on a smaller scale, since they care about how to federate between many disparate TeSS instances.

## Training Material Analysis

Our team split into two for the analysis of training materials. The first team looked into algorithmic mechanisms for featurizing open educational resources and learning materials and applications of those features. The second team looked into using large language models (LLMs) for the automated construction of learning paths.

### Featurization and Application

The first team looked into two techniques for featurizing (i.e., assigning dense vectors to) open educational resources and learning materials. The first and most interpretable technique was to concatenate free text fields and labels from structured fields of a learning resource and index the entire corpus (i.e., all learning resources) using the term frequency-inverse document frequency (TF-IDF) algorithm. This does a small amount of text preprocessing, calculates a word list for the entire corpus, then scores, for each word, how characteristic it is of a given learning material relative to the entire corpus. Each learning material is then assigned a vector with values in $[0, 1]$ whose length is that of the word list. Learning materials can be compared, e.g., using the cosine similarity between their respective vectors. The second technique was to use the sentence transformers machine learning architecture, which relies on a pre-trained (not large) language model to accomplish a similar vectorization. Both methods run in less than a few minutes for the corpus of learning resources from DALIA, TeSS, OERhub, and OERSI. We also pre-calculated the all-by-all similarities and applied a cutoff of 0.7 to shorten the list. Both the TF-IDF and sentence transformers vector indexes and similarities are committed to the OERbservatory repository and are available here.

After we had embeddings, Dilfuza began to investigate some of the following:

1. Identify duplicate metadata records corresponding to the same learning material, e.g., when two different platforms scraped the same learning material
2. Semi-automatically identify similar training materials to improve suggestions to learners, to connect learning material creators, and to help de-duplicate training material creation efforts

We only got this far on the last day of the hackathon, so there is still a lot more to do here! Originally, I had planned on also using these embeddings to train classifiers for key metadata such as topic, target audience, and difficulty level, then to create a semi-automated curation workflow for enriching learning materials whose records are sparsely annotated. These will be next steps.

### Automated Construction of Learning Paths

Nick looked into using large language models (LLMs) to construct learning paths through machine-assisted dialog. This part is highly experimental, so there isn’t much to point to yet, but the idea was to take in a list of learning materials (either hard-coded or as a URL for the chat system to retrieve) and a prompt asking the LLM to collect similar materials based on objectives and keywords, then create a learning path based on difficulty (which is infrequently annotated) and suggest a title. This workflow was used to produce three learning paths, each of which was ordered and had reference links, a difficulty rating, a title, and a provider:

1. Sequencing and QC (10 items)
2. Git and Version Control (6 items)
3. Genome Annotation (8 items)

More on this in future work!

## Modeling Learning Paths

While there isn’t a clear consensus on what a learning path is, a simple definition is that a learning path is a sequence of learning materials to consume to help a learner achieve a specific level of competence on a topic. TeSS implements a data model for learning paths based on this definition, and the ELIXIR TeSS instance has eleven examples. Our team had the goal of developing an extension of Schema.org (in Bioschemas) to capture learning paths. For transparency, I didn’t actively participate in this track, but think it’s worth sharing the results, most of which are adapted from Phil’s repository in BioSchemas/LearningPath-sandbox.

### Proposed Data Model

Phil, Alban, and Leyla proposed two new Bioschemas profiles and a small change to one Bioschemas profile with the help of Nick and Roman:

* `LearningPath`: inherits from `Course`
* `LearningPathModule`: inherits from `Course`, `Syllabus`, `ListItem`, and `ItemList`
* `TrainingMaterial`: inherits from `LearningResource` and `ListItem`

Here’s a class diagram describing the proposed data model, where 🔺 is a Schema.org type, 🟩 is a Bioschemas profile, and 🔵 is a new profile:

```mermaid
classDiagram
    direction TB
    class Event["Event🔺"] {
    }
    class CourseInstance["CourseInstance🔺🟩"] {
    }
    class Course["Course🔺🟩"] {
        syllabusSections
    }
    class new_LearningPath["new:LearningPath🔵"] {
        Syllabus[] syllabusSections
    }
    class ListItem["ListItem🔺"] {
        nextItem
    }
    class Syllabus["Syllabus🔺"] {
    }
    class new_LearningPathModule["new:LearningPathModule🔵"] {
        ListItem[] itemListElement
        LearningPathTopic nextItem
    }
    class LearningResource["LearningResource🔺"] {
    }
    class bio_TrainingMaterial["bio:TrainingMaterial🟩"] {
    }
    Course <|-- new_LearningPath
    Course <|-- new_LearningPathModule
    Syllabus <|-- new_LearningPathModule
    ListItem <|-- new_LearningPathModule
    LearningResource <|-- Course
    LearningResource <|-- bio_TrainingMaterial
    LearningResource <|-- Syllabus
    Event <|-- CourseInstance
```

### Concrete Example from Galaxy Training Network

The team mocked encoding the Introduction to Galaxy and Sequence analysis learning path on TeSS in this new schema.
This learning path has the following structure:

1. **Module 1: Introduction to Galaxy**
   1. A short introduction to Galaxy
   2. Galaxy Basics for genomics
2. **Module 2: Basics of Genome Sequence Analysis**
   1. Quality Control
   2. Mapping
   3. An Introduction to Genome Assembly
   4. Chloroplast genome assembly

Here’s a mockup of how this could look in RDF:

```turtle
@prefix dct: <http://purl.org/dc/terms/> .
@prefix ex: <http://example.org/> .
@prefix schema: <https://schema.org/> .

ex:GA_learning_path a schema:Course ;
    dct:conformsTo <https://bioschemas.org/profiles/LearningPath> ;
    schema:courseCode "GSA101" ;
    schema:description "This learning path aims to teach you the basics of Galaxy and analysis of sequencing data. " ;
    schema:name "Introduction to Galaxy and Sequence analysis" ;
    schema:provider ex:ExampleUniversity ;
    schema:syllabusSections ex:Module_1, ex:Module_2 .

ex:Module_1 a schema:ItemList, schema:ListItem, schema:Syllabus ;
    dct:conformsTo <https://bioschemas.org/profiles/LearningPathModule> ;
    schema:itemListElement ex:TM11, ex:TM12 ;
    schema:name "Module 1: Introduction to Galaxy" ;
    schema:nextItem ex:Module_2 ;
    schema:teaches "Learn how to create a workflow" .

ex:TM11 a schema:LearningResource, schema:ListItem ;
    dct:conformsTo <https://bioschemas.org/profiles/TrainingMaterial> ;
    schema:description "What is Galaxy" ;
    schema:name "(1.1) A short introduction to Galaxy" ;
    schema:nextItem ex:TM12 ;
    schema:url "https://tess.elixir-europe.org/materials/hands-on-for-a-short-introduction-to-galaxy-tutorial?lp=1%3A1" .
```

Here’s the same thing from a graphical perspective:

```mermaid
graph TD
    N1["Module 1: Introduction to Galaxy"]
    N2["(1.1) A short introduction to Galaxy"]
    N3["(1.2) Galaxy Basics for genomics"]
    N4["Module 2: Basics of Genome Sequence Analysis"]
    N5["(2.1) Quality Control"]
    N6["(2.2) Mapping"]
    N7["(2.3) An Introduction to Genome Assembly"]
    N8["(2.4) Chloroplast genome assembly"]
    N1 -- itemListElement --> N2
    N1 -- itemListElement --> N3
    N1 -- nextItem --> N4
    N2 -- nextItem --> N3
    N3 -- nextItem --> N5
    N4 -- itemListElement --> N5
    N4 -- itemListElement --> N6
    N4 -- itemListElement --> N7
    N4 -- itemListElement --> N8
    N5 -- nextItem --> N6
    N6 -- nextItem --> N7
    N7 -- nextItem --> N8
```

Something that I became aware of while listening to the discussions about learning paths is the way that Schema.org models lists. I wonder why they don’t use the built-in RDF notion of lists and instead implemented their own formalism. I saw that this caused a lot of confusion for the team, both during mocking and during SPARQL querying. I think the next step for learning paths is to create a concrete implementation in OERbservatory - we have the benefit that the Python programming language provides a much more ergonomic abstraction over lists and collections, as sketched below. There’s a lot of content inside the Galaxy Training Network (GTN) that could be ingested into such a learning path.
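To make the Python angle concrete, here is a minimal, hypothetical Pydantic sketch of a learning path model, populated with the Galaxy example above; the class and field names are illustrative, not the actual OERbservatory implementation.

```python
# Hypothetical sketch of a learning path data model; class and field names
# are illustrative, not the actual OERbservatory implementation.
from pydantic import BaseModel, Field


class LearningPathItem(BaseModel):
    """A single training material within a module, kept in order."""

    name: str
    url: str | None = None


class LearningPathModule(BaseModel):
    """An ordered group of training materials, e.g., a module."""

    name: str
    items: list[LearningPathItem] = Field(default_factory=list)


class LearningPath(BaseModel):
    """An ordered sequence of modules a learner works through."""

    name: str
    modules: list[LearningPathModule] = Field(default_factory=list)


path = LearningPath(
    name="Introduction to Galaxy and Sequence analysis",
    modules=[
        LearningPathModule(
            name="Module 1: Introduction to Galaxy",
            items=[
                LearningPathItem(name="(1.1) A short introduction to Galaxy"),
                LearningPathItem(name="(1.2) Galaxy Basics for genomics"),
            ],
        ),
    ],
)

# Python lists already carry the ordering that Schema.org expresses via
# itemListElement/nextItem, and the model round-trips to JSON for free.
print(path.model_dump_json(indent=2))
```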
Towards this end, I gave a quick demo of Pydantic to the learning paths team and showed them how I typically go about data modeling.

* * *

I really enjoyed the BioHackathon, and in general, I am very happy to be attending more events to network with other academics in Germany. It was totally exhausting, too, which is why I didn't manage to finish this post in the week following the event.

In other open educational resource and learning materials news, we pre-printed the first academic article describing a specific use case for DALIA on arXiv in September: Teaching RDM in a smart advanced inorganic lab course and its provision in the DALIA platform. We're currently finalizing a second article fully dedicated to describing the DALIA platform, which I hope can go up on arXiv in early January. Stay tuned!
cthoyt.com
December 15, 2025 at 5:32 PM
RE: https://scicomm.xyz/@ORCID_Org/115707294780355754

use ORCID the way they suggest. but also definitely keep using your personal email.

more thoughts on this: https://cthoyt.com/2022/02/06/use-your-personal-email.html
scicomm.xyz
December 12, 2025 at 4:44 PM
Reposted by Charles Tapley Hoyt
Me: urgh, why is Typst using _text_ for italic and *text* for bold, this is a pointless and annoying divergence from the Markdown syntax

Me after a week of using Typst: wait actually this makes so much more sense than Markdown's syntax, all hail Typst
December 8, 2025 at 4:17 PM
biomappings is a project for predicting and curating semantic mappings between biomedical vocabularies in SSSOM

i'm working in @NFDI with researchers from other disciplines, so I recently did a full refactor of the underlying code into a new project, SSSOM Curator ( […]
Original post on scholar.social
scholar.social
November 24, 2025 at 10:25 PM
I spent the entire day curating new prefixes for @bioregistry to support interoperability in the @NFDI -> nearly 100 new prefixes coming up across digital humanities, engineering, computer science, and more
November 18, 2025 at 4:06 PM
@typst it would be cool if I could cite directly using a DOI, and you took care of looking up the metadata from crossref (or wherever) for me.

rather than @mcmurry2017 I would love to be able to do @doi:10.1371/journal.pbio.2001414

this is possible in Manubot (https://manubot.org) - see docs […]
Original post on scholar.social
scholar.social
November 5, 2025 at 2:29 PM
@BeilsteinInstitut I saw you're making an open source database https://github.com/Beilstein-Institut/BChemLookup. I already added a PR to help fix some data errors, would love some feedback.
GitHub - Beilstein-Institut/BChemLookup: An Open Science Initiative for Mapping Common Chemical Abbreviations
An Open Science Initiative for Mapping Common Chemical Abbreviations - Beilstein-Institut/BChemLookup
github.com
November 5, 2025 at 2:16 PM
The EBI has recently published a preprint describing OxO2, the second major version of their ontology mapping service, now based on SSSOM: https://arxiv.org/abs/2506.04286

nice to see citation of SeMRA and reuse of the comprehensive SSSOM semantic mapping datasets we produced and archived on […]
Original post on scholar.social
scholar.social
November 4, 2025 at 10:24 AM
i am low-key offended when people paste text into my google docs that don't have spacing after paragraphs
October 22, 2025 at 3:10 PM
@julian cool to see what you're building in @encyclia. what language are you developing it in?
October 19, 2025 at 4:01 PM
in my first double blog post ever, I wrote about encoding databases as ontologies, the PyOBO software package, and the design choices and philosophy behind the HGNC (@genenames) to ontology converter.

1️⃣ background and software - […]
Original post on scholar.social
scholar.social
October 15, 2025 at 9:45 PM
Reposted by Charles Tapley Hoyt
You are a very busy, very important professor publishing very important work. Do you
a) just publish the code and data along with the paper because you know your work will survive close scrutiny and you have better things to do
b) spend your time handling individual data requests, negotiating […]
Original post on neuromatch.social
neuromatch.social
October 8, 2025 at 9:19 PM
after much ado, I have finished writing about bridging the @nfdi4culture and @NFDI4Chem@nfdi.social knowledge graphs

📖 https://cthoyt.com/2025/10/07/bridging-culture-and-chemistry.html

the demo was to link experiments in Chemotion electronic lab notebooks […]

[Original post on scholar.social]
October 8, 2025 at 12:15 PM
have you ever annotated a TypedDict onto a function's **kwargs and want Sphinx to automatically add it to the function's docstring?

class GreetingKwargs(TypedDict):
    name: Annotated[str, Doc("the name of the person to greet")]

def greet(**kwargs […]

[Original post on scholar.social]
October 3, 2025 at 3:30 PM
@ktk is there a programmatic way to get the list of prefixes/URI namespaces in a QLever instance?

I would love to add a tool to @bioregistry like the one that gets the prefix list from Virtuoso services and validates it (PR in https://github.com/biopragmatics/bioregistry/pull/1691)
Add Virtuoso prefix map validation by cthoyt · Pull Request #1691 · biopragmatics/bioregistry
Closes #1688 This can be run to validate the NFDI4Culture's endpoint $ bioregistry validate virtuoso https://nfdi4culture.de/sparql note that this doesn't work for all triple stores, just V...
github.com
September 26, 2025 at 9:01 AM
@ResearchOrgs I just read the example organization in your bulk curation form is the "University of ROR" and I think that's very funny :)
September 26, 2025 at 8:34 AM
I made a workflow to pull relations between organizations on @wikidata that have @ResearchOrgs identifiers and put them in a format that could be incorporated into ROR

📖 write-up here https://cthoyt.com/2025/09/25/enriching-ror-with-wikidata.html
Suggesting new relations in ROR from Wikidata
I was looking at the different NFDI consortia in the Research Organization Registry (ROR), and found that the only two that have a parent relation to the NFDI (`ror:05qj6w324`) are NFDI4DS (`ror:00bb4nn95`) and MaRDI (`ror:04ncnzm65`). This felt strange to me, so I started looking around Wikidata to see if I could automatically make a curation sheet to send along to them. I found that Wikidata already has detailed pages for all NFDI consortia, and that they also include relationships to the parent. This blog post is about the steps I took to write a workflow to find relationships in Wikidata that are appropriate for submission to ROR.

## Getting Wikidata

In Wikidata, an entity can be annotated with a ROR identifier via property `P6782`. I wanted to write a SPARQL query for the Wikidata Query Service to retrieve all triples for which both the subject and object have a ROR identifier.

```sparql
SELECT ?subject ?subjectROR ?subjectLabel ?predicate ?object ?objectROR ?objectLabel {
  ?subject ?predicate ?object ;
           wdt:P6782 ?subjectROR .
  ?object wdt:P6782 ?objectROR .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
```

While I now know this query should return about 67K rows, at the time, I ran into the issue that it was too complicated and caused the Wikidata Query Service to time out. The next step in any investigation with a blasphemous `?subject ?predicate ?object` pattern is to look into the predicates and try to cut them down. I set about reformulating the query to count the frequency of appearance of each predicate.

```sparql
SELECT DISTINCT ?p ?pLabel (COUNT(?p) as ?count) {
  ?subject wdt:P6782 ?subjectROR ;
           ?predicate ?object .
  ?object wdt:P6782 ?objectROR .
  ?p wikibase:directClaim ?predicate .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
GROUP BY ?p ?pLabel
ORDER BY DESC(?count)
```

This query uses the sneaky `wikibase:directClaim` to map between the `wd:` entity namespace and the `wdt:` direct property namespace so the query service could look up the label for the link. The problem was, this query was still too heavy and caused a timeout. Therefore, I had to simplify the query to just get the counts without the label, then use a second query and join the data externally (I also tried a nested query along the way, but it still timed out).

```sparql
SELECT DISTINCT ?predicate (COUNT(?predicate) as ?count) {
  ?subject wdt:P6782 ?subjectROR ;
           ?predicate ?object .
  ?object wdt:P6782 ?objectROR .
}
GROUP BY ?predicate
ORDER BY DESC(?count)
```

With that out of the way, I tried re-writing the original query by formatting the 147 predicates I pulled out into the `VALUES ?predicate { ... }` clause (abbreviated), like:

```sparql
SELECT ?subject ?subjectROR ?subjectLabel ?predicate ?object ?objectROR ?objectLabel {
  VALUES ?predicate { ... }
  ?subject ?predicate ?object ;
           wdt:P6782 ?subjectROR .
  ?object wdt:P6782 ?objectROR .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}
```

This still caused timeouts, so I resorted to a loop in Python, which also let me simplify the query to skip the Wikidata IDs and just pull out RORs for the subject and object (where the `{...}` gets replaced with a different property on each iteration):

```sparql
SELECT ?subjectROR ?objectROR WHERE {
  ?subjectROR ^wdt:P6782/wdt:{...}/wdt:P6782 ?objectROR .
}
```

I really like this because it uses property paths to reduce the need to specify the middle entities, which don't get used. I don't know if the SPARQL engine is able to optimize on it, but it's cool. Maybe not so readable, but cool.
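Here's a minimal sketch of what such a loop could look like, assuming the standard `requests` library against the public Wikidata Query Service; the property list and output file name are placeholders, and the real code lives in the repository mentioned below:

```python
import csv

import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

# The {prop} placeholder is filled with a different Wikidata property on each
# iteration; doubled braces become literal braces after formatting.
QUERY = """\
SELECT ?subjectROR ?objectROR WHERE {{
  ?subjectROR ^wdt:P6782/wdt:{prop}/wdt:P6782 ?objectROR .
}}
"""


def get_ror_pairs(prop: str) -> list[tuple[str, str]]:
    """Get (subject ROR, object ROR) pairs connected by the given property."""
    response = requests.get(
        WIKIDATA_SPARQL,
        params={"query": QUERY.format(prop=prop), "format": "json"},
        headers={"User-Agent": "ror-wikidata-enrichment-sketch"},
    )
    response.raise_for_status()
    return [
        (row["subjectROR"]["value"], row["objectROR"]["value"])
        for row in response.json()["results"]["bindings"]
    ]


# Placeholder subset of the ~147 predicates found by the counting query above.
predicates = ["P361", "P527", "P749"]

with open("wikidata_ror_triples.tsv", "w") as file:
    writer = csv.writer(file, delimiter="\t")
    writer.writerow(["subjectROR", "predicate", "objectROR"])
    for prop in predicates:
        for subject_ror, object_ror in get_ror_pairs(prop):
            writer.writerow([subject_ror, prop, object_ror])
```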
The loop created a super-sized TSV with the predicate and labels added back. The workflow I implemented for this lives in https://github.com/cthoyt/ror-wikidata-enrichment. The data from Wikidata is in this file, licensed under CC0.

Do you want this workflow to better reflect your organization? Check out my other blog post on how to curate data about your research organization: https://cthoyt.com/2021/01/17/organization-organization.html.

## Getting ROR

I've previously implemented a source in PyOBO that wraps downloading and structuring ROR's data dump into a readily usable format, so getting ROR's triples was as easy as:

```python
import pyobo

df = pyobo.get_relations_df("ror")
```

I also had to map the part of and has part relations from BFO to Wikidata properties. I did this by hand because it was faster than doing it the sustainable way, which would have been to pull the mappings from SSSOM-like annotations in the BFO ontology or from Wikidata itself (since I curated those into Wikidata years ago when we were preparing the (unpublished) relation ontology paper). I made an intermediate output of all of the triples here, licensed under CC0.

## Putting it all together

While I'm glossing over a few steps that you can grok by reading my Python script, it was possible to finish getting the data in the right shape to compare with tools in PyOBO and the Bioregistry. The final step was to take the difference between the Wikidata triples and the ROR triples, filter for triples that make sense within the ROR schema (which for now is just part of and has part relationships), and then dump the results out; a rough sketch of this step is shown after the table at the end of this post. There were around 67K records before filtering and around 2.8K after filtering. Here are a few examples:

subjectROR | subjectLabel | predicate | predicateLabel | objectROR | objectLabel
---|---|---|---|---|---
00k4nrj32 | Essex County Hospital | P361 | part of | 02wnqcb97 | National Health Service
022efad20 | University of Gabès | P527 | has part(s) | 01hwc7828 | Institut des Régions Arides
04p4gjp18 | Center of Excellence on Hazardous Substance Management | P361 | part of | 028wp3y58 | Chulalongkorn University
04tnv7w23 | École Supérieure Polytechnique d'Antsiranana | P361 | part of | 00pd4qq98 | Université d'Antsiranana
02f4ya153 | Barro Colorado Island | P361 | part of | 01pp8nd67 | Smithsonian Institution

## Coda

The point of all of this was to automate adding the missing NFDI consortia relationships to the parent NFDI organization in ROR, because I'm interested in creating queries over the organization landscape related to NFDI to support an upcoming section on Internationalization. And like most things in my work life, I ended up cleaning some data and making upstream contributions along the way. Let's see how receptive ROR is to this! The triples are all here, and I can easily make them a different format for submission.

* * *

Caveat: if you look into the data, you might notice that some of the entities don't have labels. I realized this is happening because I haven't updated my PyOBO importer to get the 2.0 data dump from ROR, and I'm stuck on old version 1.36. This can be fixed independently of this workflow. Here are the rows related to the NFDI consortia that need new relations, which are all missing labels until I fix this.
subjectROR | subjectLabel | predicate | predicateLabel | objectROR | objectLabel
---|---|---|---|---|---
00enhv193 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
02cxb1m07 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
03xrvbe74 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
020tty630 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
04ncnzm65 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
01f5dqg10 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
001jhv750 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
0310v3480 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
01d2qgg03 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
01k9z4a50 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
03a4sp974 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
05wwzbv21 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
0305k8y39 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
0238fds33 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
03f6sdf65 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
0033j3009 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
01vnkaz16 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
01v7r4v08 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
04dy2xw62 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
01xptp363 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
034pbpe12 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
05nfk7108 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
00r0qs524 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
00bb4nn95 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
03fqpzb44 | | P361 | part of | 05qj6w324 | Nationale Forschungsdateninfrastruktur
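As mentioned in the "Putting it all together" section above, here's a minimal, hypothetical sketch of the difference-and-filter step using pandas; the file names and column names are illustrative and don't necessarily match my actual script:

```python
import pandas as pd

# Hypothetical inputs: both tables have ROR subject/object columns and a
# Wikidata property as the predicate (ROR's relations mapped by hand from BFO).
wikidata_df = pd.read_csv("wikidata_ror_triples.tsv", sep="\t")
ror_df = pd.read_csv("ror_triples.tsv", sep="\t")

columns = ["subjectROR", "predicate", "objectROR"]

# Set difference: triples asserted in Wikidata but missing from ROR.
merged = wikidata_df[columns].merge(
    ror_df[columns], on=columns, how="left", indicator=True
)
novel = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Keep only relation types ROR can represent: part of (P361) / has part(s) (P527).
novel = novel[novel["predicate"].isin({"P361", "P527"})]

novel.to_csv("suggested_ror_relations.tsv", sep="\t", index=False)
```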
cthoyt.com
September 25, 2025 at 4:41 PM
new workflow: asking people to contribute to collaborative documents, while simultaneously pleading with them not to fill it with AI slop
September 25, 2025 at 6:08 AM
do what they suggested, but also definitely keep using your personal email and not your institutional email, because https://cthoyt.com/2022/02/06/use-your-personal-email.html

https://scicomm.xyz/@ORCID_Org/115248475944944266
You Should Use a Private Email on Publications
While we were recently preparing to submit a manuscript, the lead author said they looked at my last few papers and noticed I always used a private email address instead of an institutional email address. They asked, perplexed, if they should also use my private email address with our submission. The answer was a resounding _yes_; always use a private email address. Here's why.

I actually started thinking about this way back one thousand years ago in 2020 and started an interesting discussion on Twitter:

> Lesson for young researchers - don't use your institutional email address on papers.
>
> You *will* leave, you won't get to keep it, and you'll miss out on lots of people who want to talk to you because of the interesting work you did.
>
> Try @ORCID_Org instead :) #AcademicChatter
>
> — Charles Tapley Hoyt (@cthoyt) June 23, 2020

I came back to this again when submitting the Gilda manuscript at the end of 2021, then got distracted by an ankle injury and sort of lost track of all the blog posts I had been writing. Now I'm finishing it up in early February 2022. The rest of this post is an elaboration on my ideas and follow-up discussion on Twitter.

## You can't take it with you

If you're a researcher who sometimes writes and submits publications, the chances are pretty high that you currently work at some kind of institution and will not always work at that institution. Here's what might happen to your institutional email address when you leave:

1. Your old institution doesn't care about you after you leave, and deletes your account the moment you walk out the door.
2. Your old institution doesn't care about you after you leave, and says that they will continue your access for a limited amount of time after you leave to save face. Then it deletes your account.
3. Your old institution doesn't care about you after you leave, and says that they will continue to provide technical support to you and all previous employees indefinitely with their infinite money and benevolence. I'm being sarcastic; this is really, really unlikely.

Whether 1, 2, or 3, if you used your institutional email address on a paper you published, it is now a dead link into the abyss. Anyone who might want to get in touch with you to chat about your research (or, god forbid, ask you for code or data that you didn't deposit in an appropriate place before publishing) is out of luck. If you had used your personal email address, which won't go away, then you wouldn't have this problem. Alternatively, some publishers allow you to annotate your ORCID identifier onto the manuscript, in which case you could potentially maintain your current working email address through ORCID, but again, not a lot of publishers support this (yet).

## Who does this most affect?

As you become more senior, the chances of you moving institutions decrease. So this is an issue that disproportionately affects young researchers twice: first because you are harder to reach, and second because your ability to network is more crucial as a young researcher than as a veteran one.

## Do your best to disregard institutional policy

Institutions obviously want as much attribution as possible when you publish while working there, and using a prominent institutional email address is one way to achieve that goal. Therefore, many institutions have a policy that you have to use your institutional email when publishing. A few courses of action you could take:

1. Ask the editor to include multiple email addresses
2. Disregard institutional policy; it doesn't support you as a researcher or a person.
## I'm worried about getting a bunch of spam

I can't speak for all situations, but I've been using my personal Gmail address all over the internet and still haven't gotten a ton of spam. It's sitting at the bottom of this blog post if you want to prove me wrong. If you're worried about this, make a new private email account that you just use for publishing.

## Difficulties for editors

One follow-up conversation I had on Twitter was with John R. Yates III, the editor-in-chief of the Journal of Proteome Research. He gave the interesting insight that editors highly prefer institutional email addresses because they are perceived to be more trustworthy. There are several reasons why this is not true (e.g., you could use an outdated address, spoof it, etc.), but ultimately this serves to distract from adopting more sustainable ways of identifying authors and reviewers, like ORCID.

## Relevant Twitter Threads

* https://twitter.com/travisdrake/status/1577970169951010817
cthoyt.com
September 22, 2025 at 3:09 PM