Lightnews — Scholar-powered news

Alix Chagué 🌈

@alix-tz.bsky.social

280 followers 160 following 65 posts

PhD candidate at @Inria and @UMontreal working on automatic transcription of manuscripts (HTR). Posts about DH stuff and #HTR_United. More on my research blog: alix-tz.github.io/phd


Alix Chagué 🌈

@alix-tz.bsky.social

Already on my way back, but these past few days I was in Princeton at the IAS to talk about HTR and its future with a whole lot of interesting people I was happy to meet or see again!
The IAS campus is quite a unique place, I'm very happy I was given the opportunity to travel there! 🦌🌳

Photo of the nature surrounding the campus of the Institute for Advanced Study at Princeton

June 14, 2025 at 2:36 PM

Alix Chagué 🌈

@alix-tz.bsky.social

If you are attending #CHR2023 and want to spread the good word about HTR-United (which we are not presenting at CHR), Thibault and I have lots of stickers in our pockets! Feel free to come talk to us to get some!

A photo of the two very swaggy stickers at your disposal to share with everybody the awesomeness of HTR-United!

December 6, 2023 at 11:56 AM

Alix Chagué 🌈

@alix-tz.bsky.social

Very very honored and delighted to have been awarded the Prize for Open Science and Research Data (Prix Science Ouverte des données de la recherche) with @ponteineptique.bsky.social for our project #HTR_United! It is such an honor!

Screenshot of the tweet posted on November 29 2023 by @ouvrirlascience on X/Twitter to announce the awards attributed to HTR-United.

The two very happy founders and editors of HTR-United, on each side of their trophy

November 29, 2023 at 1:14 PM

Alix Chagué 🌈

@alix-tz.bsky.social

Each catalog entry appears as a card displaying all the information needed to get an overview of the dataset. Each card is given a URI, so you can permanently point to its description, and a BibTeX citation file is generated from the provided metadata, inviting reusers to cite the data!

Screenshot of HTR-United's catalog page where we can see several cards describing datasets.

Screenshot of a "full-size" record on HTR-United (https://htr-united.github.io/share.html?uri=782b1e7da) showing additional information, namely the transcription guidelines and a bibtex citation block.

October 30, 2023 at 2:11 PM
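
To illustrate how a citation can be derived from catalog metadata, here is a minimal Python sketch. It is not HTR-United's actual generator: the function name `to_bibtex`, the field names (`key`, `title`, `authors`, `year`, `url`) and the sample record are all assumptions made for the example.

```python
def to_bibtex(entry):
    """Build a BibTeX @misc entry from a metadata dict (hypothetical fields)."""
    fields = {
        "title": entry["title"],
        "author": " and ".join(entry["authors"]),
        "year": str(entry["year"]),
        "url": entry["url"],
    }
    # One "  name = {value}" line per field, separated by commas.
    body = ",\n".join(f"  {name} = {{{value}}}" for name, value in fields.items())
    return f"@misc{{{entry['key']},\n{body}\n}}"


# Made-up record, used only to show the output shape.
example = {
    "key": "example2023",
    "title": "Example dataset",
    "authors": ["Doe, Jane", "Roe, Richard"],
    "year": 2023,
    "url": "https://example.org/dataset",
}

print(to_bibtex(example))
```

Because the citation is computed from the same metadata as the card, the two cannot drift apart when a description is updated.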

Alix Chagué 🌈

@alix-tz.bsky.social

... metadata generation (for volumes such as lines, characters, etc.), and character set documentation and homogenization. It's possible to use them via GitHub Actions so that they are automatically applied to your dataset when you update it (more: htr-united.github.io/actions.html)

October 30, 2023 at 1:54 PM
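
As a rough illustration of what "metadata generation" for volumes and character sets can mean, the following Python sketch counts lines and characters in a list of transcribed lines and records how often each character occurs. It is a simplified assumption, not the HTR-United tooling itself: the function name `describe_volume` and the returned keys are hypothetical.

```python
from collections import Counter


def describe_volume(lines):
    """Count lines and characters, and tally each character (hypothetical helper)."""
    charset = Counter("".join(lines))
    return {
        "lines": len(lines),
        "characters": sum(charset.values()),
        "charset": charset,
    }


# Two transcribed lines taken from the examples quoted in this thread.
stats = describe_volume([
    "Je lui lègue en conséquence la",
    "Having been invited by the committee of the Institut de",
])
print(stats["lines"], stats["characters"])
print(sorted(stats["charset"]))
```

A character tally like this makes unexpected symbols easy to spot, which is the kind of information character set documentation is meant to expose.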

Alix Chagué 🌈

@alix-tz.bsky.social

... and tools for quality control! This is a big aspect in the ecosystem revolving around HTR-United. Actually, you can find more information about it on our website (htr-united.github.io/tools.html). Tools are designed for: XML file validation, YAML file and catalog validation, ...

October 30, 2023 at 1:43 PM

Alix Chagué 🌈

@alix-tz.bsky.social

We want a more stable and structured way to describe ground truth datasets. The goal is to make the descriptions more reliable so that users can more quickly identify the datasets best suited for their projects. In our opinion, it increases the chances of a dataset being reused: it's a win-win situation!

(Meme) Drakeposting about wild dataset descriptions vs. standardized dataset descriptions.

October 30, 2023 at 12:09 PM

Alix Chagué 🌈

@alix-tz.bsky.social

HTR-United actually relies on a single structured (YAML) text file gathering all the dataset descriptions. These descriptions are provided by the creators of the datasets, who take the time to write the description and submit it to us (more about that in a few skeets).

Screenshot of an extract of the catalog file: it's just a YAML text file.

October 30, 2023 at 12:00 PM
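
To show why a single structured file is convenient, here is a small Python sketch that filters catalog-like entries by language and script type. The entries are written directly as Python dictionaries, standing in for what a YAML parser would return. The field names (`title`, `language`, `script-type`, `url`), their values and the function `find_datasets` are assumptions for illustration and may not match the real catalog schema.

```python
# Made-up entries, standing in for parsed YAML records.
catalog = [
    {"title": "Example A", "language": ["fra"], "script-type": "only-manuscript",
     "url": "https://example.org/a"},
    {"title": "Example B", "language": ["lat", "fra"], "script-type": "only-typed",
     "url": "https://example.org/b"},
    {"title": "Example C", "language": ["deu"], "script-type": "only-manuscript",
     "url": "https://example.org/c"},
]


def find_datasets(entries, language, script_type=None):
    """Return entries covering `language`, optionally restricted to one script type."""
    return [
        entry for entry in entries
        if language in entry["language"]
        and (script_type is None or entry["script-type"] == script_type)
    ]


for entry in find_datasets(catalog, "fra", script_type="only-manuscript"):
    print(entry["title"], entry["url"])
```

With consistent fields across all descriptions, this kind of query stays a few lines long no matter how many datasets are listed.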

Alix Chagué 🌈

@alix-tz.bsky.social

That's when HTR-United comes in! It is a catalog of descriptions redirecting to publicly available datasets of ground truth of different sizes, languages, writing systems, periods, etc.

We are based in France and work on French or Latin docs, and it shows in the current state of the catalog... 🤷

Graphic showing the distribution of scripts (or writing systems) across the datasets recorded in HTR-United as of the end of October 2023. The represented scripts are Latin (at least 70), Greek, Arabic, Devanagari, Malayalam, Latin-Fraktur and Hebrew (fewer than 10 for all, aside from Latin).

Graphic showing the distribution of types of scripts across the datasets recorded in HTR-United as of the end of October 2023. +40 are tagged as "only manuscript", 25 as "only typed" (or printed), about 10 are "mainly manuscript" and fewer than 10 are "evenly mixed", meaning an even mix of printed and handwritten text.

Graphic showing the distribution of languages across the datasets recorded in HTR-United as of the end of October 2023.
The represented languages are, in decreasing order, French (+40), Latin (15), Middle French (10), German, English, Italian, Old French, Spanish, Ancient Greek, Arabic, Portuguese, Hindi, Sanskrit, Braj, Malayalam, Dutch, Middle High German, Catalan, Corsican, Hebrew and Czech.

Graphic showing the distribution of the data across time (800 to 2023) depending on the metric used (characters, files, lines, regions, pages, images). Not all datasets provide information for every possible metric, but if we focus on the graph for the number of lines, we see that the period between 1600 and 1900 is well covered and that there is a good amount of data for the period from 1000 to 1450.

October 30, 2023 at 11:50 AM

Alix Chagué 🌈

@alix-tz.bsky.social

Ground truth for text recognition models takes the form of matching pairs of images and transcriptions. As in the examples below, an image is paired with the expected transcription (cf. ALT text). It can take hundreds or thousands of such pairs to train a transcription model.

A line of text taken from a typewritten document, saying "Having been invited by the committee of the Institut de"

Image of a text line written by hand in French where one can read "Je lui lègue en conséquence la" (which can translate to "I therefore bequeath to her the")

October 30, 2023 at 11:14 AM
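
The image and transcription pairing described in this post can be sketched as a small Python data structure. This is only an illustration: the class name `GroundTruthPair`, the helper `total_characters` and the image paths are hypothetical, while the two transcriptions are the ones quoted in the ALT texts above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundTruthPair:
    image_path: str      # hypothetical path to the image of one text line
    transcription: str   # text a model is expected to output for that image


pairs = [
    GroundTruthPair(
        "lines/typewritten_0001.png",
        "Having been invited by the committee of the Institut de",
    ),
    GroundTruthPair(
        "lines/handwritten_0001.png",
        "Je lui lègue en conséquence la",
    ),
]


def total_characters(items):
    """Sum the transcription lengths, a common way to express dataset volume."""
    return sum(len(item.transcription) for item in items)


print(len(pairs), "pairs,", total_characters(pairs), "characters")
```

A real training set would hold hundreds or thousands of such pairs, usually stored in a dedicated format rather than built by hand.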
