Lightnews — Scholar-powered news

Thibault Clérice

@ponteineptique.bsky.social

820 followers 230 following 180 posts

On vacation.
Digital humanist, loves Python, making data, talking to data, reusing data.
Researcher @ ALMAnaCh, Inria Paris.


Thibault Clérice

@ponteineptique.bsky.social

I received a second phishing email this morning, and I have not seen any warnings from @comphumresearch.bsky.social

November 13, 2025 at 7:07 AM

Thibault Clérice

@ponteineptique.bsky.social

TranscriboQuest Arabic team: "The issue is layout recognition. So we worked on that."

September 5, 2025 at 12:11 PM

Thibault Clérice

@ponteineptique.bsky.social

Glosses in medieval manuscripts! Segmentation and transcription work

September 5, 2025 at 12:03 PM

Thibault Clérice

@ponteineptique.bsky.social

The Ancient Greek team worked on papyri from Herculaneum, where all segmentation and transcription had to be started from scratch

September 5, 2025 at 11:59 AM

Thibault Clérice

@ponteineptique.bsky.social

Catalog of books from a library in Toulouse

September 5, 2025 at 11:53 AM

Thibault Clérice

@ponteineptique.bsky.social

Medical recipes from the 17th century, in manuscripts written primarily by women.

September 5, 2025 at 11:52 AM

Thibault Clérice

@ponteineptique.bsky.social

The second presentation is a dataset for HTR of 18th-century Breton-language manuscripts

September 5, 2025 at 11:45 AM

Thibault Clérice

@ponteineptique.bsky.social

End of TranscriboQuest 2025, funded by @biblissima.bsky.social and @atrium-eu.bsky.social, and we are starting to present each team's dataset.
First, medieval vernaculars: German, Swedish, Irish, Spanish

September 5, 2025 at 11:41 AM

Thibault Clérice

@ponteineptique.bsky.social

Medievalists and other historian colleagues: at Amiens Cathedral, we came across a wall inscription of indulgences.
Beneath the names of those pardoned, "nicknames" or short phrases are visible. What is their meaning or purpose?
We found some that are quite funny from a modern point of view

March 17, 2025 at 8:51 AM

Thibault Clérice

@ponteineptique.bsky.social

3/4 Some scripts got a little more love, specifically Praegothica.

March 12, 2025 at 9:17 AM

Thibault Clérice

@ponteineptique.bsky.social

2/4 Languages that have more data, besides Occitan, are:
- French
- Latin
- Italian and languages of Italy
- Spanish and languages of Spain

March 12, 2025 at 9:17 AM

Thibault Clérice

@ponteineptique.bsky.social

My slow read of The Question is the latest thing to hit me hard. It's heavy...

February 13, 2025 at 6:50 PM

Thibault Clérice

@ponteineptique.bsky.social

We are very happy to publicly release the CATMuS Medieval dataset on @huggingface.bsky.social:

huggingface.co/datasets/CAT...

This dataset is unique in the space of HTR, as it includes more than 160,000 lines of ground truth in 10 languages over 9 centuries (8th to 16th century CE) in Latin scripts across 208 documents.

February 13, 2024 at 12:56 PM

Thibault Clérice

@ponteineptique.bsky.social

7/ Out of domain examples just for fun :)

December 18, 2023 at 9:36 AM

Thibault Clérice

@ponteineptique.bsky.social

4/ On out-of-domain manuscripts, it regularly performs above 90% accuracy for Old French and Latin (sometimes with lower scores, of course), but it also performs fairly well on unknown languages.

December 18, 2023 at 9:23 AM
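
To make the "above 90% accuracy" figure concrete, here is a minimal sketch of one common way HTR accuracy is computed. It assumes that accuracy means character accuracy (1 minus the character error rate), which the post does not state; the function names and the sample strings are hypothetical.

```python
def levenshtein(reference, hypothesis):
    """Count single-character insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference, hypothesis):
    """Assumed metric: 1 - (edit distance / reference length)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)


# Hypothetical ground-truth line and model output.
ground_truth = "li rois artus"
prediction = "li roys artu"
score = character_accuracy(ground_truth, prediction)
print(f"{score:.3f}")
```

Under this assumption, two wrong characters in a 13-character line already pull the score below 90%, which gives a sense of how demanding that threshold is.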

Thibault Clérice

@ponteineptique.bsky.social

3/
The dataset shows substantial variation in genre, language, and script type, all with the same transcription practices, whether it's Middle Dutch, Old French, Navarese, Latin, etc.

December 18, 2023 at 9:21 AM

Thibault Clérice

@ponteineptique.bsky.social

(11) After benchmarking on a test corpus, we looked at the analysis of Voicu and his predecessors. To summarize: our results mostly align with theirs. In the figure below, the lower a group is on the y axis, and the darker it is, the more certain it is that the group of texts comes from a single author.

October 18, 2023 at 10:39 AM

Thibault Clérice

@ponteineptique.bsky.social

(10) We went further than simply integrating SNR-D: we also reused many tricks from image Siamese networks, proposing an architecture that uses:
- SNR-D and SNR contrastive loss
- Pair miners (which try to make learning more efficient)
- Class Pooling

October 18, 2023 at 10:38 AM
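
As a rough illustration of how a contrastive loss and a pair miner fit together, here is a minimal pure-Python sketch. It is not the architecture from the paper: the SNR distance is assumed to be var(other - anchor) / var(anchor), the miner simply keeps pairs that would produce a non-zero loss, and the margins, embeddings, and labels are made up. Class pooling is left out.

```python
from itertools import combinations
from statistics import pvariance


def snr_distance(anchor, other):
    """Assumed SNR distance: variance of the difference over variance of the anchor."""
    noise = [y - x for x, y in zip(anchor, other)]
    return pvariance(noise) / pvariance(anchor)


def mine_pairs(embeddings, labels, pos_margin=0.0, neg_margin=1.0):
    """Keep only the pairs that would contribute a non-zero loss."""
    positives, negatives = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        d = snr_distance(embeddings[i], embeddings[j])
        if labels[i] == labels[j] and d > pos_margin:
            positives.append((i, j, d))   # same class, still too far apart
        elif labels[i] != labels[j] and d < neg_margin:
            negatives.append((i, j, d))   # different class, too close
    return positives, negatives


def contrastive_loss(positives, negatives, pos_margin=0.0, neg_margin=1.0):
    """Pull mined positives together, push mined negatives past the margin."""
    terms = [d - pos_margin for _, _, d in positives]
    terms += [neg_margin - d for _, _, d in negatives]
    return sum(terms) / len(terms) if terms else 0.0


# Hypothetical 3-dimensional embeddings for two classes.
embeddings = [[1.0, 2.0, 3.0], [1.0, 2.0, 4.0], [1.0, 3.0, 3.0], [5.0, 0.0, 9.0]]
labels = ["A", "A", "B", "B"]
positives, negatives = mine_pairs(embeddings, labels)
loss = contrastive_loss(positives, negatives)
print(len(positives), len(negatives), round(loss, 3))
```

The miner discards the negative pairs that are already far apart, so the loss is averaged only over pairs that can still teach the model something.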

Thibault Clérice

@ponteineptique.bsky.social

(8) Manhattan slightly outperformed SNR-D, but when we looked at the results, something did not feel right... Well, it's simple: Manhattan varied widely from one model to another, which led to different results in the end.

October 18, 2023 at 10:38 AM

Thibault Clérice

@ponteineptique.bsky.social

(7) The last one was found during a benchmark, and seemed in early experiments to yield better results. After training models over a set with known classes (namely, Church Fathers in Greek, see the appendix in the paper), we saw Manhattan and SNR-D at the top of the ranking!

October 18, 2023 at 10:38 AM

Thibault Clérice

@ponteineptique.bsky.social

(6) ... if two things are of the same class or not. To do so, they usually use some form of distance, like in traditional stylometry. We tested three distances:
- Manhattan (L1)
- Euclidean (L2)
- Signal-to-Noise Ratio Distance (SNR-D)

October 18, 2023 at 10:37 AM
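
The three distances named in this post can be sketched in a few lines of Python. Manhattan and Euclidean are standard; for SNR-D the sketch assumes the definition var(other - anchor) / var(anchor) used in the metric-learning literature, which may differ in detail from the version used in the paper. The vectors are made up.

```python
from statistics import pvariance


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def snr_distance(anchor, other):
    """Assumed SNR-D: variance of the difference over variance of the anchor."""
    noise = [y - x for x, y in zip(anchor, other)]
    return pvariance(noise) / pvariance(anchor)


# Hypothetical feature vectors for two texts.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print(manhattan(a, b))     # 6.0
print(euclidean(a, b))     # square root of 14
print(snr_distance(a, b))  # 1.0
print(snr_distance(b, a))  # 0.25
```

Unlike L1 and L2, the SNR distance in this form is not symmetric: swapping the anchor changes the denominator, so the choice of anchor matters.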
