Thibault Clérice
ponteineptique.bsky.social
On vacation.
Digital humanist, loves Python, making data, talking to data, reusing data.
Researcher @ ALMAnaCh, Inria Paris.
I received a second phishing email this morning, and I have not seen any warnings from @comphumresearch.bsky.social
November 13, 2025 at 7:07 AM
TranscriboQuest Arabic team: "The issue is layout recognition. So we worked on that."
September 5, 2025 at 12:11 PM
Glosses in medieval manuscripts! Segmentation and transcription work
September 5, 2025 at 12:03 PM
The Ancient Greek team worked on papyri from Herculaneum, where all segmentation and transcription had to be started from scratch
September 5, 2025 at 11:59 AM
Catalog of books from a library in Toulouse
September 5, 2025 at 11:53 AM
Medical recipes from the 17th century, in manuscripts written primarily by women.
September 5, 2025 at 11:52 AM
The second presentation is a dataset for HTR of 18th-century Breton-language manuscripts
September 5, 2025 at 11:45 AM
End of TranscriboQuest 2025, funded by @biblissima.bsky.social and @atrium-eu.bsky.social, and we are starting to present each team's datasets.
First, medieval vernaculars with German, Swedish, Irish, Spanish
September 5, 2025 at 11:41 AM
Medievalists and other historian colleagues: at Amiens Cathedral, we came across a wall inscription of indulgences.
Under the names of the pardoned, "nicknames" or short phrases are visible. What is their meaning / purpose?
We found some that are quite funny from a modern point of view
March 17, 2025 at 8:51 AM
3/4 Some scripts got a little more love, specifically Praegothica.
March 12, 2025 at 9:17 AM
2/4 Languages that have more data, besides Occitan, are:
- French
- Latin
- Italian and languages of Italy
- Spanish and languages of Spain
March 12, 2025 at 9:17 AM
My slow read of The Question is the latest thing to hit me hard. It's heavy...
February 13, 2025 at 6:50 PM
We are very happy to publicly release the CATMuS Medieval dataset on @huggingface.bsky.social :

huggingface.co/datasets/CAT...

This dataset is unique in the space of HTR, as it includes more than 160,000 lines of ground truth in 10 languages over 9 centuries (8th to 16th century CE) in Latin scripts, across 208 documents.
February 13, 2024 at 12:56 PM
7/ Out of domain examples just for fun :)
December 18, 2023 at 9:36 AM
4/ On out-of-domain manuscripts, it regularly performs above 90% accuracy for Old French and Latin (sometimes with lower scores, of course), but it also performs fairly well on unknown languages.
December 18, 2023 at 9:23 AM
3/
The dataset varies considerably in genre, language and script type. All with the same transcription practices, whether it's Middle Dutch, Old French, Navarrese, Latin, etc.
December 18, 2023 at 9:21 AM
(11) After benchmarking on a test corpus, we looked at the analysis of Voicu and his predecessors. To summarize: our results mostly align with theirs. In the figure below, the lower a point is on the y axis, and the darker it is, the more certain it is that a group of texts is by a single author.
October 18, 2023 at 10:39 AM
(10) We went further than simply integrating SNR-D: we also reused many tricks from image Siamese networks and proposed an architecture using:
- SNR-D and SNR contrastive loss
- Pair miners (which try to make learning more efficient)
- Class Pooling
October 18, 2023 at 10:38 AM
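To make the first two ingredients of post (10) concrete, here is a minimal sketch of a pairwise contrastive loss and a naive pair miner. It assumes the classic margin-based formulation of contrastive loss, which may differ from the exact loss used in the paper; the function names, the margin value and the toy distances are all hypothetical.

```python
def contrastive_loss(distance, same_class, margin=1.0):
    """Classic pairwise contrastive loss for a single pair (assumed form).

    Same-class pairs are pulled together (loss grows with distance);
    different-class pairs are pushed apart until they exceed the margin.
    """
    if same_class:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def mine_informative_pairs(pairs, margin=1.0):
    """Naive pair miner (hypothetical): keep only pairs with non-zero loss.

    Pairs that are already well placed contribute nothing to the gradient,
    so dropping them is one simple way to make training more efficient.
    """
    return [
        (distance, same_class)
        for distance, same_class in pairs
        if contrastive_loss(distance, same_class, margin) > 0.0
    ]


# Toy (distance, same_class) pairs, invented for illustration.
pairs = [(0.2, True), (0.0, True), (0.4, False), (1.5, False)]
kept = mine_informative_pairs(pairs)
```

In this toy run the miner discards the same-class pair at distance 0 and the different-class pair already beyond the margin, since neither would change the model.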
(8) Manhattan slightly outperformed SNR-D, but when we looked at the results, something did not feel right... Well, it's simple: Manhattan varied a lot from one model to the next, which led to different results in the end.
October 18, 2023 at 10:38 AM
(7) The last one was found during a benchmark, and in early experiments it seemed to yield better results. After training models on a set with known classes (namely, Church Fathers in Greek, see the appendix in the paper), we saw Manhattan and SNR-D at the top of the ranking!
October 18, 2023 at 10:38 AM
(6) ... if two things are of the same class or not. To do so, they usually use some form of distance, like in traditional stylometry. We tested three distances:
- Manhattan (L1)
- Euclidean (L2)
- Signal-to-Noise Ratio Distance (SNR-D)
October 18, 2023 at 10:37 AM
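The three distances listed in post (6) can be sketched in a few lines of plain Python. The SNR-D definition used here, the variance of the difference between two vectors divided by the variance of the anchor vector, is the commonly cited one and is an assumption about what the paper uses; the function names and toy vectors are hypothetical.

```python
from statistics import pvariance


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of the summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def snr_distance(anchor, other):
    """SNR-D (assumed definition): var(other - anchor) / var(anchor).

    The difference is treated as noise and the anchor as signal.
    Unlike L1 and L2, this is not symmetric in its arguments.
    """
    noise = [y - x for x, y in zip(anchor, other)]
    return pvariance(noise) / pvariance(anchor)


# Toy embeddings, invented for illustration.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

l1 = manhattan(a, b)
l2 = euclidean(a, b)
snr_ab = snr_distance(a, b)
snr_ba = snr_distance(b, a)
```

Swapping the arguments changes the SNR-D value because the denominator is the variance of whichever vector plays the anchor, whereas Manhattan and Euclidean give the same value either way.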