Jan Heinrich Merker
@heinrich.merker.id
📚 Researcher • 💻 Developer • 🇪🇺 European
PhD student for health-related information retrieval at @uni-jena.de × @webis.de
Reposted by Jan Heinrich Merker
We just released "German Commons", the largest openly-licensed German text dataset for LLM training: 154B tokens with clear usage rights for research and commercial use.

huggingface.co/datasets/coral-nlp/german-commons
coral-nlp/german-commons · Datasets at Hugging Face
huggingface.co
October 27, 2025 at 12:45 PM
Reposted by Jan Heinrich Merker
Honored to win the ICTIR Best Paper Honorable Mention Award for "Axioms for Retrieval-Augmented Generation"!
Our new axioms are integrated with ir_axioms: github.com/webis-de/ir_...
Nice to see axiomatic IR gaining momentum.
July 18, 2025 at 2:18 PM
Reposted by Jan Heinrich Merker
We presented two papers at ICTIR 2025 today:
- Axioms for Retrieval-Augmented Generation webis.de/publications...
- Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins webis.de/publications...
July 18, 2025 at 2:18 PM
Reposted by Jan Heinrich Merker
Happy to share that our paper "The Viability of Crowdsourcing for RAG Evaluation" received the Best Paper Honourable Mention at #SIGIR2025! Very grateful to the community for recognizing our work on improving RAG evaluation.

📄 webis.de/publications...
July 16, 2025 at 9:04 PM
Reposted by Jan Heinrich Merker
Let's replace search with "AI" then! Totally logical if you ask me. Even more worth it when you know they're exponentially overtaking the airline industry in their carbon footprint.

Study: www.cjr.org/tow_center/w...
March 12, 2025 at 9:34 PM
Reposted by Jan Heinrich Merker
Decades from now, the Covid-19 pandemic will be visible in the historical data of nearly anything measurable today. Here’s an incomplete collection of charts that capture that break — across the economy, health care, education, work, family life and more.
30 Charts That Show How Everything Changed in March 2020
It can be easy to forget, or look away from, the pain and disruption of the pandemic. The numbers will be there to remind us.
www.nytimes.com
March 10, 2025 at 7:24 AM
Reposted by Jan Heinrich Merker
What a team of keynote speakers. I must confess seeing that Steve Robertson will be there is a thrill. One of the legends of information retrieval reflecting on the field. #sigir2025

sigir2025.dei.unipd.it/keynote-spea...
SIGIR 2025, Padua, 13-18 July | Keynotes
The SIGIR 2025 keynotes are held by esteemed speakers: Robertson S., Gurevych I. and Frieder O., who will cover topics that range from AI in medical search and recommendation to BM25 and probabilistic ...
sigir2025.dei.unipd.it
December 24, 2024 at 6:11 AM
Reposted by Jan Heinrich Merker
🚨 New Pre-Print! 🚨 Reviewer 2 has once again asked for DL’19, what can you say in rebuttal? To help, we have re-annotated DL’19. Work done with @maik_froebe.bsky.social, @hscells.bsky.social, @fschlatt1.bsky.social, Guglielmo Faggioli, Saber Zerhoudi, @macavaney.bsky.social, Eugene Yang 🧵
March 3, 2025 at 10:18 AM
Reposted by Jan Heinrich Merker
Andrew Parry, Maik Fröbe, Harrisen Scells, Ferdinand Schlatt, Guglielmo Faggioli, Saber Zerhoudi, Sean MacAvaney, Eugene Yang
Variations in Relevance Judgments and the Shelf Life of Test Collections
https://arxiv.org/abs/2502.20937
March 3, 2025 at 5:32 AM
Reposted by Jan Heinrich Merker
I'm putting together a slide illustrating how generative AI is being forced on people even though they don't want it, and this is sort of funny.

Here are Google's autocomplete suggestions for "google gemini how to", and Bing's autocomplete suggestions for "microsoft copilot how to".
March 2, 2025 at 6:41 PM
Excited to have received 3k stars on GitHub! 🎉
github.com/janheinrichm...

Some stats:
⭐ 3,000 stars
🔀 579 forks
👁️ 413 followers
📚 108 public repositories
📖 101 open-source licensed
💾 3.7 GB of source code
✏️ 10,217 commits
💬 373 issues
🚀 233 pull requests

Thanks to all stargazers and followers! ☺️
December 2, 2024 at 10:49 PM
Great first day of #TREC2024!
Especially the panel on evaluations of RAG approaches was very insightful 👍
Excited to see GenIR evaluation getting more and more solid 🙂
November 18, 2024 at 9:24 PM
Actually, we already submitted some very similar approaches to TREC BioGen. Let's see how that plays out 😉
March 2, 2025 at 8:35 PM
Follow-up on our #BIOASQ2024 submission: We actually submitted the best approach for some of the tasks 👍
Looking forward to further improving Medical RAG! #CLEF2024
March 2, 2025 at 8:35 PM
https://www.theguardian.com/science/2024/feb/03/the-situation-has-become-appalling-fake-scientific-papers-push-research-credibility-to-crisis-point?CMP=Share_iOSApp_Other
March 2, 2025 at 8:35 PM
Reminder that research that relies on OpenAI models is (usually) not reproducible.
March 2, 2025 at 8:35 PM
📄 Pre-print: https://webis.de/publications.html?q=argument#reimer_2023b
💾 Code: https://t.co/FKq6Yx4cto
March 2, 2025 at 8:35 PM
Hence, we propose better few-shot and zero-shot stance detectors based on GPT-3.5 and Flan-T5.
Our GPT-3.5 stance detector reaches an F1 of 0.49 and is able to push the top-3 systems of Touché to the top of the leaderboard ⏫
March 2, 2025 at 8:35 PM
Our short paper “Stance-Aware Re-Ranking for Non-factual Comparative Queries” with @albondarenko2, @maik_froebe, and @matthias_hagen got accepted at #ArgMining 2023 ☺️

Takeaway: Improve nDCG by moving docs that take no stance down the result list.

@ArgminingOrg #EMNLP #NLProc
March 2, 2025 at 8:35 PM
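A rough illustration of the takeaway in the post above (moving documents that take no stance down the result list), not the paper's actual method. The function names, the stance label `"none"`, and the toy ranking and relevance gains are all made up for this sketch. It assumes stance labels are already available, and nDCG only improves here because the toy no-stance documents are also the non-relevant ones.

```python
import math

def demote_no_stance(ranking, stance_of):
    """Stable re-ranking: keep the relative order, but place documents
    without a stance after all documents that take one.
    The label "none" is a hypothetical convention for this sketch."""
    with_stance = [d for d in ranking if stance_of[d] != "none"]
    without_stance = [d for d in ranking if stance_of[d] == "none"]
    return with_stance + without_stance

def ndcg(ranking, gains, k=10):
    """nDCG@k with linear gains and a log2 rank discount."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranking[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    ideal_dcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy data (assumed for illustration only).
ranking = ["d1", "d2", "d3", "d4"]
stance_of = {"d1": "none", "d2": "pro", "d3": "none", "d4": "con"}
gains = {"d1": 0, "d2": 2, "d3": 0, "d4": 1}

reranked = demote_no_stance(ranking, stance_of)  # ['d2', 'd4', 'd1', 'd3']
print(ndcg(ranking, gains), ndcg(reranked, gains))
```

If no-stance documents were just as relevant as the others, the same re-ranking could leave nDCG unchanged or lower it, so the effect depends on the relevance judgments.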
With Vienna conveniently located at the center of the European rail network, I'm taking the night train via Berlin. Safe and sustainable travels, everyone! ☺️
March 2, 2025 at 8:35 PM
Now I'm also headed to @essir_eu 🚆
I'm looking forward to a week full of hands-on IR courses – especially on Saturday, where there will be several lectures about health-related IR!
#essir2023
March 2, 2025 at 8:35 PM
What a week 😄
1️⃣ Our resource paper is accepted at @SIGIRConf #sigir2023 📄🔍
2️⃣ I submitted my Master's thesis 🎉☺️
Now it's time for vacation until I join @webis_de as a PhD student in May 👍
March 2, 2025 at 8:35 PM
The other direction is more problematic: how can we as researchers use models/code/ideas from companies that don't publish the underlying concepts? GPT-4 etc. are effectively black boxes and should therefore be used very carefully.
March 2, 2025 at 8:35 PM
I don't agree. We're doing science for the public and that includes companies as well. If you don't want your models/code/ideas to be used by anyone, then write a patent instead of a paper. Or release your model weights and data under some less permissive license.
March 2, 2025 at 8:35 PM