Lightnews — Scholar-powered news

Reposted by Jan Heinrich Merker

Webis Group

@webis.de

We just released "German Commons", the largest openly-licensed German text dataset for LLM training: 154B tokens with clear usage rights for research and commercial use.

huggingface.co/datasets/coral-nlp/german-commons

October 27, 2025 at 12:45 PM

Reposted by Jan Heinrich Merker

Webis Group

@webis.de

Honored to win the ICTIR Best Paper Honorable Mention Award for "Axioms for Retrieval-Augmented Generation"!
Our new axioms are integrated with ir_axioms: github.com/webis-de/ir_...
Nice to see axiomatic IR gaining momentum.

July 18, 2025 at 2:18 PM

Reposted by Jan Heinrich Merker

Webis Group

@webis.de

We presented two papers at ICTIR 2025 today:
- Axioms for Retrieval-Augmented Generation webis.de/publications...
- Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins webis.de/publications...

July 18, 2025 at 2:18 PM

Reposted by Jan Heinrich Merker

Webis Group

@webis.de

Happy to share that our paper "The Viability of Crowdsourcing for RAG Evaluation" received the Best Paper Honourable Mention at #SIGIR2025! Very grateful to the community for recognizing our work on improving RAG evaluation.

📄 webis.de/publications...

July 16, 2025 at 9:04 PM

Reposted by Jan Heinrich Merker

Timnit Gebru

@timnitgebru.bsky.social

Let's replace search with "AI" then! Totally logical if you ask me. Even more worth it when you know they're exponentially overtaking the airline industry in their carbon footprint.

Study: www.cjr.org/tow_center/w...

Tweet by Gisele Navarro @ichbinGisele

“Collectively, the AI search engines provided incorrect answers to more than 60% of queries.” ~ @CJR

Screenshot of an image showing "Generative search tools were often confidently wrong in our study", with bar plots of different offerings like ChatGPT, Perplexity and DeepSeek, showing the percentage of wrong answers in red and right ones in green.

March 12, 2025 at 9:34 PM

Reposted by Jan Heinrich Merker

The New York Times

@nytimes.com

Decades from now, the Covid-19 pandemic will be visible in the historical data of nearly anything measurable today. Here’s an incomplete collection of charts that capture that break — across the economy, health care, education, work, family life and more.

30 Charts That Show How Everything Changed in March 2020

It can be easy to forget, or look away from, the pain and disruption of the pandemic. The numbers will be there to remind us.

www.nytimes.com

March 10, 2025 at 7:24 AM

Reposted by Jan Heinrich Merker

Arjen P. de Vries Timmers 🕊️

@arjen.idf.social.ap.brid.gy

New preprint of WSDM demo by @maik_froebe @matthias and Ferdinand Schlatt

Lightning IR: Straightforward Fine-tuning and Inference of Transformer-based Language Models for Information Retrieval https://arxiv.org/abs/2411.04677

https://webis.de/lightning-ir/

A wide range of transformer-based language models have been proposed for information retrieval tasks. However, including transformer-based models in retrieval pipelines is often complex and requires substantial engineering effort. In this paper, we introduce Lightning IR, an easy-to-use PyTorch Lightning-based framework for applying transformer-based language models in retrieval scenarios. Lightning IR provides a modular and extensible architecture that supports all stages of a retrieval pipeline: from fine-tuning and indexing to searching and re-ranking. Designed to be scalable and reproducible, Lightning IR is available as open-source: https://github.com/webis-de/lightning-ir.

arxiv.org

December 19, 2024 at 9:07 PM

Reposted by Jan Heinrich Merker

iroldie.bsky.social

@iroldie.bsky.social

What a team of keynote speakers. I must confess seeing that Steve Robertson will be there is a thrill. One of the legends of information retrieval reflecting on the field. #sigir2025

sigir2025.dei.unipd.it/keynote-spea...

SIGIR 2025, Padua, 13-18 July | Keynotes

The SIGIR 2025 keynotes are held by esteemed speakers: Robertson S., Gurevych I. and Frieder O., who will cover topics that range from AI in medical search and recommendation to BM25 and probabilistic ...

sigir2025.dei.unipd.it

December 24, 2024 at 6:11 AM

Reposted by Jan Heinrich Merker

mrparryparry.bsky.social

@mrparryparry.bsky.social

🚨 New Pre-Print! 🚨 Reviewer 2 has once again asked for DL’19, what can you say in rebuttal? To help, we have re-annotated DL’19. Work done with @maik_froebe.bsky.social, @hscells.bsky.social, @fschlatt1.bsky.social, Guglielmo Faggioli, Saber Zerhoudi, @macavaney.bsky.social, Eugene Yang 🧵

March 3, 2025 at 10:18 AM

Reposted by Jan Heinrich Merker

arxiv cs.IR

@arxiv-cs-ir.bsky.social

Andrew Parry, Maik Fröbe, Harrisen Scells, Ferdinand Schlatt, Guglielmo Faggioli, Saber Zerhoudi, Sean MacAvaney, Eugene Yang
Variations in Relevance Judgments and the Shelf Life of Test Collections
https://arxiv.org/abs/2502.20937

March 3, 2025 at 5:32 AM

Reposted by Jan Heinrich Merker

Carl T. Bergstrom

@carlbergstrom.com

I'm putting together a slide illustrating how generative AI is being forced on people even though they don't want it, and this is sort of funny.

Here are Google's autocomplete suggestions for "google gemini how to", and Bing's autocomplete suggestions for "microsoft copilot how to".

Gemini: autocomplete suggestions include "how to turn off", "how to disable", and "how to remove".

Copilot: autocomplete suggestions include "how to disable", "how to delete", "how to uninstall", "how to turn off", and "how to remove."

March 2, 2025 at 6:41 PM

Jan Heinrich Merker

@heinrich.merker.id

Excited to have received 3k stars on GitHub! 🎉
github.com/janheinrichm...

Some stats:
⭐ 3,000 stars
🔀 579 forks
👁️ 413 followers
📚 108 public repositories
📖 101 open-source licensed
💾 3.7 GB of source code
✏️ 10,217 commits
💬 373 issues
🚀 233 pull requests

Thanks to all stargazers and followers! ☺️

December 2, 2024 at 10:49 PM

Jan Heinrich Merker

@heinrich.merker.id

Great first day of #TREC2024!
Especially the panel on evaluations of RAG approaches was very insightful 👍
Excited to see GenIR evaluation getting more and more solid 🙂

November 18, 2024 at 9:24 PM

Jan Heinrich Merker

@heinrich.merker.id

Actually, we already submitted some very similar approaches to TREC BioGen. Let's see how that plays out 😉

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

Follow-up on our #BIOASQ2024 submission: We actually submitted the best approach for some of the tasks 👍
Looking forward to further improving Medical RAG! #CLEF2024

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

https://www.theguardian.com/science/2024/feb/03/the-situation-has-become-appalling-fake-scientific-papers-push-research-credibility-to-crisis-point?CMP=Share_iOSApp_Other

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

Reminder that research that relies on OpenAI models is (usually) not reproducible.

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

📄 Pre-print: https://webis.de/publications.html?q=argument#reimer_2023b
💾 Code: https://t.co/FKq6Yx4cto

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

Hence, we propose better few-shot and zero-shot stance detectors based on GPT-3.5 and Flan-T5.
Our GPT-3.5 stance detector reaches an F1 of 0.49 and is able to push the top-3 systems of Touché to the top of the leaderboard ⏫

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

Our short paper “Stance-Aware Re-Ranking for Non-factual Comparative Queries” with @albondarenko2, @maik_froebe, and @matthias_hagen got accepted at #ArgMining 2023 ☺️

Takeaway: Improve nDCG by moving docs that take no stance down the result list.

@ArgminingOrg #EMNLP #NLProc

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

With Vienna conveniently located at the center of the European rail network, I'm taking the night train via Berlin. Safe and sustainable travels, everyone! ☺️

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

Now I'm also headed to @essir_eu 🚆
I'm looking forward to a week full of hands-on IR courses – especially on Saturday, where there will be several lectures about health-related IR!
#essir2023

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

What a week 😄
1️⃣ Our resource paper is accepted at @SIGIRConf #sigir2023 📄🔍
2️⃣ I submitted my Master's thesis 🎉☺️
Now it's time for vacation until I join @webis_de as a PhD student in May 👍

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

The other direction is more problematic: how can we as researchers use models/code/ideas from companies that don't publish the underlying concepts? GPT-4 etc. are effectively black boxes and should therefore be used very carefully.

March 2, 2025 at 8:35 PM

Jan Heinrich Merker

@heinrich.merker.id

I don't agree. We're doing science for the public and that includes companies as well. If you don't want your models/code/ideas to be used by anyone, then write a patent instead of a paper. Or release your model weights and data under some less permissive license.

March 2, 2025 at 8:35 PM
