Webis Group (@webis.de) · webis.de
Information is nothing without retrieval. The Webis Group contributes to information retrieval, natural language processing, machine learning, and symbolic AI.
The data spans 7 text domains:
🌐 Web: Wikipedia, GitHub, social media
💬 Political: Parliamentary proceedings, speeches
⚖️ Legal: Court decisions, federal & EU law
📰 News: Newspaper archives
🏦 Economics: Public tenders
📚 Cultural: Digital heritage collections
🔬 Scientific: Papers, books, journals
This means:
✅ Every document has verifiable usage rights (CC BY-SA 4.0 at the most restrictive; all licenses permit commercial use)
✅ Full institutional provenance for reduced compliance risks
✅ Systematic PII removal + quality filtering, ready for training
✅ Rich metadata for downstream customization
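The PII-removal step mentioned above can be illustrated with a minimal, self-contained sketch. This is not the actual German Commons pipeline; the regexes, placeholder tokens, and function name are illustrative assumptions:

```python
import re

# Illustrative PII scrubbing pass (NOT the German Commons pipeline itself):
# replace e-mail addresses and phone-like digit runs with placeholder tokens.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d /-]{7,}\d")

def scrub_pii(text: str) -> str:
    """Return text with e-mails and phone numbers masked."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text
```

A production pipeline would cover many more PII categories (names, addresses, IDs) and typically uses trained NER models rather than regexes alone.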
The current problem: LLM training data is primarily sourced from Web crawls, which offer scale but come with unclear licensing. This blocks the resulting models from commercial deployment and hampers research. We took a different path: systematically collecting German text from 41 institutional sources with explicit open licenses.
We just released "German Commons", the largest openly-licensed German text dataset for LLM training: 154B tokens with clear usage rights for research and commercial use.

huggingface.co/datasets/coral-nlp/german-commons
Honored to win the ICTIR Best Paper Honorable Mention Award for "Axioms for Retrieval-Augmented Generation"!
Our new axioms are integrated with ir_axioms: github.com/webis-de/ir_...
Nice to see axiomatic IR gaining momentum.
We presented two papers at ICTIR 2025 today:
- Axioms for Retrieval-Augmented Generation webis.de/publications...
- Learning Effective Representations for Retrieval Using Self-Distillation with Adaptive Relevance Margins webis.de/publications...
Thrilled to announce that Matti Wiegmann has successfully defended his PhD! 🎉🧑‍🎓 Huge congratulations on this incredible achievement! #PhDDefense #AcademicMilestone
Happy to share that our paper "The Viability of Crowdsourcing for RAG Evaluation" received the Best Paper Honourable Mention at #SIGIR2025! Very grateful to the community for recognizing our work on improving RAG evaluation.

 📄 webis.de/publications...
Reposted by Webis Group
Do not forget to participate in the #TREC2025 Tip-of-the-Tongue (ToT) Track :)

The corpus and baselines (with run files) are now available and easily accessible via the ir_datasets API and the HuggingFace Datasets API.

More details are available at: trec-tot.github.io/guidelines
Results on BEIR show that our method matches the effectiveness of teacher distillation while using only 13.5% of the data and achieving a 3-15x training speedup. This makes effective bi-encoder training more accessible, especially in low-resource settings.
The key idea: the encoder's own predicted similarity between positive and negative documents can be used to scale a traditional margin loss. This performs implicit hard negative mining and requires no margin hyperparameter.
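A minimal sketch of this idea, assuming the margin is set to one minus the cosine similarity between the positive and negative documents (the exact scaling in the paper may differ; function names and the plain-Python cosine are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def adaptive_margin_loss(query, pos, neg):
    """Triplet margin loss with a per-example adaptive margin (sketch).

    A negative that is nearly a duplicate of the positive gets a small
    margin, a clearly dissimilar one a large margin -- an implicit form
    of hard-negative weighting with no margin hyperparameter to tune.
    """
    s_pos = cosine(query, pos)
    s_neg = cosine(query, neg)
    margin = 1.0 - cosine(pos, neg)  # assumed scaling, for illustration
    return max(0.0, margin - (s_pos - s_neg))
```

In an actual training loop, `query`, `pos`, and `neg` would be embeddings produced by the bi-encoder itself, and the loss would be averaged over a batch and backpropagated.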
Our paper on self-distillation for training bi-encoders got accepted at #ICTIR2025! By exploiting pretrained encoder capabilities, our approach eliminates expensive teacher models and batch sampling while maintaining the same effectiveness.
…human texts today, contextualize the findings in terms of our theoretical contribution, and use them to assess the quality and adequacy of existing LLM detection benchmarks, which tend to be constructed with authorship attribution, rather than authorship verification, in mind. 3/3
…limits of the field. We argue that as LLMs improve, detection will not necessarily become impossible, but it will be limited by the capabilities and theoretical boundaries of the field of authorship verification.

We conduct a series of exploratory analyses to show how LLM texts differ from… 2/3
Our paper titled “The Two Paradigms of LLM Detection: Authorship Attribution vs. Authorship Verification” has been accepted to #ACL2025 (Findings). downloads.webis.de/publications...

We discuss why LLM detection is a one-class problem and how that affects the prospective… 1/3 #ACL #NLP #ARR #LLM
Reposted by Webis Group
PAN 2025 Call for Participation: Shared Tasks on Authorship Analysis, Computational Ethics, and Originality

We'd like to invite you to participate in the following shared tasks at PAN 2025 held in conjunction with the CLEF conference in Madrid, Spain.

Find out more at pan.webis.de/clef25/pan25...
pan.webis.de
🧵 4/4 The shared task continues the research on LLM-based advertising. Participants can submit systems for two sub-tasks: First, generate responses with and without ads. Second, classify whether a response contains an ad.
Submissions are open until May 10th and we look forward to your contributions.
🧵 3/4 In many cases, survey participants did not notice brand or product placements in the responses. As a first step towards ad-blockers for LLMs, we created a dataset of responses with and without ads and trained classifiers on the task of identifying the ads.
dl.acm.org/doi/10.1145/...
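As a toy illustration of such an ad classifier (not the trained models from the paper; the cue terms and threshold are invented for this sketch):

```python
def ad_score(response: str, cue_terms: list[str]) -> float:
    """Naive baseline: fraction of promotional cue terms present.
    The cue list is illustrative, not taken from the paper's dataset."""
    text = response.lower()
    hits = sum(term in text for term in cue_terms)
    return hits / len(cue_terms)

def contains_ad(response: str, cue_terms: list[str], threshold: float = 0.34) -> bool:
    """Flag a response as containing an ad if enough cues fire."""
    return ad_score(response, cue_terms) >= threshold
```

The classifiers in the paper are learned from labeled response pairs; a keyword baseline like this mainly serves to show the task's input/output shape.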
🧵 2/4 Given their high operating costs, LLMs require a business model to sustain them, and advertising is a natural candidate.
Hence, we have analyzed how well LLMs can blend product placements with "organic" responses and whether users are able to identify the ads.
dl.acm.org/doi/10.1145/...
Can LLM-generated ads be blocked? With OpenAI adding shopping options to ChatGPT, this question gains further importance.
If you are interested in contributing to the research on LLM-based advertising, please check out our shared task: touche.webis.de/clef25/touch...

More details below.
🧵 4/4 Credit and thanks to the author team @lgnp.bsky.social @timhagen.bsky.social @maik-froebe.bsky.social @matthias-hagen.bsky.social @benno-stein.de @martin-potthast.com @hscells.bsky.social – you can also catch some of them at #ECIR2025 currently if you want to chat about RAG!