Lukas Thede (@lukasthede.bsky.social)
IMPRS-IS PhD Student with Zeynep Akata and Matthias Bethge at the University of Tübingen and Helmholtz Munich, working on continually adapting foundation models.
10/
This project was a joint effort with amazing collaborators:
👥 @confusezius.bsky.social, Matthias Bethge, @zeynepakata.bsky.social, and @tomhartvigsen.bsky.social
Huge thanks to them for the ideas, feedback, and countless hours that made this work possible. 🙏
April 8, 2025 at 3:32 PM
9/
📘 Want to test your method at scale?
📄 Paper: arxiv.org/abs/2503.05683
🗂️ Benchmark: huggingface.co/datasets/luk...
💻 Code: github.com/ExplainableM...
Let’s build LLMs that truly stay up to date. 🔄
Excited to see what the community does with this!
[Link preview] Understanding the Limits of Lifelong Knowledge Editing in LLMs (arxiv.org)
8/
🔍 TL;DR:
✅ We release WikiBigEdit - a new large-scale benchmark for real-world factual updates
🚨 Existing editing methods fail to scale
💡 Finetuning + merging is a surprisingly strong baseline
🧩 RAG wins - but with trade-offs
7/
Surprisingly, simple continual finetuning (LoRA) outperforms all editing baselines - at equal inference cost.
And when paired with model merging, performance improves even further over time.
💪 More scalable, more robust, and better retention across time steps.
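To make the baseline concrete, here is a minimal sketch of continual LoRA finetuning with weight-space merging between time steps, using the Hugging Face transformers and peft libraries. The base model, LoRA hyperparameters, merging coefficient, and the timestep_batches iterable are illustrative assumptions, not the paper's exact pipeline, and the training loop itself is elided.

```python
# Minimal sketch of continual LoRA finetuning with weight-space merging between
# time steps. Model choice, hyperparameters, and data handling are assumptions,
# not the paper's exact setup; the training loop itself is elided.
import copy
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder base LM
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])

def finetune_on_timestep(model, qa_pairs):
    """Attach fresh LoRA adapters, train on one time step's updates, fold them back in."""
    peft_model = get_peft_model(copy.deepcopy(model), lora_cfg)
    # ... standard causal-LM training loop over qa_pairs goes here ...
    return peft_model.merge_and_unload()

def merge_checkpoints(model_a, model_b, alpha=0.5):
    """Uniform weight-space interpolation between the previous and updated checkpoint."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged_sd = {
        k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] if sd_a[k].is_floating_point() else sd_b[k]
        for k in sd_a
    }
    merged = copy.deepcopy(model_a)
    merged.load_state_dict(merged_sd)
    return merged

timestep_batches = []  # placeholder: one list of QA pairs per WikiBigEdit time step
current = base_model
for qa_pairs in timestep_batches:
    updated = finetune_on_timestep(current, qa_pairs)
    current = merge_checkpoints(current, updated)  # retain earlier updates while adding new ones
```

The merging step is what helps retention: instead of letting each round of finetuning overwrite the last, the interpolated weights keep a trace of every previous time step.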
6/
RAG performs best overall - nearly tripling accuracy on edit and generalization tasks.
But:
⏳ It comes with significantly higher inference cost
🔄 And still struggles with multi-hop reasoning over updated facts
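As a rough illustration of a retrieval-augmented baseline (not the paper's exact setup): the updated facts live in an external store, the LLM weights stay frozen, and every query pays an extra retrieval cost. The encoder, prompt template, and fact strings below are assumptions.

```python
# Rough sketch of a RAG baseline: retrieve updated facts at query time instead
# of editing the model's weights. Encoder, prompt format, and fact store are
# illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

fact_store = [
    "As of July 2024, <entity> holds <updated value>.",  # placeholder updated facts
]
fact_embeddings = encoder.encode(fact_store, convert_to_tensor=True)

def answer_with_rag(question, generate, top_k=3):
    """Prepend the top-k retrieved facts to the prompt; the LLM stays frozen."""
    q_emb = encoder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, fact_embeddings, top_k=top_k)[0]
    context = "\n".join(fact_store[h["corpus_id"]] for h in hits)
    prompt = f"Facts:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)  # `generate`: any text-generation callable
```

The per-query encode-and-search step is where the higher inference cost comes from, and retrieving individual facts is rarely enough when a question requires chaining several updates together.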
5/
The result? 📉
Most editing methods struggle at scale.
ROME and MEMIT collapse within a few hundred updates.
Even WISE, built for lifelong edits, degrades quickly - converging to pre-edit performance.
➡️ These techniques aren’t yet ready for real-world demands.
4/
We put popular editing methods to the test:
🔧 ROME, MEMIT, WISE
🔁 LoRA finetuning & merging
🔍 Retrieval-augmented generation (RAG)

How do they stack up on update accuracy, reasoning, generalization, and locality?
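For readers who want the protocol behind these numbers, here is a hedged sketch of a lifelong-editing evaluation loop: edits are applied sequentially, and the model is probed along the four axes after each time step. The apply_edit and probe callables and the question field names are placeholders, not the benchmark's actual API.

```python
# Sketch of a lifelong-editing evaluation loop. `apply_edit` wraps whichever
# method is under test (ROME, MEMIT, WISE, LoRA, RAG); `probe` scores answers.
# Both callables and the question field names are placeholders.
def evaluate_lifelong(model, timestep_batches, apply_edit, probe):
    results = []
    for t, batch in enumerate(timestep_batches):
        for example in batch:              # edits arrive one at a time, in order
            model = apply_edit(model, example)
        results.append({
            "timestep": t,
            "edit_accuracy": probe(model, batch, "question"),
            "generalization": probe(model, batch, "rephrased_question"),
            "multihop_reasoning": probe(model, batch, "multihop_question"),
            "locality": probe(model, batch, "locality_question"),
        })
    return results
```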
3/
Unlike synthetic edit datasets, WikiBigEdit tracks real-world knowledge changes over time.

It probes multi-hop reasoning, semantic generalization, and whether new edits interfere with existing knowledge.
And it’s built to continuously grow - for future-proof evaluation.
2/
📣 Introducing WikiBigEdit: a new benchmark for lifelong knowledge editing.

It includes:
📌 500K+ real-world QA pairs based on Wikidata
📆 8 time steps over 6 months (Feb–Jul 2024) and continuously updatable
🧪 Rich evaluations: reasoning, generalization, locality, …
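To give a feel for the data, a single benchmark entry might look roughly like the dictionary below. The field names are illustrative guesses, not the dataset's actual schema; check the dataset card for the real layout.

```python
# Hypothetical illustration of one WikiBigEdit example; field names are guesses,
# not the dataset's actual schema (see the dataset card on Hugging Face).
example = {
    "timestep": "2024-02",                       # one of the eight Wikidata snapshots
    "question": "Who is the CEO of <company>?",  # probes the updated fact
    "answer": "<new CEO>",                       # updated ground-truth answer
    "rephrased_question": "…",                   # semantic generalization probe
    "multihop_question": "…",                    # reasoning over the updated fact plus another
    "locality_question": "…",                    # unrelated fact that must stay unchanged
}
```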
1/
Most LLMs are static snapshots of past knowledge.
But facts change constantly - and retraining is far too costly.
Knowledge editing offers a cheaper fix.
But how far can it actually take us?
We put it to the test - at realistic deployment scale.