🗂️ Dataset: huggingface.co/datasets/luk...
💻 Code: github.com/ExplainableM...
This project was a joint effort with amazing collaborators:
👥 @confusezius.bsky.social, Matthias Bethge, @zeynepakata.bsky.social, and @tomhartvigsen.bsky.social
Huge thanks to them for the ideas, feedback, and countless hours that made this work possible. 🙏
📘 Want to test your method at scale?
📄 Paper: arxiv.org/abs/2503.05683
🗂️ Benchmark: huggingface.co/datasets/luk...
💻 Code: github.com/ExplainableM...
Let’s build LLMs that truly stay up to date. 🔄
Excited to see what the community does with this!
🔍 TL;DR:
✅ We release WikiBigEdit - a new large-scale benchmark for real-world factual updates
🚨 Existing editing methods fail to scale
💡 Finetuning + merging is a surprisingly strong baseline
🧩 RAG wins - but with trade-offs
Surprisingly, simple continual finetuning (LoRA) outperforms all editing baselines - at equal inference cost.
And when paired with model merging, performance improves even further over time.
💪 More scalable, more robust, and better retention across time steps.
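For a concrete picture, here's a minimal sketch of that recipe, assuming a PEFT-style LoRA setup with uniform weight averaging. The rank, target modules, merge coefficient, and helper names are illustrative, not our exact configuration:

```python
# Hedged sketch: continual LoRA finetuning + model merging.
# Rank, target modules, and the 0.5 merge coefficient are
# illustrative choices, not the paper's exact recipe.
from peft import LoraConfig, get_peft_model

def finetune_on_edits(base, edit_batch):
    """Attach a LoRA adapter, train it on one time step of edits,
    then fold the adapter back into the base weights."""
    cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
    model = get_peft_model(base, cfg)
    # ... standard causal-LM training loop over edit_batch goes here ...
    return model.merge_and_unload()  # one plain model, adapter folded in

def average_weights(new, old, alpha=0.5):
    """Uniform weight averaging - the simplest form of model merging."""
    merged = {k: alpha * v + (1 - alpha) * old.state_dict()[k]
              for k, v in new.state_dict().items()}
    new.load_state_dict(merged)
    return new

def continual_update(model, edit_timesteps):
    """Finetune on each time step of edits (any HF causal LM), merging
    each new checkpoint with the previous one to retain earlier edits."""
    for edits in edit_timesteps:
        model = average_weights(finetune_on_edits(model, edits), model)
    return model
```

Merging with the previous checkpoint is what drives the retention: each new finetune alone would drift, while the running average anchors earlier edits.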
RAG performs best overall - nearly tripling accuracy on edit and generalization tasks.
But:
⏳ It comes with significantly higher inference cost
🔄 And still struggles with multi-hop reasoning over updated facts
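The mechanics behind those numbers are simple. A minimal sketch, assuming TF-IDF retrieval over a store of updated facts (the fact strings and k below are made up):

```python
# Hedged RAG sketch: retrieve the top-k most relevant updated facts
# and prepend them to the question. Facts and k are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

facts = [
    "As of March 2024, X is the CEO of Y.",   # hypothetical updates
    "Z was renamed W in February 2024.",
]
vec = TfidfVectorizer().fit(facts)
fact_matrix = vec.transform(facts)

def build_prompt(question, k=2):
    """Score every stored fact against the question; prepend the top-k."""
    sims = cosine_similarity(vec.transform([question]), fact_matrix)[0]
    top = [facts[i] for i in sims.argsort()[::-1][:k]]
    return "Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}\nAnswer:"
```

That extra retrieval pass on every query is exactly where the inference-cost overhead comes from - and a single retrieval step can miss the intermediate fact a multi-hop question needs.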
The result? 📉
Most editing methods struggle at scale.
ROME and MEMIT collapse within a few hundred updates.
Even WISE, built for lifelong edits, degrades quickly - converging to pre-edit performance.
➡️ These techniques aren’t yet ready for real-world demands.
We put popular editing methods to the test:
🔧 ROME, MEMIT, WISE
🔁 LoRA finetuning & merging
🔍 Retrieval-augmented generation (RAG)
How do they stack up on update accuracy, reasoning, generalization, and locality?
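All methods run through the same sequential protocol. A hedged sketch, where `apply_edits` and `accuracy` are hypothetical stand-ins for method-specific code, not the repo's actual API:

```python
# Hedged sketch of a lifelong-editing evaluation loop.
# `apply_edits` and `accuracy` are hypothetical stand-ins, injected
# so the same loop covers ROME/MEMIT/WISE/LoRA/RAG alike.
def evaluate_lifelong(model, timesteps, apply_edits, accuracy):
    scores = []
    for edits in timesteps:                # e.g. 8 WikiBigEdit time steps
        model = apply_edits(model, edits)  # method-specific update
        scores.append({
            "edit":           accuracy(model, edits, probe="edit"),
            "generalization": accuracy(model, edits, probe="rephrase"),
            "reasoning":      accuracy(model, edits, probe="multi_hop"),
            "locality":       accuracy(model, edits, probe="locality"),
        })
    return scores
```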
Unlike synthetic edit datasets, WikiBigEdit tracks real-world knowledge changes over time.
It probes multi-hop reasoning, semantic generalization, and whether new edits interfere with existing knowledge.
And it’s built to continuously grow - for future-proof evaluation.
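To make those axes concrete, here's the rough shape of one evaluation record. The field names are illustrative, not the released schema, and the Hub id is truncated in the link above, so it's left as a placeholder:

```python
# Illustrative shape of one WikiBigEdit-style record; field names are
# assumptions, not the released schema.
# from datasets import load_dataset
# ds = load_dataset("<dataset id from the link above>")  # id truncated in the post
example = {
    "edit":      ("Who is the CEO of Y?", "X"),            # the factual update
    "rephrase":  "Who serves as Y's chief executive?",     # semantic generalization
    "multi_hop": "In which city was the CEO of Y born?",   # reasoning over the edit
    "locality":  "What is the capital of France?",         # unrelated facts must survive
}
```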
📣 Introducing WikiBigEdit: a new benchmark for lifelong knowledge editing.
It includes:
📌 500K+ real-world QA pairs based on Wikidata
📆 8 time steps over 6 months (Feb–Jul 2024) and continuously updatable
🧪 Rich evaluations: reasoning, generalization, locality, …
Most LLMs are static snapshots of past knowledge.
But facts change constantly - and retraining is far too costly.
Knowledge editing offers a cheaper fix.
But how far can it actually take us?
We put it to the test - at realistic deployment scale.