Cornelius Wolff
@cowolff.bsky.social
PhD Student at the TRL Lab at CWI, Amsterdam
🧪 Paper link: arxiv.org/pdf/2505.07453
📅 I’m presenting Thursday, July 31st at the TRL workshop

I’ll be around all week, so if you’re also interested in tabular learning/understanding and insight retrieval, feel free to reach out to me. I would be happy to connect! (4/4)
July 25, 2025 at 3:06 PM
Turns out:
🔹 BLEU/BERTScore? Not reliable for evaluating tabular QA capabilities
🔹 LLMs often struggle with missing values, duplicates, or structural alterations
🔹 We propose an LLM-as-a-judge method for a more realistic evaluation of LLMs’ tabular reasoning capabilities (3/4)
July 25, 2025 at 3:06 PM
The paper's called:
"How well do LLMs reason over tabular data, really?" 📊

We dig into two important questions:
1️⃣ Are general-purpose LLMs robust with real-world tables?
2️⃣ How should we actually evaluate them? (2/4)
July 25, 2025 at 3:06 PM
Huge thanks to @madelonhulsebos.bsky.social for all the support on getting this work off the ground on such short notice after I started my PhD 🙏
And I am excited to keep building on this research!
📄 Paper link: arxiv.org/pdf/2505.07453
May 28, 2025 at 10:03 AM
What did we find?
Even on simple tasks like look-ups, LLM performance drops significantly as table size increases.
And even on smaller tables, the results leave plenty of room for improvement, highlighting major gaps in LLMs’ understanding of tabular data and the need for more research in this area.
May 28, 2025 at 10:03 AM
Furthermore, we extended the existing TQA-Benchmark with common data perturbations such as missing values, duplicates, and column shuffling.
Using this dataset and the LLM-as-a-judge, we measured response accuracy on basic reasoning tasks such as look-ups, subtractions, and averages.
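To give a flavour of what such perturbations look like in practice, here is a minimal sketch (not our actual benchmark code; the function names, parameters, and toy table are illustrative) of how missing values, duplicate rows, and column shuffling could be applied to a pandas table before it is serialized into a prompt:

```python
# Minimal sketch (not the benchmark's actual code) of the kinds of
# perturbations described above, applied to a pandas table.
import numpy as np
import pandas as pd


def add_missing_values(df: pd.DataFrame, frac: float = 0.1, seed: int = 0) -> pd.DataFrame:
    """Blank out a random fraction of the cells."""
    rng = np.random.default_rng(seed)
    return df.mask(rng.random(df.shape) < frac)


def add_duplicate_rows(df: pd.DataFrame, frac: float = 0.1, seed: int = 0) -> pd.DataFrame:
    """Append randomly sampled duplicate rows."""
    duplicates = df.sample(frac=frac, replace=True, random_state=seed)
    return pd.concat([df, duplicates], ignore_index=True)


def shuffle_columns(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Reorder the columns without changing their contents."""
    rng = np.random.default_rng(seed)
    return df[rng.permutation(df.columns)]


# Toy example: perturb a small table before serializing it into the prompt.
table = pd.DataFrame({"city": ["Amsterdam", "Utrecht", "Leiden"],
                      "population": [920000, 360000, 125000]})
perturbed = shuffle_columns(add_duplicate_rows(add_missing_values(table, frac=0.3), frac=0.5))
print(perturbed.to_string(index=False))
```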
May 28, 2025 at 10:03 AM
But measuring whether an LLM’s answer is actually correct turned out to be surprisingly tricky.
🔍 The standard metrics? BLEU, BERTScore?
They fail to capture the correctness of the answers given in this setting.
So we introduced an alternative:
An LLM-as-a-judge to assess responses more reliably.
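For illustration, here is a minimal sketch of such a judge (not our exact prompt or setup; the prompt wording, the model choice, and the use of the OpenAI client are assumptions for the example). The judge sees the question, the reference answer, and the model’s answer, and returns a binary verdict:

```python
# Minimal sketch of an LLM-as-a-judge check (not our exact prompt or setup).
# Assumes the OpenAI Python client; any chat-capable model could be swapped in.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are grading answers to questions about a table.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""


def judge_answer(question: str, reference: str, candidate: str,
                 model: str = "gpt-4o-mini") -> bool:
    """Ask a judge model whether the candidate answer matches the reference."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question,
                                                  reference=reference,
                                                  candidate=candidate)}],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")


# A paraphrase like "About 3.5 thousand" vs "3500" should be judged correct,
# even though string-overlap metrics would score it poorly.
print(judge_answer("What is the average population?", "3500", "About 3.5 thousand"))
```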
May 28, 2025 at 10:03 AM
Tables are everywhere, and so are LLMs these days!
But what happens when the two meet? Do LLMs actually understand tables when they encounter them, for example in a RAG pipeline?
Most benchmarks don’t test this well. So we decided to dig deeper. 👇
May 28, 2025 at 10:03 AM