📅 I’m presenting Thursday, July 31st at the TRL workshop
I’ll be around all week, so if you’re also interested in tabular learning/understanding and insight retrieval, feel free to reach out to me. I would be happy to connect! (4/4)
🔹 BLEU/BERTScore? Not reliable for evaluating tabular QA capabilities
🔹 LLMs often struggle with missing values, duplicates, or structural alterations (see the sketch below)
🔹 We propose an LLM-as-a-judge method for a more realistic evaluation of LLMs' tabular reasoning capabilities (3/4)
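To make the perturbation idea concrete, here is a minimal pandas sketch of the kinds of alterations meant here (the table and column names are illustrative, not the benchmark's actual data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy table -- illustrative only, not the benchmark's actual data.
df = pd.DataFrame({
    "city": ["Oslo", "Lima", "Pune", "Kyiv"],
    "population_m": [0.7, 10.9, 7.4, 2.9],
    "area_km2": [454, 2672, 331, 839],
})

# Missing values: blank out a random cell in a numeric column.
perturbed = df.copy()
perturbed.loc[rng.integers(len(df)), "population_m"] = np.nan

# Duplicates: append a copy of a random row.
perturbed = pd.concat(
    [perturbed, perturbed.sample(1, random_state=0)], ignore_index=True
)

# Structural alteration: shuffle the column order.
perturbed = perturbed[list(rng.permutation(perturbed.columns))]
print(perturbed)
```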
"How well do LLMs reason over tabular data, really?" 📊
We dig into two important questions:
1️⃣ Are general-purpose LLMs robust with real-world tables?
2️⃣ How should we actually evaluate them? (2/4)
"How well do LLMs reason over tabular data, really?" 📊
We dig into two important questions:
1️⃣ Are general-purpose LLMs robust with real-world tables?
2️⃣ How should we actually evaluate them? (2/4)
And I am excited to keep building on this research!
📄 Paper link: arxiv.org/pdf/2505.07453
Even on simple tasks like look-up, LLM performance drops significantly as table size increases.
And even on smaller tables, results leave plenty of room for improvement, highlighting major gaps in LLMs' understanding of tabular data and the need for more research on this topic.
Using this dataset and the LLM-as-a-judge, we tested response accuracy on basic reasoning tasks like look-ups, subtractions, and averages.
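For a concrete picture of these task types, here is a small pandas sketch of how gold answers for such questions can be computed (toy table and questions, not the actual benchmark):

```python
import pandas as pd

# Toy table standing in for a benchmark table.
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "q1_sales": [120, 85, 240],
    "q2_sales": [150, 60, 210],
})

# Look-up: "What were Q1 sales for product C?"
lookup = df.loc[df["product"] == "C", "q1_sales"].item()    # 240

# Subtraction: "How much did product B's sales drop from Q1 to Q2?"
diff = (df.loc[df["product"] == "B", "q1_sales"].item()
        - df.loc[df["product"] == "B", "q2_sales"].item())  # 25

# Average: "What were average Q2 sales across products?"
avg = df["q2_sales"].mean()                                 # 140.0

print(lookup, diff, avg)
```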
🔍 The standard metrics? BLEU, BERTScore?
They fail to capture whether the generated answers are actually correct in this setting.
So we introduced an alternative:
An LLM-as-a-judge to assess responses more reliably.
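Schematically, such a judge can be a few lines of code. Here is a minimal sketch using the OpenAI chat API; the judge model and prompt wording are illustrative assumptions, not the paper's exact setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt template -- not the paper's exact wording.
JUDGE_PROMPT = """You are grading a model's answer to a question about a table.

Question: {question}
Gold answer: {gold}
Model answer: {prediction}

Reply with exactly one word: CORRECT or INCORRECT."""

def judge(question: str, gold: str, prediction: str) -> bool:
    """Return True if the judge model deems the prediction correct."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, gold=gold, prediction=prediction)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("CORRECT")
```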
But what happens when these two meet? Do LLMs actually understand tables when they encounter them, for example, in a RAG pipeline?
Most benchmarks don’t test this well. So we decided to dig deeper.👇