Arkil Patel
@arkil.bsky.social
PhD Student at Mila and McGill | Research in ML and NLP | Past: AI2, MSFTResearch

arkilpatel.github.io
arxiv.org
February 21, 2025 at 4:29 PM
𝐍𝐨𝐭𝐞: Our work is a preliminary exploration of automatically generating high-quality, challenging benchmarks for LLMs. We discuss concrete limitations and the large scope for future work in the paper.
February 21, 2025 at 4:29 PM
Results:

- SOTA LLMs achieve only 40-60% performance
- 𝐂𝐇𝐀𝐒𝐄 distinguishes between models well (as opposed to the similar scores they obtain on standard benchmarks like GSM8K)
- While LLMs today have 128k-1M token context windows, 𝐂𝐇𝐀𝐒𝐄 shows they struggle to reason even at ~50k tokens of context
February 21, 2025 at 4:29 PM
𝐂𝐇𝐀𝐒𝐄 uses 2 simple ideas:

1. Bottom-up creation of complex context by “hiding” components of the reasoning process
2. Decomposing the generation pipeline into simpler, “soft-verifiable” sub-tasks (both ideas are sketched below)
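
For intuition, here is a minimal, hypothetical Python sketch of both ideas. It is not the actual 𝐂𝐇𝐀𝐒𝐄 pipeline (which presumably drives each sub-task with an LLM); all function names, the toy task, and the checks here are illustrative assumptions.

```python
# Illustrative sketch only: a toy, deterministic stand-in for an
# LLM-driven benchmark-generation pipeline.
import random


def generate_reasoning_chain(seed: int) -> list[str]:
    """Hypothetical sub-task 1: build a ground-truth reasoning chain
    bottom-up, so the correct answer is known by construction."""
    rng = random.Random(seed)
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    return [
        f"Fact A: the warehouse holds {a} crates.",
        f"Fact B: each crate holds {b} boxes.",
        f"Answer: {a * b} boxes in total.",
    ]


def hide_in_context(chain: list[str], distractors: list[str], seed: int) -> str:
    """Hypothetical sub-task 2: "hide" the components of the reasoning
    process by scattering them among irrelevant distractor sentences,
    yielding a context where the model must locate and link the facts."""
    facts = chain[:-1]  # the answer itself must not appear in the context
    sentences = facts + distractors
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)


def soft_verify(chain: list[str], context: str) -> bool:
    """A cheap "soft" verifier: checks necessary conditions (every fact
    survived, the answer did not leak) rather than fully proving that
    the generated problem is correct."""
    facts, answer = chain[:-1], chain[-1]
    return all(f in context for f in facts) and answer not in context


if __name__ == "__main__":
    distractors = [
        "The office walls are painted blue.",
        "Deliveries pause on public holidays.",
        "The forklift was serviced last week.",
    ]
    chain = generate_reasoning_chain(seed=0)
    context = hide_in_context(chain, distractors, seed=1)
    # In a real pipeline, a failed soft check would trigger regeneration.
    assert soft_verify(chain, context)
    print("CONTEXT:", context)
    print("TARGET:", chain[-1])
```

The point of the decomposition is that each sub-task's output is simple enough to check with cheap necessary-condition tests like these, even when the final assembled problem is too hard to verify directly.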
February 21, 2025 at 4:29 PM
𝐂𝐇𝐀𝐒𝐄 automatically generates challenging evaluation problems across 3 domains:

1. 𝐂𝐇𝐀𝐒𝐄-𝐐𝐀: Long-context question answering
2. 𝐂𝐇𝐀𝐒𝐄-𝐂𝐨𝐝𝐞: Repo-level code generation
3. 𝐂𝐇𝐀𝐒𝐄-𝐌𝐚𝐭𝐡: Math reasoning
February 21, 2025 at 4:29 PM
Why synthetic data for evaluation?

- Creating “hard” problems with human annotators is expensive (and may hit a limit soon!)
- It is impractical for humans to annotate long-context data
- Other benefits: synthetic data is scalable, renewable, and mitigates contamination concerns
February 21, 2025 at 4:29 PM