Lightnews — Scholar-powered news

Arkil Patel

@arkil.bsky.social

270 followers 400 following 9 posts

PhD Student at Mila and McGill | Research in ML and NLP | Past: AI2, MSFTResearch

arkilpatel.github.io

Arkil Patel

@arkil.bsky.social

Paper: arxiv.org/pdf/2502.14678

Data: tinyurl.com/chase-data

Code: github.com/McGill-NLP/C...

February 21, 2025 at 4:29 PM

Arkil Patel

@arkil.bsky.social

𝐍𝐨𝐭𝐞: Our work is a preliminary exploration of automatically generating high-quality, challenging benchmarks for LLMs. We discuss concrete limitations and the large scope for future work in the paper.

February 21, 2025 at 4:29 PM

Arkil Patel

@arkil.bsky.social

Results:

- SOTA LLMs achieve 40-60% performance
- 𝐂𝐇𝐀𝐒𝐄 distinguishes between models well (unlike standard benchmarks such as GSM8K, where models score similarly)
- While LLMs today have 128k-1M context sizes, 𝐂𝐇𝐀𝐒𝐄 shows they struggle to reason even at ~50k context size

February 21, 2025 at 4:29 PM

Arkil Patel

@arkil.bsky.social

𝐂𝐇𝐀𝐒𝐄 uses 2 simple ideas:

1. Bottom-up creation of complex context by “hiding” components of the reasoning process
2. Decomposing the generation pipeline into simpler, “soft-verifiable” sub-tasks

February 21, 2025 at 4:29 PM

Arkil Patel

@arkil.bsky.social

𝐂𝐇𝐀𝐒𝐄 automatically generates challenging evaluation problems across 3 domains:

1. 𝐂𝐇𝐀𝐒𝐄-𝐐𝐀: Long-context question answering
2. 𝐂𝐇𝐀𝐒𝐄-𝐂𝐨𝐝𝐞: Repo-level code generation
3. 𝐂𝐇𝐀𝐒𝐄-𝐌𝐚𝐭𝐡: Math reasoning

February 21, 2025 at 4:29 PM

Arkil Patel

@arkil.bsky.social

Why synthetic data for evaluation?

- Having humans create “hard” problems is expensive (and may soon hit a limit!)
- Annotating long-context data is impractical for humans
- Other benefits: scalable, renewable, and mitigates contamination concerns

February 21, 2025 at 4:29 PM
