Arkil Patel
@arkil.bsky.social
PhD Student at Mila and McGill | Research in ML and NLP | Past: AI2, MSFTResearch
arkilpatel.github.io
𝐍𝐨𝐭𝐞: Our work is a preliminary exploration into attempting to automatically generate high quality challenging benchmarks for LLMs. We discuss concrete limitations and huge scope for future work in the paper.
February 21, 2025 at 4:29 PM
Results:
- SOTA LLMs achieve 40-60% performance
- 𝐂𝐇𝐀𝐒𝐄 distinguishes between models well (as opposed to similar performances on standard benchmarks like GSM8k)
- While LLMs today have 128k-1M context sizes, 𝐂𝐇𝐀𝐒𝐄 shows they struggle to reason even at ~50k context size
𝐂𝐇𝐀𝐒𝐄 uses 2 simple ideas:
1. Bottom-up creation of complex context by “hiding” components of the reasoning process
2. Decomposing the generation pipeline into simpler, “soft-verifiable” sub-tasks
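
To make the two ideas concrete, here is a toy sketch in Python. It is not the CHASE pipeline: the fact format, the function names (`evaluate`, `hide_one`, `build_problem`), and the exact arithmetic check are all assumptions made for illustration, and the deterministic check only stands in for whatever "soft" verification the paper uses. The sketch starts from a simple problem with a known answer, repeatedly "hides" a known value behind new facts, and keeps each step only if the answer is still recoverable.

```python
import random


def evaluate(facts, target):
    """Resolve `target` by recursively evaluating the facts that define it."""
    value = facts[target]
    if isinstance(value, int):
        return value
    left, op, right = value
    a, b = evaluate(facts, left), evaluate(facts, right)
    return a + b if op == "+" else a * b


def hide_one(facts, rng):
    """Hide one literal value behind a sum of two new (hypothetical) facts."""
    literals = sorted(name for name, v in facts.items() if isinstance(v, int))
    name = rng.choice(literals)
    original = facts[name]
    part = rng.randint(0, original)  # assumes non-negative seed values
    left, right = f"{name}_a", f"{name}_b"
    candidate = dict(facts)
    candidate[left] = part
    candidate[right] = original - part
    candidate[name] = (left, "+", right)
    return candidate


def build_problem(seed_facts, target, steps, seed=0):
    """Grow the context bottom-up, verifying each small step on its own."""
    rng = random.Random(seed)
    expected = evaluate(seed_facts, target)
    facts = dict(seed_facts)
    for _ in range(steps):
        candidate = hide_one(facts, rng)
        # Per-step check: accept the step only if the answer is unchanged.
        if evaluate(candidate, target) == expected:
            facts = candidate
    return facts, expected


seed_facts = {"apples": 3, "pears": 4, "total": ("apples", "+", "pears")}
facts, expected = build_problem(seed_facts, "total", steps=3)
print(expected)    # 7
print(len(facts))  # 9: three seed facts plus two new facts per hiding step
```

Each hiding step is small enough to check in isolation, which is the point of the decomposition: the final context is harder to reason over than any single step was to verify.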
𝐂𝐇𝐀𝐒𝐄 automatically generates challenging evaluation problems across 3 domains:
1. 𝐂𝐇𝐀𝐒𝐄-𝐐𝐀: Long-context question answering
2. 𝐂𝐇𝐀𝐒𝐄-𝐂𝐨𝐝𝐞: Repo-level code generation
3. 𝐂𝐇𝐀𝐒𝐄-𝐌𝐚𝐭𝐡: Math reasoning
Why synthetic data for evaluation?
- Creating “hard” problems using humans is expensive (and may hit a limit soon!)
- Impractical for humans to annotate long-context data
- Other benefits: scalable, renewable, mitigate contamination concerns