Instead of artificially difficult exams where progress ≠ value, we assess LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far:
Instead of artificially difficult exams where progress ≠ value, we assess LLMs on organic, unsolved problems via reference-free LLM validation & community verification. LLMs solved ~10/500 so far: