Llama 4 Maverick is top 4 on all@1 for Time Complexity Generation and top 2 🥈 on coeffFull for Time Complexity Ranking (beating R1 without using any reasoning tokens).
It is less performant on Space Complexity.
👇All links below👇
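For context on the all@1 metric above, a minimal sketch, assuming all@1 scores a problem as solved only if a single attempt succeeds on every complexity class the problem requires (function and names are illustrative, not the official scoring code):

```python
# Hedged sketch of an all@1-style metric: a single attempt must succeed
# on every complexity class of a problem (illustrative, not the official
# BigO(Bench) scoring code).
def all_at_1(per_class_success: dict[str, bool]) -> float:
    return 1.0 if all(per_class_success.values()) else 0.0

# Example: a problem requiring O(n log n) time and O(1) extra space.
print(all_at_1({"O(n log n)": True, "O(1)": True}))   # 1.0
print(all_at_1({"O(n log n)": True, "O(1)": False}))  # 0.0
```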
3 models added to our benchmark:
🏆 nvidia/Llama-3_1-Nemotron-Ultra-253B-v1
🧑‍💻 agentica-org/DeepCoder-14B-Preview
🤲 all-hands/openhands-lm-32b-v0.1
Thanks @vllm_project and @huggingface for quickly supporting inference!
👇All links below👇
✨3,105 coding problems and 1,190,250 solutions from CodeContests
✨Time/Space Complexity labels and curve coefficients
✨Up to 5k Runtime/Memory Footprint measures for each solution
huggingface.co/datasets/fac...
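If you want to poke at the data, a minimal sketch with the 🤗 datasets library; the repo id below is an assumption (the exact one is in the link above), so adapt to the dataset card:

```python
# Hedged sketch: the dataset id below is an assumption, check the
# dataset card linked above for the exact repo id and configs.
from datasets import load_dataset

ds = load_dataset("facebook/BigOBench", split="train")  # assumed repo id
print(ds[0].keys())  # problem statement, solutions, complexity labels, ...
```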
All these models have a similar number of active parameters; DeepSeekV3-0324 is an MoE with 37B active parameters.
Whereas DeepSeekR1 and QwQ use reasoning tokens (and therefore far more inference tokens), Gemma3 and DeepSeekV3-0324 output the result directly.
🧵2/6
In that respect, LLMs behave like any human programmer: usually accustomed to easily finding non-optimized solutions, but struggling to find the best ones.
Token-space reasoning models perform best!
The framework ran on ~1M Code Contests solutions 👉 Data is public too!
Lastly, we designed test sets and evaluated LLMs 👉 Leaderboard is out!
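The real framework is much more involved, but to illustrate the core idea of inferring a complexity class from runtime measures, here's a toy least-squares fit against candidate curves (entirely my own sketch, not the paper's code):

```python
# Toy illustration (not the BigO(Bench) framework): pick the complexity
# class whose curve best fits (input size, runtime) measurements, using a
# closed-form least-squares scale coefficient per candidate.
import numpy as np

CANDIDATES = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * np.log2(n),
    "O(n^2)": lambda n: n ** 2,
}

def best_fit_class(sizes, runtimes):
    sizes = np.asarray(sizes, dtype=float)
    runtimes = np.asarray(runtimes, dtype=float)
    residuals = {}
    for name, curve in CANDIDATES.items():
        basis = curve(sizes)
        coeff = (runtimes @ basis) / (basis @ basis)  # least-squares scale
        residuals[name] = np.sum((runtimes - coeff * basis) ** 2)
    return min(residuals, key=residuals.get)

# Example: quadratic-looking measurements should map to O(n^2).
ns = np.array([100, 200, 400, 800])
print(best_fit_class(ns, 1e-6 * ns ** 2))  # -> "O(n^2)"
```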
We investigate the performance of LLMs on 3 tasks:
✅ Time/Space Complexity Prediction
✅ Time/Space Complexity Generation
✅ Time/Space Complexity Ranking
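To make the tasks concrete, here is roughly what each query looks like (the wording is my paraphrase, not the benchmark's exact prompt templates):

```python
# Paraphrased task shapes (not the benchmark's exact prompts).
problem = "Sort an array of n integers."          # toy placeholder
solution = "def solve(a):\n    return sorted(a)"  # toy placeholder

# Prediction: infer the complexity of a given solution.
prediction = f"What is the time complexity of this solution?\n{problem}\n{solution}"

# Generation: write code that meets a target complexity.
generation = f"Solve this problem in O(n log n) time.\n{problem}"

# Ranking: cover the achievable complexity classes, best coefficients win.
ranking = f"Solve this problem at every achievable time complexity, from least to most optimized.\n{problem}"
```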
BigO(Bench) evaluates high-level reasoning skills in coding, revealing that top-scoring models on Code Contests often struggle when required to both write and reason about their code.
Extra-Space Complexity seems particularly challenging!
Introducing our new non-saturated (for at least the coming week? 😉) benchmark:
✨BigO(Bench)✨ - Can LLMs Generate Code with Controlled Time and Space Complexity?
Check out the details below! 👇