Wayne
@waynechi.bsky.social
CS Ph.D. at CMU. Building Copilot Arena. Editor at http://blog.ml.cmu.edu
Our paper analyzes human preferences across 10 SOTA coding models, but we continue to add more models to the live Copilot Arena leaderboard on lmarena.ai!
March 5, 2025 at 4:49 PM
We attribute these differences to a significant shift in our data distribution. Compared to previous benchmarks, Copilot Arena observes more programming languages (PL) and natural languages (NL), longer context lengths, multiple task types, and varied code structures.
March 5, 2025 at 4:49 PM
Our leaderboard differs from existing evaluations. In particular, smaller models overperform on static benchmarks relative to real development workflows.
March 5, 2025 at 4:49 PM
We evaluate models in a developer's IDE by presenting pairs of code completions generated by two different models. This workflow evaluates human preferences across models with real users and tasks in their native environment.
March 5, 2025 at 4:49 PM
What do developers 𝘳𝘦𝘢𝘭𝘭𝘺 think of AI coding assistants?

In October, we launched Copilot Arena to collect user preferences on real dev workflows. After months of live service, we’re here to share our findings in our recent preprint.

Here's what we have learned /🧵
March 5, 2025 at 4:49 PM
Got to test out InceptionAILabs' newest model, Mercury Coder Mini, on Copilot Arena!

Mercury Coder Mini is blazing fast and overtakes Codestral as the fastest coding model out there (0.24s end-to-end latency) while boasting similar performance (joint #2).

Congrats to InceptionAILabs! 📸
February 26, 2025 at 11:51 PM
I'm not physically at NeurIPS, but my good friend @naveenraman.bsky.social will be presenting in my stead.

In this work, we found that UI element ordering significantly affected GUI agent performance. Come check out the poster (and quiz Naveen) at the Workshop on Open-World Agents (OWA-2024)!
December 13, 2024 at 7:32 AM
Bruh what... 💀
December 10, 2024 at 5:51 PM
We've open-sourced Copilot Arena’s server code!

Check out how we handle code completions and share your ideas for new system prompts!

GitHub: github.com/lmarena/copi...
Technical details in the blog: blog.lmarena.ai/blog/2024/co...

Download Copilot now at: lmarena.ai/copilot
December 5, 2024 at 7:44 PM