Wayne
@waynechi.bsky.social
CS Ph.D. at CMU. Building Copilot Arena. Editor at http://blog.ml.cmu.edu
Interested in trying out Copilot Arena for yourself?
Download at lmarena.ai/copilot.
Follow for more updates!
Copilot Arena - Visual Studio Marketplace
Extension for Visual Studio Code - Code with and evaluate the latest LLMs and Code Completion models
March 5, 2025 at 4:49 PM
Full Paper with additional analyses: arxiv.org/abs/2502.09328
Code: github.com/lmarena/copi...

w/ Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, @chrisdonahue.com, @atalwalkar.bsky.social
Our paper analyzes human preferences across 10 SOTA coding models, but we continue to add more models to the live Copilot Arena leaderboard on lmarena.ai!
Different data slices affect user preferences disproportionately. Relative model performance differs drastically between real-world tasks such as frontend or backend development and LeetCode-style coding challenges, but varies little across programming languages.
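Comparisons like this can be made concrete by grouping pairwise votes by slice and computing per-slice win rates. A minimal sketch, assuming a hypothetical vote format of (winner, loser, slice); the field names and data are illustrative, not the paper's pipeline:

```python
from collections import defaultdict

# Hypothetical pairwise votes: (winner_model, loser_model, data_slice)
votes = [
    ("model_a", "model_b", "frontend"),
    ("model_a", "model_b", "frontend"),
    ("model_b", "model_a", "leetcode"),
]

def win_rates_by_slice(votes):
    """Return (slice, model) -> fraction of that model's comparisons won."""
    wins = defaultdict(int)    # (slice, model) -> number of wins
    totals = defaultdict(int)  # (slice, model) -> number of comparisons
    for winner, loser, slc in votes:
        wins[(slc, winner)] += 1
        totals[(slc, winner)] += 1
        totals[(slc, loser)] += 1
    return {key: wins[key] / totals[key] for key in totals}

rates = win_rates_by_slice(votes)
```

Large gaps between a model's win rates on, say, the "frontend" slice versus the "leetcode" slice are exactly the kind of disproportionate shift described above.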
We attribute these differences to a significant shift in our data distribution. Compared to previous benchmarks, Copilot Arena covers more programming languages (PL) and natural languages (NL), longer context lengths, multiple task types, and a wider variety of code structures.
Our leaderboard differs from existing evaluations. In particular, smaller models overperform on static benchmarks compared to real development workflows.
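One common way to quantify how much two leaderboards disagree is Spearman rank correlation over their model orderings. A small sketch with made-up rankings (assumes no ties; the rank values are illustrative, not results from the paper):

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation for two rankings over the same models.

    rank_a, rank_b: model -> rank position (1 = best), no ties.
    Returns 1.0 for identical orderings, -1.0 for fully reversed ones.
    """
    n = len(rank_a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical: a static benchmark's ranking vs. an arena ranking.
static_rank = {"small_model": 1, "mid_model": 2, "big_model": 3}
arena_rank = {"small_model": 3, "mid_model": 2, "big_model": 1}
```

A correlation well below 1.0, as in this reversed toy example, would indicate the kind of divergence between static benchmarks and real-workflow preferences described above.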
We evaluate models in a developer's IDE by presenting pairs of code completions generated by two different models. This workflow evaluates human preferences across models with real users and tasks in their native environment.
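Arena-style leaderboards are typically derived from such pairwise votes by fitting a Bradley-Terry model. A minimal sketch using the classic MM update; this is an illustration of the general technique, not the paper's exact ranking pipeline:

```python
from collections import defaultdict

def bradley_terry(votes, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the standard MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j),
    where W_i is model i's total wins and n_ij the number of i-vs-j
    comparisons. Strengths are normalized to sum to the number of models.
    """
    wins = defaultdict(int)   # model -> total wins
    pairs = defaultdict(int)  # frozenset({i, j}) -> comparison count
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        pairs[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                pairs[frozenset((i, j))] / (p[i] + p[j])
                for j in models
                if j != i and pairs[frozenset((i, j))] > 0
            )
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {m: v * len(models) / total for m, v in new_p.items()}
    return p
```

For example, a model that wins three of four head-to-head votes against another ends up with the higher strength, which is what the leaderboard ordering reflects.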