Ofir Press
ofirpress.bsky.social
I develop tough benchmarks for LMs and then I build agents to try to beat those benchmarks. Postdoc @ Princeton University.

https://ofir.io/about
Do language models have algorithmic creativity?

To find out, we built AlgoTune, a benchmark challenging agents to optimize 100+ algorithms like gzip compression, AES encryption and PCA. Frontier models struggle, finding only surface-level wins. Lots of headroom here!
algotune.io
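To illustrate the shape of such a task (this is a hypothetical sketch, not AlgoTune's actual API or task set), imagine a harness that gives the agent a reference solver and asks for a faster function computing the same answer. Here is a minimal PCA example in Python, where the candidate swaps a full SVD for an eigendecomposition of the covariance matrix:

```python
# Hypothetical AlgoTune-style task: speed up PCA while matching the
# reference output. Function names and the harness shape are illustrative.
import time
import numpy as np

def reference_pca(X, k):
    # Baseline: top-k principal axes via full SVD of the centered data.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

def optimized_pca(X, k):
    # Candidate: eigendecomposition of the (small) d x d covariance matrix.
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(Xc.T @ Xc)   # eigenvalues in ascending order
    return vecs[:, ::-1][:, :k].T            # top-k eigenvectors as rows

rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))

t0 = time.perf_counter(); ref = reference_pca(X, 5);  t_ref = time.perf_counter() - t0
t0 = time.perf_counter(); cand = optimized_pca(X, 5); t_cand = time.perf_counter() - t0

# Principal axes are unique only up to sign, so compare absolute projections.
assert np.allclose(np.abs(ref @ cand.T), np.eye(5), atol=1e-6)
print(f"speedup: {t_ref / t_cand:.1f}x")
```

A real harness would score the candidate by measured speedup over the reference, subject to passing the correctness check.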
July 2, 2025 at 2:36 PM
I have a post where I talk about how to build good LM benchmarks. I've had to edit the part about making your benchmark hard multiple times now, since LM abilities are accelerating so rapidly.
May 11, 2025 at 9:25 PM
I prompted Claude 3.7 to use JavaScript to animate a ride on the Kingda Ka roller coaster at Six Flags in New Jersey. I did not give it any images/videos from the ride, or any other additional info, and it has no web access.
March 4, 2025 at 3:42 PM
SWE-agent 1.0 is the open-source SOTA on SWE-bench Lite! Tons of new features: massively parallel runs; cloud-based deployment; extensive configurability with tool bundles; new command line interface & utilities.
github.com/swe-agent/sw...
February 13, 2025 at 3:37 PM
SWE-bench Multimodal evaluation code is out now!

SWE-bench MM is a new set of JavaScript issues that have a visual component (‘map isn’t rendering correctly’, ‘button text isn’t appearing’).

www.swebench.com/sb-cli/
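For context on how evaluations like this are run: the SWE-bench harness consumes a predictions file of JSON records with `instance_id`, `model_name_or_path`, and `model_patch` fields. A minimal sketch of building one in Python (the instance id and patch below are placeholders, not real SWE-bench MM data):

```python
# Sketch: write a predictions file in the SWE-bench format.
# Each record pairs an instance id with the model's proposed patch.
import json

predictions = [
    {
        "instance_id": "example__repo-123",  # placeholder instance id
        "model_name_or_path": "my-agent",
        # Placeholder diff, not an actual fix:
        "model_patch": "diff --git a/src/map.js b/src/map.js\n...",
    }
]

with open("preds.json", "w") as f:
    json.dump(predictions, f, indent=2)
```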
January 17, 2025 at 9:06 AM
We just updated the SWE-bench leaderboard with *14* new submissions! Congrats to blackbox.ai & aide.dev on setting new SOTA scores on Lite & Verified!
Congrats to Google on your first submission!

Credit to John Yang, Carlos E. Jimenez & Kilian Lieret, who maintain SWE-bench/SWE-agent
January 8, 2025 at 8:31 PM
When we started working on SWE-agent the top score on SWE-bench was 2%. I told the team that if we got 6%, we'd have a good paper, and I'd buy everyone gelato.

I thought 6% was very ambitious but doable.
We ended up getting 12% 🤯 so I cooked dinner for everyone.
January 2, 2025 at 10:05 AM
We're presenting SWE-agent tomorrow (Wed) at the 11AM poster session, East Exhibit Hall A-C #1000.

We're going to talk about a lot of upcoming SWE-agent features. Join @jyangballin @_carlosejimenez @KLieret and me. I also have a bunch of SWE-agent stickers to hand out :)
December 10, 2024 at 6:16 PM
I'm on the academic job market!
I develop autonomous systems for: programming, research-level question answering, finding security vulnerabilities & other useful+challenging tasks.
I do this by building frontier-pushing benchmarks and agents that do well on them.
See you at NeurIPS!
December 4, 2024 at 4:52 PM
Super cool work from Daniel Geng: "What happens when you train a video generation model to be conditioned on motion? Turns out you can perform 'motion prompting,' just like you might prompt an LLM! Doing so enables many different capabilities."
motion-prompting.github.io
December 4, 2024 at 3:58 AM
Cool benchmark I found through Twitter. (I am not involved in this work) scalingintelligence.stanford.edu/blogs/kernel...
December 3, 2024 at 11:17 PM
Lots of new SWEbench.com results have just been posted. Congrats everyone on the amazing results!
December 3, 2024 at 10:08 PM
The Amazon thing that summarizes all of the reviews into both a sentence and a bullet point list is so awesome and I wish I could figure out a way to make a benchmark out of it :)
December 1, 2024 at 5:18 AM
Quantized SWE-bench coming soon?
November 28, 2024 at 8:04 PM