LMAnalysis
@mlanalysis.bsky.social
WIP: a project dedicated to helping demystify the benchmarking of LLMs.
While we don't have many other benchmarks for it yet, Livebench.ai's low-contamination benchmark puts Gemini in second. It isn't quite o1-level for reasoning and language, but it pulls ahead on math and coding.

Gemini might just be the best publicly available general-purpose model right now!
December 7, 2024 at 8:40 PM
And this story repeats across the board. In fact, in every single category and language, Gemini is in first place or tied for first! I don't know if there has been such a universally strong general-purpose model release since the original GPT-4. (5)
December 7, 2024 at 8:40 PM
With a +45 point leap over its predecessor, Gemini takes a commanding +29 point lead for hard prompts.

With style control, this lead narrows to +19 points, but the leap over the previous Gemini holds at +46 points. Clearly, Google cooked something up with this release! (4)
December 7, 2024 at 8:40 PM
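
For a rough sense of scale, here is a minimal sketch, assuming these Arena scores behave like standard Elo ratings on the usual 400-point logistic scale, of how a point gap maps to an expected head-to-head win rate. The gap values are the ones quoted in this thread.

    # Rough conversion of an Elo-style rating gap into an expected pairwise win rate.
    # Assumption: LMArena scores follow the standard Elo logistic with a 400-point scale.
    def expected_win_rate(rating_gap: float) -> float:
        """Probability the higher-rated model wins a single human-preference vote."""
        return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

    for gap in (15, 19, 29, 36, 45, 46):
        print(f"+{gap} points -> {expected_win_rate(gap):.1%} expected win rate")

Under that assumption, even a +29 point lead works out to only about a 54% expected win rate, which is why the raw overall numbers look incremental until you start filtering.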
With style control, the new Gemini's lead expands to +36 over the previous Gemini, though it's now tied with GPT-4o. But sure, this could be Google figuring out how to bypass the style control filters at LMArena. So let's take a look at hard prompts. (3)
December 7, 2024 at 8:39 PM
On the overall leaderboard of lmarena.ai, Gemini makes a 15-point jump over its two-week-old predecessor to regain first place in human preference.

It looks like an incremental upgrade in the endless optimization war between Google and OpenAI over the last few months, until you start filtering. (2)
December 7, 2024 at 8:39 PM
The new Gemini release from Google has mostly flown under the radar, perhaps understandably so.
🔮
Regaining the #1 spot on the lmarena.ai overall leaderboard feels like Google just finetuned their model for human preference again, but taking a closer look reveals truly remarkable performance... 🧵 (1)
December 7, 2024 at 8:39 PM