LMAnalysis
@mlanalysis.bsky.social
WIP: a project dedicated to helping demystify the benchmarking of LLMs.
While we don't have many other benchmarks for it yet, Livebench.ai's low-contamination benchmark puts Gemini in second. It isn't quite o1-level for reasoning and language, but it pulls ahead on math and coding.

Gemini might just be the best publicly available general-purpose model right now!
December 7, 2024 at 8:40 PM
And this story repeats across the board. In fact, in every single category and language, Gemini is in first place or tied for first! I don't know if there has been such a universally strong general-purpose model release since the original GPT-4. (5)
December 7, 2024 at 8:40 PM
With a +45 point leap over its predecessor, Gemini takes a commanding +29 point lead for hard prompts.

With style control, this lead narrows to +19 points, but the leap over the previous Gemini holds at +46 points. Clearly, Google cooked something up with this release! (4)
December 7, 2024 at 8:40 PM
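
For a rough sense of scale, here is a minimal sketch, assuming these Arena scores behave like standard Elo ratings on the usual 400-point logistic scale, of how a point gap maps to an expected head-to-head win rate. The gap values are the ones quoted in this thread.

    # Rough conversion of an Elo-style rating gap into an expected pairwise win rate.
    # Assumption: LMArena scores follow the standard Elo logistic with a 400-point scale.
    def expected_win_rate(rating_gap: float) -> float:
        """Probability the higher-rated model wins a single human-preference vote."""
        return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

    for gap in (15, 19, 29, 36, 45, 46):
        print(f"+{gap} points -> {expected_win_rate(gap):.1%} expected win rate")

Under that assumption, even a +29 point lead works out to only about a 54% expected win rate, which is why the raw overall numbers look incremental until you start filtering.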
With style control, the new Gemini's lead expands to +36 over the previous Gemini, though it's now tied with GPT-4o. But sure, this could be Google figuring out how to bypass the style control filters at LMArena. So let's take a look at hard prompts. (3)
December 7, 2024 at 8:39 PM
On the overall leaderboard of lmarena.ai, Gemini makes a 15-point jump over its two-week-old predecessor to regain first place in human preference.

It looks like an incremental upgrade in the endless optimization war between Google and OpenAI over the last few months, until you start filtering. (2)
December 7, 2024 at 8:39 PM
The new Gemini release from Google has mostly flown under the radar, perhaps understandably so.
🔮
Regaining the #1 spot on the lmarena.ai overall leaderboard feels like Google just finetuned their model for human preference again, but taking a closer look reveals truly remarkable performance... 🧵 (1)
December 7, 2024 at 8:39 PM