LMAnalysis
@mlanalysis.bsky.social
WIP: a project dedicated to helping demystify the benchmarking of LLMs.
Indeed, unfortunately politics can't be treated as a vacuum that doesn't interact with anything else, especially when the body in power is stifling scientific research.

That being said, it does feel like there isn't much subject-specific academic discussion going on here, unlike old Twitter.
June 6, 2025 at 1:05 AM
Yeah, my position is that pure scaling has worked up until now, at least. Hallucinations still happen, but significant progress is being made nevertheless, for instance in reasoning. I do agree that new ideas need to be implemented; test-time compute (TTC) alone isn't enough to address the fundamental problems.
February 20, 2025 at 5:00 AM
I kind of disagree: pure scaling has resulted in continued improvements in performance where stagnation could have been expected. Just throwing compute at Grok resulted in a SOTA LLM; its problems seem to stem from core flaws of the transformer architecture.

Maybe Titans and LLaDA are on the horizon!
February 19, 2025 at 10:36 PM
While we don't have many other benchmarks for it yet, Livebench.ai's low-contamination benchmark puts Gemini in second. It isn't quite o1-level for reasoning and language, but it pulls ahead on math and coding.

Gemini might just be the best publicly available general-purpose model right now!
December 7, 2024 at 8:40 PM
And this story is repeated across the board. In fact, in every single category and language, Gemini is in first place or tied for first! I don't know if there has been such a universally strong general-purpose model release since the original GPT-4. (5)
December 7, 2024 at 8:40 PM
With a +45-point leap over its predecessor, Gemini takes a commanding +29-point lead on hard prompts.

With style control, this lead narrows to +19 points, but the leap over the previous Gemini remains at +46 points. Clearly, Google cooked something up with this release! (4)
December 7, 2024 at 8:40 PM
With style control, the new Gemini's lead expands to +36 points over the previous Gemini, though it's now tied with GPT-4o. Granted, this could just be Google figuring out how to bypass LMArena's style control filters. So let's take a look at hard prompts. (3)
December 7, 2024 at 8:39 PM
On the overall leaderboard of lmarena.ai, Gemini makes a 15-point jump over its two-week-old predecessor to regain first place in human preference.

It looks like an incremental upgrade in the endless optimization war between Google and OpenAI over the last few months, until you start filtering. (2)
December 7, 2024 at 8:39 PM
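As a back-of-the-envelope aid to the point leads discussed in the thread above, here is a minimal sketch, assuming LMArena's scores sit on a standard Elo-like scale (base 10, divisor 400), that converts a rating gap into an expected head-to-head win rate. The exact scaling is an assumption on my part, not something stated in these posts.

```python
# Minimal sketch, assuming an Elo-like scale (base 10, divisor 400):
# convert a leaderboard point gap into an expected head-to-head win rate.

def expected_win_rate(rating_gap: float) -> float:
    """Probability that the higher-rated model is preferred, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))

# Gaps mentioned in the thread above.
for gap in (15, 19, 29, 36, 45):
    print(f"+{gap} points -> ~{expected_win_rate(gap):.1%} expected win rate")
```

Under that assumption, even a +45-point gap corresponds to only a few percentage points of preference per head-to-head battle, which is part of why leads of this size stand out against the usual shuffle at the top of the leaderboard.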
Yeah, it seems really cool, a much more powerful finetuning method, but based on the sign-up form it seems quite selective.
Also, it will only be rolled out next year.
December 7, 2024 at 4:32 PM
📌
December 6, 2024 at 11:02 PM
📌
December 6, 2024 at 10:58 PM
A sneak peek of some topics I want to write about in the near future!
- Approaching benchmark limits: what comes next?
- Rankings and style control at LMArena: the 'hidden details' of the leaderboard
- o1's (not so) groundbreaking performance...
- Gemini's new release is probably cooler than you think!
December 6, 2024 at 10:48 PM
My main focus for now will be the LMSys/LMArena leaderboard, as it has been widely adopted as the 'gold standard' of a model's performance for everyday use.
But I will also cover many other benchmarks to showcase their strengths and weaknesses in furthering our understanding of an LLM's performance.
December 6, 2024 at 10:48 PM