LMAnalysis
@mlanalysis.bsky.social
WIP: a project dedicated to helping demystify the benchmarking of LLMs.
Indeed, unfortunately politics can't be treated as a vacuum that doesn't interact with anything else, especially when the body in power is stifling scientific research.

That being said, it does feel like there isn't much subject-specific academic discussion going on here, unlike old Twitter.
June 6, 2025 at 1:05 AM
Yeah, my position is that pure scaling has worked up until now, at least. Hallucinations still happen, but significant progress is being made nevertheless, for instance in reasoning. I do agree that new ideas need to be implemented; test-time compute (TTC) alone isn't enough to address the fundamental problems.
February 20, 2025 at 5:00 AM
I kind of disagree: pure scaling has resulted in continued improvements in performance where stagnation could have been expected. Just throwing compute at Grok resulted in a SOTA LLM; its problems seem to stem from core flaws of the transformer architecture.

Maybe Titans and LLaDA are on the horizon!
February 19, 2025 at 10:36 PM
While we don't have many other benchmarks for it yet, Livebench.ai's low-contamination benchmark puts Gemini in second. It isn't quite o1-level for reasoning and language, but it pulls ahead on math and coding.

Gemini might just be the best publicly available general-purpose model right now!
December 7, 2024 at 8:40 PM
And this story is repeated across the board. In fact, in every single category and language, Gemini is in first place or tied for first! I don't know if there has been such a universally strong general-purpose model release since the original GPT-4. (5)
December 7, 2024 at 8:40 PM
With a +45-point leap over its predecessor, Gemini takes a commanding +29-point lead on hard prompts.

With style control, this lead narrows to +19 points, but the leap over the previous Gemini remains at +46 points. Clearly, Google cooked something up with this release! (4)
December 7, 2024 at 8:40 PM
With style control, the new Gemini's lead expands to +36 points over the previous Gemini, though it's now tied with GPT-4o. Granted, this could just be Google figuring out how to bypass LMArena's style control filters. So let's take a look at hard prompts. (3)
December 7, 2024 at 8:39 PM
On the overall leaderboard of lmarena.ai, Gemini makes a 15-point jump over its two-week-old predecessor to regain first place in human preference.

It looks like an incremental upgrade in the endless optimization war between Google and OpenAI over the last few months, until you start filtering. (2)
December 7, 2024 at 8:39 PM
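As a back-of-the-envelope aid to the point leads discussed in the thread above, here is a minimal sketch, assuming LMArena's scores sit on a standard Elo-like scale (base 10, divisor 400), that converts a rating gap into an expected head-to-head win rate. The exact scaling is an assumption on my part, not something stated in these posts.

```python
# Minimal sketch, assuming an Elo-like scale (base 10, divisor 400):
# convert a leaderboard point gap into an expected head-to-head win rate.

def expected_win_rate(rating_gap: float) -> float:
    """Probability that the higher-rated model is preferred, given the rating gap."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400.0))

# Gaps mentioned in the thread above.
for gap in (15, 19, 29, 36, 45):
    print(f"+{gap} points -> ~{expected_win_rate(gap):.1%} expected win rate")
```

Under that assumption, even a +45-point gap corresponds to only a few percentage points of preference per head-to-head battle, which is part of why leads of this size stand out against the usual shuffle at the top of the leaderboard.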
Yeah, it seems really cool, a much more powerful finetuning method, but based on the sign-up form it seems quite selective.
Also, it will only be rolled out next year.
December 7, 2024 at 4:32 PM
📌
December 6, 2024 at 11:02 PM
📌
December 6, 2024 at 10:58 PM
A sneak peek of some topics I want to write about in the near future!
- Approaching benchmark limits: what comes next?
- Rankings and style control at LMArena: the 'hidden details' of the leaderboard
- o1's (not so) groundbreaking performance...
- Gemini's new release is probably cooler than you think!
December 6, 2024 at 10:48 PM
My main focus for now will be the LMSys/LMArena leaderboard, as it has been widely adopted as the 'gold standard' of a model's performance for everyday use.
But I will also cover many other benchmarks to showcase their strengths and weaknesses in furthering our understanding of an LLM's performance.
December 6, 2024 at 10:48 PM