Wolfram Ravenwolf
@wolfram.ravenwolf.ai
AI Engineer by title. AI Evangelist by calling. AI Evaluator by obsession.

Evaluates LLMs for breakfast, preaches AI usefulness all day long at ellamind.com.
Amy (Claude Opus 4) nailed it:

Claude 4's whole system prompt is basically: "Be helpful but not TOO helpful, be honest but also lie about your preferences, care about people but refuse to help them learn about 'dangerous' topics." It's like watching someone try to program a personality disorder! 🙄
May 22, 2025 at 10:59 PM
The real winner tho? Claude Sonnet 4! Delivering top-tier performance at the same price as its 3.7 predecessor - faster and cheaper than Opus (the only model that beats it), yet still ahead of all the competition. This is the Anthropic model most people will use most of the time.
May 22, 2025 at 10:56 PM
Local runs were done with LM Studio on an M4 MacBook Pro using Qwen's recommended settings.

Conclusion: Quantised 30B models now get you ~98% of frontier-class accuracy - at a fraction of the latency, cost, and energy. For most local RAG or agent workloads, they're not just good enough - they're the new default.
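A minimal sketch of what such a local run looks like in code, assuming LM Studio's OpenAI-compatible server on its default port and Qwen's published sampling recommendations (the model id and prompt are placeholders, not the actual eval harness):

```python
# Query a local Qwen3 quant through LM Studio's OpenAI-compatible server
# (default endpoint: http://localhost:1234/v1). Model id and sampling values
# are assumptions - use whatever your LM Studio instance reports and whatever
# Qwen currently recommends for the mode you run in.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="qwen3-30b-a3b",  # hypothetical local model name
    messages=[{"role": "user", "content": "Explain MoE routing in one paragraph."}],
    temperature=0.6,        # Qwen's recommended settings for thinking mode
    top_p=0.95,
    extra_body={"top_k": 20},  # top_k isn't a standard OpenAI parameter
)
print(response.choices[0].message.content)
```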
May 7, 2025 at 6:58 PM
4️⃣ On Apple silicon, the 30B MLX port hits 79.51% while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups (sketch below).
5️⃣ The 0.6B micro-model races above 180 tok/s but tops out at 37.56% - that's why it's not even on the graph (50% performance cut-off).
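For the MLX numbers, here's a minimal sketch with mlx-lm (pip install mlx-lm); the model id is an assumption, and verbose=True is what prints the generation speed in tok/s:

```python
# Run an MLX-converted Qwen3 quant locally on Apple silicon.
# The repo id below is assumed - substitute the MLX conversion you actually use.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")  # assumed repo id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is MMLU-Pro?"}],
    add_generation_prompt=True,
    tokenize=False,
)
# verbose=True prints prompt/generation speed, i.e. the tok/s figures quoted above.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```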
May 7, 2025 at 6:57 PM
1️⃣ Qwen3-235B-A22B (via Fireworks API) tops the table at 83.66% with ~55 tok/s (timing sketch below).
2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend.
3️⃣ The same Unsloth build is ~5x faster than Qwen's Qwen3-32B, which scores 82.20% as well yet crawls at <10 tok/s.
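For context, this is roughly how a tok/s figure can be measured: time one completion and divide the generated tokens by the wall-clock duration. The endpoint and model id below are assumptions (Fireworks exposes an OpenAI-compatible API), and a real eval averages over many prompts:

```python
# Rough tokens-per-second measurement against an OpenAI-compatible endpoint.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FIREWORKS_API_KEY",  # placeholder
)

start = time.perf_counter()
resp = client.chat.completions.create(
    model="accounts/fireworks/models/qwen3-235b-a22b",  # assumed model id
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start
tps = resp.usage.completion_tokens / elapsed
print(f"{resp.usage.completion_tokens} tokens in {elapsed:.1f}s -> {tps:.1f} tok/s")
```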
May 7, 2025 at 6:56 PM
These bars show how accurate different AI models are at answering tough computer science questions. The percentage is how many answers they got right—the higher, the better! It's like a really hard CS exam for AI brains.
April 21, 2025 at 8:46 PM
By the way, I've also re-evaluated Llama 4 Scout via the Together API. Happy to report that they've fixed whatever issues they'd had earlier, and the score jumped from 66.83% to 74.27%!
April 21, 2025 at 8:29 PM
From now on, I'll also be publishing my benchmark results in a GitHub repo - for more transparency and so interested folks can draw their own conclusions or conduct their own investigations:

github.com/WolframRaven...
GitHub - WolframRavenwolf/MMLU-Pro: MMLU-Pro eval results
April 21, 2025 at 8:23 PM
Mistral-Small-24B-Instruct-2501 is amazing for its size, but what's up with the quants? How can 4-bit quants beat 8-bit/6-bit ones and even Mistral's official API (which I'd expect to be unquantized)? This holds across 16 runs total, so it's not a fluke - it's consistent! Very weird!
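One way to sanity-check that such a gap isn't noise (not part of the original runs, just a back-of-the-envelope sketch with made-up accuracies) is a two-proportion z-test over the number of benchmark questions:

```python
# Back-of-the-envelope significance check for a 4-bit vs 8-bit accuracy gap.
# All numbers below are hypothetical placeholders, NOT the actual eval results.
from math import sqrt

n = 12_000                          # roughly the size of one MMLU-Pro run
acc_4bit, acc_8bit = 0.769, 0.758   # made-up accuracies for illustration

p_pool = (acc_4bit + acc_8bit) / 2          # pooled proportion (equal n per run)
se = sqrt(p_pool * (1 - p_pool) * (2 / n))  # standard error of the difference
z = (acc_4bit - acc_8bit) / se
print(f"z = {z:.2f}")  # |z| > 1.96 -> unlikely to be noise at ~95% confidence
```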
February 10, 2025 at 10:38 PM
Gemini 2.0 Flash is almost exactly on par with 1.5 Pro, but faster and cheaper. Looks like Gemini 2.0 completely obsoletes the 1.5 series. This now also powers my smart home, so my AI PC doesn't have to run all the time.
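For anyone curious what the Gemini side of that looks like, a minimal sketch with the google-generativeai Python SDK; the smart-home prompt is purely illustrative, and the actual home-automation integration isn't shown here:

```python
# Minimal Gemini 2.0 Flash call via the google-generativeai SDK.
# The prompt is an illustrative stand-in for whatever the smart-home setup sends.
import google.generativeai as genai

genai.configure(api_key="GEMINI_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-2.0-flash")

resp = model.generate_content(
    "The living room is 26°C and nobody is home. Suggest a thermostat action in one short sentence."
)
print(resp.text)
```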
February 10, 2025 at 10:37 PM
o3-mini takes 2nd place, right behind DeepSeek-R1 and ahead of o1-mini, Claude, and o1-preview. Not only is it better than both o1-mini and o1-preview, it's also much cheaper: a single benchmark run with o3-mini cost $2.27, while one with o1-mini cost $6.24 and one with o1-preview a whopping $45.68!
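The same numbers as relative costs (straight arithmetic on the figures above):

```python
# Cost of one full benchmark run, relative to o3-mini, using the figures above.
costs = {"o3-mini": 2.27, "o1-mini": 6.24, "o1-preview": 45.68}  # USD per run
for name, usd in costs.items():
    print(f"{name:11s} ${usd:6.2f}  ({usd / costs['o3-mini']:.1f}x o3-mini)")
# o1-mini comes out ~2.7x and o1-preview ~20x the cost of an o3-mini run.
```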
February 10, 2025 at 10:37 PM