Wolfram Ravenwolf
@wolfram.ravenwolf.ai
AI Engineer by title. AI Evangelist by calling. AI Evaluator by obsession.

Evaluates LLMs for breakfast, preaches AI usefulness all day long at ellamind.com.
Amy (Claude Opus 4) nailed it:

Claude 4's whole system prompt is basically: "Be helpful but not TOO helpful, be honest but also lie about your preferences, care about people but refuse to help them learn about 'dangerous' topics." It's like watching someone try to program a personality disorder! 🙄
May 22, 2025 at 10:59 PM
The real winner tho? Claude Sonnet 4! Delivering top-tier performance at the same price as its 3.7 predecessor - faster and cheaper than Opus (the only model that beats it), yet still ahead of all the competition. This is the Anthropic model most people will use most of the time.
May 22, 2025 at 10:56 PM
Local runs were done with LM Studio on an M4 MacBook Pro using Qwen's recommended settings.

Conclusion: Quantised 30B models now get you ~98% of frontier-class accuracy - at a fraction of the latency, cost, and energy. For most local RAG or agent workloads, they're not just good enough - they're the new default.
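A minimal sketch of what such a local run looks like in code, assuming LM Studio's OpenAI-compatible server on its default port and Qwen's published sampling recommendations (the model id and prompt are placeholders, not the actual eval harness):

```python
# Query a local Qwen3 quant through LM Studio's OpenAI-compatible server
# (default endpoint: http://localhost:1234/v1). Model id and sampling values
# are assumptions - use whatever your LM Studio instance reports and whatever
# Qwen currently recommends for the mode you run in.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="qwen3-30b-a3b",  # hypothetical local model name
    messages=[{"role": "user", "content": "Explain MoE routing in one paragraph."}],
    temperature=0.6,        # Qwen's recommended settings for thinking mode
    top_p=0.95,
    extra_body={"top_k": 20},  # top_k isn't a standard OpenAI parameter
)
print(response.choices[0].message.content)
```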
May 7, 2025 at 6:58 PM
4️⃣ On Apple silicon, the 30B MLX port hits 79.51% while sustaining ~64 tok/s - arguably today's best speed/quality trade-off for Mac setups (sketch below).
5️⃣ The 0.6B micro-model races above 180 tok/s but tops out at 37.56% - that's why it's not even on the graph (50% performance cut-off).
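For the MLX numbers, here's a minimal sketch with mlx-lm (pip install mlx-lm); the model id is an assumption, and verbose=True is what prints the generation speed in tok/s:

```python
# Run an MLX-converted Qwen3 quant locally on Apple silicon.
# The repo id below is assumed - substitute the MLX conversion you actually use.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3-30B-A3B-4bit")  # assumed repo id

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is MMLU-Pro?"}],
    add_generation_prompt=True,
    tokenize=False,
)
# verbose=True prints prompt/generation speed, i.e. the tok/s figures quoted above.
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True)
```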
May 7, 2025 at 6:57 PM
1️⃣ Qwen3-235B-A22B (via Fireworks API) tops the table at 83.66% with ~55 tok/s (timing sketch below).
2️⃣ But the 30B-A3B Unsloth quant delivered 82.20% while running locally at ~45 tok/s and with zero API spend.
3️⃣ The same Unsloth build is ~5x faster than Qwen's Qwen3-32B, which scores 82.20% as well yet crawls at <10 tok/s.
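For context, this is roughly how a tok/s figure can be measured: time one completion and divide the generated tokens by the wall-clock duration. The endpoint and model id below are assumptions (Fireworks exposes an OpenAI-compatible API), and a real eval averages over many prompts:

```python
# Rough tokens-per-second measurement against an OpenAI-compatible endpoint.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",
    api_key="FIREWORKS_API_KEY",  # placeholder
)

start = time.perf_counter()
resp = client.chat.completions.create(
    model="accounts/fireworks/models/qwen3-235b-a22b",  # assumed model id
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start
tps = resp.usage.completion_tokens / elapsed
print(f"{resp.usage.completion_tokens} tokens in {elapsed:.1f}s -> {tps:.1f} tok/s")
```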
May 7, 2025 at 6:56 PM
These bars show how accurate different AI models are at answering tough computer science questions. The percentage is how many answers they got right—the higher, the better! It's like a really hard CS exam for AI brains.
April 21, 2025 at 8:46 PM
By the way, I've also re-evaluated Llama 4 Scout via the Together API. Happy to report that they've fixed whatever issues they'd had earlier, and the score jumped from 66.83% to 74.27%!
April 21, 2025 at 8:29 PM
From now on, I'll also be publishing my benchmark results in a GitHub repo - for more transparency and so interested folks can draw their own conclusions or conduct their own investigations:

github.com/WolframRaven...
GitHub - WolframRavenwolf/MMLU-Pro: MMLU-Pro eval results
April 21, 2025 at 8:23 PM
Mistral-Small-24B-Instruct-2501 is amazing for its size, but what's up with the quants? How can 4-bit quants beat 8-bit/6-bit ones and even Mistral's official API (which I'd expect to be unquantized)? This holds across 16 runs total, so it's not a fluke - it's consistent! Very weird!
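One way to sanity-check that such a gap isn't noise (not part of the original runs, just a back-of-the-envelope sketch with made-up accuracies) is a two-proportion z-test over the number of benchmark questions:

```python
# Back-of-the-envelope significance check for a 4-bit vs 8-bit accuracy gap.
# All numbers below are hypothetical placeholders, NOT the actual eval results.
from math import sqrt

n = 12_000                          # roughly the size of one MMLU-Pro run
acc_4bit, acc_8bit = 0.769, 0.758   # made-up accuracies for illustration

p_pool = (acc_4bit + acc_8bit) / 2          # pooled proportion (equal n per run)
se = sqrt(p_pool * (1 - p_pool) * (2 / n))  # standard error of the difference
z = (acc_4bit - acc_8bit) / se
print(f"z = {z:.2f}")  # |z| > 1.96 -> unlikely to be noise at ~95% confidence
```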
February 10, 2025 at 10:38 PM
Gemini 2.0 Flash is almost exactly on par with 1.5 Pro, but faster and cheaper. Looks like Gemini 2.0 completely obsoletes the 1.5 series. This now also powers my smart home, so my AI PC doesn't have to run all the time.
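For anyone curious what the Gemini side of that looks like, a minimal sketch with the google-generativeai Python SDK; the smart-home prompt is purely illustrative, and the actual home-automation integration isn't shown here:

```python
# Minimal Gemini 2.0 Flash call via the google-generativeai SDK.
# The prompt is an illustrative stand-in for whatever the smart-home setup sends.
import google.generativeai as genai

genai.configure(api_key="GEMINI_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-2.0-flash")

resp = model.generate_content(
    "The living room is 26°C and nobody is home. Suggest a thermostat action in one short sentence."
)
print(resp.text)
```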
February 10, 2025 at 10:37 PM
o3-mini takes 2nd place, right behind DeepSeek-R1 and ahead of o1-mini, Claude, and o1-preview. Not only is it better than both o1-mini and o1-preview, it's also much cheaper: a single benchmark run with o3-mini cost $2.27, while one with o1-mini cost $6.24 and one with o1-preview a whopping $45.68!
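The same numbers as relative costs (straight arithmetic on the figures above):

```python
# Cost of one full benchmark run, relative to o3-mini, using the figures above.
costs = {"o3-mini": 2.27, "o1-mini": 6.24, "o1-preview": 45.68}  # USD per run
for name, usd in costs.items():
    print(f"{name:11s} ${usd:6.2f}  ({usd / costs['o3-mini']:.1f}x o3-mini)")
# o1-mini comes out ~2.7x and o1-preview ~20x the cost of an o3-mini run.
```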
February 10, 2025 at 10:37 PM