#AIbenchmarks
AI benchmarks are a bad joke – and LLM makers are the ones laughing https://www.theregister.com/2025/11/07/measuring_ai_models_hampered_by/ #HackerNews #AIbenchmarks #LLMs #AIethics #technews #humor

Origin
mastodon.social
November 8, 2025 at 3:29 PM
A #study from the #Oxford Internet Institute analysed 445 #AIbenchmarks, finding that many #oversell #AIperformance and lack scientific rigour. The study highlights issues like #uncleardefinitions, #datareuse, and inadequate #statisticalmethods, calling for more rigorous and transparent benchmark…
November 7, 2025 at 2:37 PM
AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds #Technology #SoftwareEngineering #ArtificialIntelligence #AIBenchmarks #TechNews
AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds
They're dumber than you think and they might be cheating.
puretech.news
November 6, 2025 at 9:45 PM
Hundreds of the benchmark tests used to vet today’s most powerful AI models are riddled with flaws, a new research report has warned, raising serious questions about how developers measure the safety of widely used LLMs.

www.digit.fyi/flaws-in-ai-...
#tech #AI #AIbenchmarks #AIsafety
Security Experts Uncover Major Flaws in Hundreds of AI Benchmarks
AI researchers have called for urgent reform of model testing, warning that current benchmarks give a false sense of safety.
www.digit.fyi
November 4, 2025 at 1:15 PM
The "pelican on a bicycle" SVG test emerged as a unique benchmark. It assesses a model's ability to generate structured code & understand spatial relationships. Some find it useful, while others question its practical relevance for real-world tasks. #AIbenchmarks 4/6
October 17, 2025 at 1:00 PM
AI leaderboards are collapsing under Goodhart’s Law. Discover why the next evolution is personal, decentralized, and self-centered. #aibenchmarks
AI Benchmarks: Why Useless, Personalized Agents Prevail
hackernoon.com
October 5, 2025 at 12:31 PM
Current AI benchmarks are criticized for not reflecting real-world coding nuances. A call for more comprehensive benchmarks includes evaluating time-to-completion and the quality of generated code. #AIBenchmarks 6/6
September 30, 2025 at 7:00 AM
FLARE, a Faithful Logic‑Aided Reasoning system, achieved state‑of‑the‑art results on 7 of 9 reasoning benchmarks, showing that higher model faithfulness is tied to better performance. https://getnews.me/flare-introduces-faithful-logic-aided-reasoning-for-ai-question-answering/ #flare #aibenchmarks
September 22, 2025 at 8:14 PM
It's possible for AI tools to advance the UN's #SDGs, but industry needs to align developer goals with community priorities. This requires better #AIBenchmarks. What are benchmarks, and how can they help? @b-cavello.bsky.social explains in this video. Learn more: www.aspendigital.org/event/food-s...
September 22, 2025 at 4:38 PM
Scale AI Launches ‘SEAL Showdown’ LLM Leaderboard - Can it Dethrone LMArena? #AI #ScaleAI #AIBenchmarks #LMArena #SEALShowdown #Tech https://winbuzzer.com/2025/09/22/scale-ai-launches-seal-showdown-llm-leaderboard-can-it-dethrone-lmarena-xcxwbn

Origin
mastodon.social
September 22, 2025 at 4:13 PM
🚀 Impressive leap on the #AI leaderboard: Mistral AI's new Magistral models just jumped up the Artificial Analysis Intelligence Index—punching far above their weight and rivaling models many times larger. Size isn’t everything anymore! #AIbenchmarks #LLMs #MistralAI
September 21, 2025 at 11:44 PM
Hacker News debated a flaw in SWE-bench: AI models could "cheat" by accessing Git history to see problem fixes. This sparked concerns about benchmark validity, the SWE-bench team's response, and broader issues in AI evaluation. #AIBenchmarks 1/6
September 12, 2025 at 1:00 PM
It's time to reclaim #AIForDevelopment. In a new report, our policy fellow Francisco Jure explains how creating new #AIBenchmarks can help steer innovation toward societal resilience. Learn more in "Reclaiming AI for Development": www.aspendigital.org/report/reclaiming-ai-for-development
September 10, 2025 at 7:12 PM
Benchmarking shows Apertus excels in general knowledge & multilingual support. While it lags behind models like Llama 3.1 in code & reasoning, its detailed performance reports offer crucial insights for future improvements. #AIBenchmarks 5/5
September 6, 2025 at 7:00 PM
How can we steer innovation toward resilience? It's time to reclaim #AIForDevelopment. Our policy fellow Francisco Jure explains that creating new #AIBenchmarks can help. Learn more in "Reclaiming AI for Development": www.aspendigital.org/report/reclaiming-ai-for-development
August 28, 2025 at 6:36 PM
Ultimately, while direct IQ tests for AI are flawed, the debate itself illuminates what we value in intelligence. It pushes us to define AI's cognitive strengths & weaknesses more precisely, beyond human-centric benchmarks. #AIBenchmarks 6/6
August 18, 2025 at 10:15 AM
BREAKING: Gemini 3.0 leak suggests Google just crushed the competition 💥

The leaked "Humanity's Last Exam" scores have me shook:
Gemini 3.0: 32.4%
GPT-5: 26.5%

#GeminiLeak #AIWars #GoogleAI #GPT5 #TechRumors #ArtificialIntelligence #NoCodeAI #TechNews #AIBenchmarks #FutureIsNow
August 15, 2025 at 10:10 PM
The craziest leak claims photo‑to‑app generation that feels like instant co‑founder energy, shifting the benchmark meta and raising the stakes for everyone in the AI race ⚡🧪📊📲

#Gemini3Leak #GPT5 #Grok4 #AIBenchmarks #PhotoToApp #NoCodeAI #GoogleDeepMind #AIGeneration #TechLeak #AICompetition
August 12, 2025 at 8:16 PM
Apple’s new AI models revealed at WWDC 2025 reportedly underperformed compared to expectations.
Despite the fanfare, benchmark scores fell short of rivals'.

👉 Read more: techthrilled.com/apple-ai-mod...

#AppleAI #WWDC2025 #AIPerformance #AIbenchmarks #TechNews #AITechnology
Apple CEO Tim Cook: “We Must Win in AI”
Apple CEO Tim Cook unveils an ambitious AI strategy, declaring AI a top priority as Apple races to lead the next wave of innovation in artificial intelligence.
techthrilled.com
August 10, 2025 at 8:07 AM