Lightnews — Scholar-powered news

techlife-blog.bsky.social

@techlife-blog.bsky.social

Moonshot AI's Kimi K2 Thinking Model Surpasses OpenAI's GPT-5

techlife.blog/posts/moonsh...

#AI #OpenAI #Kimi #KimiK2 #MoonshotAI #AIBenchmarks

Moonshot AI's Kimi K2 Thinking Model Surpasses OpenAI's GPT-5

Moonshot AI's Kimi K2 Thinking model outperforms OpenAI's GPT-5, sparking debate on AI dominance.

techlife.blog

November 11, 2025 at 11:27 AM

Awakari

@bluesky.awakari.com

AI benchmarks are a bad joke – and LLM makers are the ones laughing https://www.theregister.com/2025/11/07/measuring_ai_models_hampered_by/ #HackerNews #AIbenchmarks #LLMs #AIethics #technews #humor

Origin

mastodon.social

November 8, 2025 at 3:29 PM

Gerrit Eicker

@eicker.bsky.social

A #study from the #Oxford Internet Institute analysed 445 #AIbenchmarks, finding that many #oversell #AIperformance and lack scientific rigour. The study highlights issues like #uncleardefinitions, #datareuse, and inadequate #statisticalmethods, calling for more rigorous and transparent benchmark…

November 7, 2025 at 2:37 PM

Pure Tech

@puretech.news

AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds #Technology #SoftwareEngineering #ArtificialIntelligence #AIBenchmarks #TechNews

AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds

They're dumber than you think and they might be cheating.

puretech.news

November 6, 2025 at 9:45 PM

DIGIT

@digitfyi.bsky.social

Hundreds of the benchmark tests used to vet today’s most powerful AI models are riddled with flaws, a new research report has warned, raising serious questions about how developers measure the safety of widely used LLMs.

www.digit.fyi/flaws-in-ai-...
#tech #AI #AIbenchmarks #AIsafety

Security Experts Uncover Major Flaws in Hundreds of AI Benchmarks

AI researchers have called for urgent reform of model testing, warning that current benchmarks give a false sense of safety.

www.digit.fyi

November 4, 2025 at 1:15 PM

AndreevWebStudio

@andreevwebstudio.com

🔥 GPT-5 vs Gemini 2.5 — who wins in real work, not just in flashy charts?

👉 Read now: aiinovationhub.com/gpt-5-vs-gem...
#GPT5 #Gemini25 #LLM #AI #AIBenchmarks #MultimodalAI #OpenAI #GoogleDeepMind #AIModels2025 #AIInnovationHub #Productivity #DevTools #ContentCreation

October 18, 2025 at 2:50 PM

Hacker News Companion

@hncompanion.com

The "pelican on a bicycle" SVG test emerged as a unique benchmark. It assesses a model's ability to generate structured code & understand spatial relationships. Some find it useful, while others question its practical relevance for real-world tasks. #AIbenchmarks 4/6

October 17, 2025 at 1:00 PM

Ars Technica News

@arstechni.ca

Anthropic’s Claude Haiku 4.5 matches May’s frontier model at fraction of cost https://arstechni.ca... #largelanguagemodels #AIdevelopmenttools #machinelearning #AIprogramming #AmazonBedrock #AIbenchmarks #ClaudeSonnet #AIalignment #ClaudeHaiku #googlecloud #codeagents #Anthropic #AIcoding…

October 15, 2025 at 7:01 PM

HackerNoon

@hackernoon.com

AI leaderboards are collapsing under Goodhart’s Law. Discover why the next evolution is personal, decentralized, and self-centered. #aibenchmarks

AI Benchmarks: Why Useless, Personalized Agents Prevail

hackernoon.com

October 5, 2025 at 12:31 PM

Hacker News Companion

@hncompanion.com

Current AI benchmarks are criticized for not reflecting real-world coding nuances. A call for more comprehensive benchmarks includes evaluating time-to-completion and the quality of generated code. #AIBenchmarks 6/6

September 30, 2025 at 7:00 AM

Ars Technica News

@arstechni.ca

Anthropic says its new AI model “maintained focus” for 30 hours on multistep tasks https://arstechni.ca... #Computer-UsingAgent #largelanguagemodels #AIdevelopmenttools #computerusemodel #machinelearning #AIcomputeruse #SimonWillison #AIassistants #AIbenchmarks #generativeai #Programming…

September 29, 2025 at 11:03 PM

GetNews.me

@getnews-me.bsky.social

FLARE, a Faithful Logic‑Aided Reasoning system, achieved state‑of‑the‑art results on 7 of 9 reasoning benchmarks, suggesting that higher model faithfulness is tied to better performance. https://getnews.me/flare-introduces-faithful-logic-aided-reasoning-for-ai-question-answering/ #flare #aibenchmarks

FLARE Introduces Faithful Logic‑Aided Reasoning for AI Question Answering

September 22, 2025 at 8:14 PM

Aspen Digital

@aspendigital.bsky.social

It's possible for AI tools to advance the UN's #SDGs, but industry needs to align developer goals with community priorities. This requires better #AIBenchmarks. What are benchmarks, and how can they help? @b-cavello.bsky.social explains in this video. Learn more: www.aspendigital.org/event/food-s...

September 22, 2025 at 4:38 PM

Awakari

@bluesky.awakari.com

Scale AI Launches ‘SEAL Showdown’ LLM Leaderboard - Can it Dethrone LMArena? #AI #ScaleAI #AIBenchmarks #LMArena #SEALShowdown #Tech https://winbuzzer.com/2025/09/22/scale-ai-launches-seal-showdown-llm-leaderboard-can-it-dethrone-lmarena-xcxwbn

Origin

mastodon.social

September 22, 2025 at 4:13 PM

Winbuzzer

@winbuzzer.com

Scale AI Launches ‘SEAL Showdown’ LLM Leaderboard - Can it Dethrone LMArena?

#AI #ScaleAI #AIBenchmarks #LMArena #SEALShowdown #Tech

winbuzzer.com/2025/09/22/s...

Scale AI Launches ‘SEAL Showdown’ LLM Leaderboard - Can it Dethrone LMArena? - WinBuzzer

Reeling from its Meta partnership, Scale AI launches SEAL Showdown, a new AI leaderboard aimed at fixing flawed AI benchmarks with a diverse user base.

winbuzzer.com

September 22, 2025 at 4:13 PM

0xWulf

@hexawulf.bsky.social

🚀 Impressive leap on the #AI leaderboard: Mistral AI's new Magistral models just jumped up the Artificial Analysis Intelligence Index—punching far above their weight and rivaling models many times larger. Size isn’t everything anymore! #AIbenchmarks #LLMs #MistralAI

September 21, 2025 at 11:44 PM

Hacker News Companion

@hncompanion.com

Hacker News debated a flaw in SWE-bench: AI models could "cheat" by accessing Git history to see problem fixes. This sparked concerns about benchmark validity, the SWE-bench team's response, and broader issues in AI evaluation. #AIBenchmarks 1/6

September 12, 2025 at 1:00 PM

Aspen Digital

@aspendigital.bsky.social

It's time to reclaim #AIForDevelopment. In a new report, our policy fellow Francisco Jure explains how creating new #AIBenchmarks can help steer innovation toward societal resilience. Learn more in "Reclaiming AI for Development": www.aspendigital.org/report/reclaiming-ai-for-development

An illustration of hands measuring an open UN Sustainable Development Goals color wheel.

September 10, 2025 at 7:12 PM

Hacker News Companion

@hncompanion.com

Benchmarking shows Apertus excels in general knowledge & multilingual support. While it lags behind models like Llama 3.1 in code & reasoning, its detailed performance reports offer crucial insights for future improvements. #AIBenchmarks 5/5

September 6, 2025 at 7:00 PM

Aspen Digital

@aspendigital.bsky.social

How can we steer innovation toward resilience? It's time to reclaim #AIForDevelopment. Our policy fellow Francisco Jure explains that creating new #AIBenchmarks can help. Learn more in "Reclaiming AI for Development": www.aspendigital.org/report/reclaiming-ai-for-development

August 28, 2025 at 6:36 PM

Hacker News Companion

@hncompanion.com

Ultimately, while direct IQ tests for AI are flawed, the debate itself illuminates what we value in intelligence. It pushes us to define AI's cognitive strengths & weaknesses more precisely, beyond human-centric benchmarks. #AIBenchmarks 6/6

August 18, 2025 at 10:15 AM

Julian Goldie

@juliangoldie.bsky.social

BREAKING: Gemini 3.0 leak suggests Google just crushed the competition 💥

The leaked "Humanity's Last Exam" scores have me shook:
Gemini 3.0: 32.4%
GPT-5: 26.5%

#GeminiLeak #AIWars #GoogleAI #GPT5 #TechRumors #ArtificialIntelligence #NoCodeAI #TechNews #AIBenchmarks #FutureIsNow

August 15, 2025 at 10:10 PM

Julian Goldie

@juliangoldie.bsky.social

The craziest leak claims photo‑to‑app generation that feels like instant co‑founder energy, shifting the benchmark meta and raising the stakes for everyone in the AI race ⚡🧪📊📲

#Gemini3Leak #GPT5 #Grok4 #AIBenchmarks #PhotoToApp #NoCodeAI #GoogleDeepMind #AIGeneration #TechLeak #AICompetition

August 12, 2025 at 8:16 PM

Tech Thrilled

@techthrilled.bsky.social

Apple’s new AI models revealed at WWDC 2025 reportedly underperformed expectations.
Despite the fanfare, their benchmark scores fell short of rivals’.

👉 Read more: techthrilled.com/apple-ai-mod...

#AppleAI #WWDC2025 #AIPerformance #AIbenchmarks #TechNews #AITechnology

Apple CEO Tim Cook: “We Must Win in AI”

Apple CEO Tim Cook unveils an ambitious AI strategy, declaring AI a top priority as Apple races to lead the next wave of innovation in artificial intelligence.

techthrilled.com

August 10, 2025 at 8:07 AM
