#LlmEval
An LLM‑based metric scores self‑supervised speech models via the log‑likelihood of token sequences, avoiding extra training. Accepted at IEEE ASRU 2025, it showed high correlation with ASR benchmarks. https://getnews.me/llm-metric-scores-self-supervised-speech-models-without-training/ #asru2025 #llmeval
October 8, 2025 at 1:40 AM
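Below is a minimal sketch of the scoring idea in the post above: rank candidate outputs by the log-likelihood an off-the-shelf causal LLM assigns to their token sequences, with no additional training. The scorer model choice and the helper `sequence_log_likelihood` are illustrative assumptions, not details from the ASRU paper.

```python
# Sketch: score text with the log-likelihood of its token sequence
# under a pretrained causal LM (no fine-tuning involved).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM can act as the scorer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def sequence_log_likelihood(text: str) -> float:
    """Sum of log p(token_t | tokens_<t) over the sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels == inputs, HF shifts internally and returns the
    # mean cross-entropy over the predicted positions.
    loss = model(ids, labels=ids).loss
    n_predicted = ids.size(1) - 1
    return -(loss.item() * n_predicted)

# Rank two hypothetical transcripts from a speech model; the one the
# LLM finds more probable gets the higher (less negative) score.
for hyp in ["the cat sat on the mat", "the cat sat on the matt"]:
    print(f"{sequence_log_likelihood(hyp):8.2f}  {hyp}")
```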
Ming Zhang, et al.: LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models https://arxiv.org/abs/2508.05452 https://arxiv.org/pdf/2508.05452 https://arxiv.org/html/2508.05452
August 8, 2025 at 6:30 AM
Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Yue Zhang, Junzhe Wang, Shichun Liu, Shihan Dou, Huayu Sha, Qiyuan Peng, Changhao Jiang, Jingqi Tong, ...
LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
https://arxiv.org/abs/2508.05452
August 8, 2025 at 5:52 AM
[2025-06-05] 📚 Updates in #LM&MA

(1) LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation https://researchtrend.ai/papers/2506.04078

🔍 More at researchtrend.ai/communities/LM&MA
June 6, 2025 at 8:13 AM
Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, ...
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
https://arxiv.org/abs/2506.04078
June 5, 2025 at 6:24 AM
deployment of LLMs in medical domains. The dataset is released in https://github.com/llmeval/LLMEval-Med. [6/6 of https://arxiv.org/abs/2506.04078v1]
June 5, 2025 at 6:11 AM
and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective [5/6 of https://arxiv.org/abs/2506.04078v1]
June 5, 2025 at 6:11 AM
assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We [3/6 of https://arxiv.org/abs/2506.04078v1]
June 5, 2025 at 6:11 AM
Zhang, Shen, Li, Sha, Hu, Wang, Huang, Liu, Tong, Jiang, Chai, Xi, Dou, Gui, Zhang, Huang: LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation https://arxiv.org/abs/2506.04078 https://arxiv.org/pdf/2506.04078 https://arxiv.org/html/2506.04078
June 5, 2025 at 6:11 AM
It's LLMEval from the mlx-swift-examples repo
February 4, 2025 at 6:20 PM
Excited to co-organize the HEAL workshop at @acm_chi 2025!
HEAL addresses the "evaluation crisis" in LLM research and brings HCI and AI experts together to develop human-centered approaches to evaluating and auditing LLMs.
🔗 heal-workshop.github.io
#NLProc #LLMeval #LLMsafety
January 3, 2025 at 2:07 AM