#LlmEval
An LLM‑based metric scores self‑supervised speech models via the log‑likelihood of token sequences, avoiding extra training. Accepted at IEEE ASRU 2025, it showed high correlation with ASR benchmarks. https://getnews.me/llm-metric-scores-self-supervised-speech-models-without-training/ #asru2025 #llmeval
October 8, 2025 at 1:40 AM
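Below is a minimal sketch of the scoring idea in the post above: rank candidate outputs by the log-likelihood an off-the-shelf causal LLM assigns to their token sequences, with no additional training. The scorer model choice and the helper `sequence_log_likelihood` are illustrative assumptions, not details from the ASRU paper.

```python
# Sketch: score text with the log-likelihood of its token sequence
# under a pretrained causal LM (no fine-tuning involved).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM can act as the scorer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def sequence_log_likelihood(text: str) -> float:
    """Sum of log p(token_t | tokens_<t) over the sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    # With labels == inputs, HF shifts internally and returns the
    # mean cross-entropy over the predicted positions.
    loss = model(ids, labels=ids).loss
    n_predicted = ids.size(1) - 1
    return -(loss.item() * n_predicted)

# Rank two hypothetical transcripts from a speech model; the one the
# LLM finds more probable gets the higher (less negative) score.
for hyp in ["the cat sat on the mat", "the cat sat on the matt"]:
    print(f"{sequence_log_likelihood(hyp):8.2f}  {hyp}")
```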
Ming Zhang, et al.: LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models https://arxiv.org/abs/2508.05452 https://arxiv.org/pdf/2508.05452 https://arxiv.org/html/2508.05452
August 8, 2025 at 6:30 AM
Ming Zhang, Yujiong Shen, Jingyi Deng, Yuhui Wang, Yue Zhang, Junzhe Wang, Shichun Liu, Shihan Dou, Huayu Sha, Qiyuan Peng, Changhao Jiang, Jingqi Tong, ...
LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
https://arxiv.org/abs/2508.05452
August 8, 2025 at 5:52 AM
[2025-06-05] 📚 Updates in #LM&MA

(1) LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation https://researchtrend.ai/papers/2506.04078

🔍 More at researchtrend.ai/communities/LM&MA
June 6, 2025 at 8:13 AM
Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, ...
LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
https://arxiv.org/abs/2506.04078
June 5, 2025 at 6:24 AM
deployment of LLMs in medical domains. The dataset is released in https://github.com/llmeval/LLMEval-Med. [6/6 of https://arxiv.org/abs/2506.04078v1]
June 5, 2025 at 6:11 AM
and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective [5/6 of https://arxiv.org/abs/2506.04078v1]
June 5, 2025 at 6:11 AM
assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We [3/6 of https://arxiv.org/abs/2506.04078v1]
June 5, 2025 at 6:11 AM
Zhang, Shen, Li, Sha, Hu, Wang, Huang, Liu, Tong, Jiang, Chai, Xi, Dou, Gui, Zhang, Huang: LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation https://arxiv.org/abs/2506.04078 https://arxiv.org/pdf/2506.04078 https://arxiv.org/html/2506.04078
June 5, 2025 at 6:11 AM
It's LLMEval from the mlx-swift-examples repo
February 4, 2025 at 6:20 PM
Excited to co-organize the HEAL workshop at @acm_chi 2025!
HEAL addresses the "evaluation crisis" in LLM research and brings HCI and AI experts together to develop human-centered approaches to evaluating and auditing LLMs.
🔗 heal-workshop.github.io
#NLProc #LLMeval #LLMsafety
January 3, 2025 at 2:07 AM