Beiduo Chen
@beiduo.bsky.social
ELLIS PhD student in NLP @MaiNLPlab, @CisLmu, @LMU_Muenchen
https://mckysse.github.io/
Our paper: arxiv.org/pdf/2505.23368
Our code: github.com/mainlp/CoT2EL
Thank you to my wonderful co-authors, @janetlauyeung.bsky.social, Anna Korhonen, and @barbaraplank.bsky.social. Also to @mainlp.bsky.social, @cislmu.bsky.social, @munichcenterml.bsky.social.
See you in Suzhou!
#NLP #EMNLP2025
October 24, 2025 at 1:42 PM
Matching exact probabilities for human label variation (HLV) is unstable, so we propose a more robust rank-based evaluation that checks preference order. Our combined method outperforms baselines on three datasets that exhibit HLV, showing it better aligns with diverse human perspectives.
October 24, 2025 at 1:37 PM
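To make the idea concrete, here is a minimal sketch, illustrative only and not the paper's exact metric: it contrasts exact probability matching (KL divergence) with a rank-based check of label preference order via Spearman correlation, over an assumed NLI label set and made-up distributions.

```python
# Minimal sketch contrasting exact probability matching with a rank-based check.
# The NLI label set, the example numbers, and the use of Spearman correlation
# are illustrative assumptions, not the exact metric from the paper.
from scipy.special import rel_entr   # elementwise terms of KL divergence
from scipy.stats import spearmanr

LABELS = ["entailment", "neutral", "contradiction"]  # assumed label set

def kl_divergence(p, q):
    """Exact probability matching: KL(human || model), sensitive to small shifts."""
    return sum(rel_entr(pi, qi) for pi, qi in zip(p, q))

def rank_agreement(p, q):
    """Rank-based check: do both distributions order the labels the same way?"""
    rho, _ = spearmanr(p, q)
    return rho

human = [0.50, 0.35, 0.15]  # human judgment distribution over LABELS
model = [0.40, 0.38, 0.22]  # model-estimated distribution over LABELS

print(f"KL divergence:        {kl_divergence(human, model):.3f}")
print(f"rank agreement (rho): {rank_agreement(human, model):.3f}")  # 1.0 here: same preference order
```

Even when the exact probabilities drift, the preference order over labels can stay stable, which is what a rank-based evaluation rewards.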
Instead of relying on unnatural post-hoc explanations, we look forward: a model's chain of thought (CoT) already contains rationales for all options. We introduce CoT2EL, a pipeline that uses linguistic discourse segmenters to extract these high-quality, faithful units to explore human label variation.
October 24, 2025 at 1:37 PM
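To illustrate the extraction idea, here is a toy sketch only: the regex splitter and the option keywords below are stand-in assumptions, whereas the actual pipeline uses dedicated linguistic discourse segmenters. It splits a CoT into small units and groups them by the answer option they mention.

```python
# Toy sketch of the extraction idea behind CoT2EL: split a chain of thought into
# small units and group them by the answer option they argue about. The regex
# splitter and the option keywords are stand-in assumptions; the actual pipeline
# uses dedicated linguistic discourse segmenters.
import re

OPTIONS = ["entailment", "neutral", "contradiction"]  # assumed NLI answer options

def segment(cot: str) -> list[str]:
    """Naive stand-in for a discourse segmenter: split on sentence boundaries
    and on commas that precede common discourse connectives."""
    parts = re.split(r"(?<=[.!?])\s+|,\s+(?=(?:but|however|although|so)\b)", cot)
    return [p.strip() for p in parts if p.strip()]

def units_per_option(cot: str) -> dict[str, list[str]]:
    """Attach each extracted unit to every answer option it mentions."""
    grouped = {opt: [] for opt in OPTIONS}
    for unit in segment(cot):
        for opt in OPTIONS:
            if opt in unit.lower():
                grouped[opt].append(unit)
    return grouped

cot = ("The premise mentions a man outdoors, so entailment is plausible. "
       "However, the hypothesis adds a dog that is never stated, so neutral also fits. "
       "Nothing directly conflicts, so contradiction seems unlikely.")
for option, units in units_per_option(cot).items():
    print(option, "->", units)
```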
🌍 Broader impact:
Our approach makes capturing disagreement scalable, helping build datasets that reflect real-world ambiguity—without requiring tons of human-written explanations.
Open-sourcing:
📂 github.com/mainlp/MJD-E...
GitHub - mainlp/MJD-Estimator: Implementation of the EMNLP 2024 paper "Seeing the Big through the Small": Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations? and the ACL 2025 paper.
July 15, 2025 at 2:51 PM
🧠 What’s this about?
Human annotations often disagree. Instead of collapsing disagreement into a single label, we model Human Judgment Distributions — how likely humans are to choose each label in NLI tasks.
Capturing this is crucial for interpretability and uncertainty in NLP.
July 15, 2025 at 2:50 PM
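For context, a Human Judgment Distribution for a single item is simply the normalized vote count over labels. A small sketch with made-up annotator votes shows what collapsing to one majority label would hide.

```python
# Minimal sketch of a Human Judgment Distribution (HJD) for a single NLI item:
# keep the normalized vote distribution instead of collapsing to one label.
# The annotator votes below are made up for illustration.
from collections import Counter

LABELS = ["entailment", "neutral", "contradiction"]

def human_judgment_distribution(annotations: list[str]) -> dict[str, float]:
    """Normalize annotator votes into a probability distribution over labels."""
    counts = Counter(annotations)
    return {label: counts[label] / len(annotations) for label in LABELS}

votes = ["entailment", "neutral", "entailment", "neutral", "contradiction",
         "entailment", "neutral", "neutral", "entailment", "neutral"]

hjd = human_judgment_distribution(votes)
print("HJD:", hjd)                               # {'entailment': 0.4, 'neutral': 0.5, 'contradiction': 0.1}
print("Majority label:", max(hjd, key=hjd.get))  # 'neutral' alone hides the 40% who chose 'entailment'
```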
🔗 Paper link: arxiv.org/abs/2412.13942
🙏 Huge thanks to our collaborators Logan Siyao Peng, @barbaraplank.bsky.social, Anna Korhonen from @mainlp.bsky.social, @lmumuenchen.bsky.social, @cambridgeltl.bsky.social
A Rose by Any Other Name: LLM-Generated Explanations Are Good Proxies for Human Explanations to Collect Label Distributions on NLI
July 15, 2025 at 2:50 PM