boydgraber.bsky.social
@boydgraber.bsky.social
I think you mean: Terror has spread. Even in The Loop, hundreds run from gunshots. Traffic has ground to a halt. Unable to do anything to prevent the stampede, hundreds scream from the sidewalks. "Police" stand by, doing nothing.
October 13, 2025 at 2:30 PM
We had come at it more from the position of trying to use as few dev examples as possible (to keep them secret): i.e., use the best items you can, and every model answers the exact same set. But it makes sense to use the adaptive testing scenario if you don't mind potentially exposing more of the dev set.
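(Rough sketch of the two setups I'm contrasting, assuming a 2PL IRT model has already been fit; the arrays a and b of per-item discriminations and difficulties, and the function names, are just illustrative, not anything from either paper.)

import numpy as np

def p_correct(theta, a, b):
    # 2PL item response model: P(correct | ability theta, item params a, b)
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    # Fisher information an item carries about ability theta
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def fixed_best_items(a, b, k, theta_grid=np.linspace(-3, 3, 25)):
    # "Use the best items": pick the k items with the highest average
    # information over a range of abilities; every model answers the same k.
    avg_info = [item_information(theta_grid, ai, bi).mean() for ai, bi in zip(a, b)]
    return np.argsort(avg_info)[::-1][:k]

def adaptive_next_item(theta_hat, a, b, already_asked):
    # Adaptive testing: pick the not-yet-asked item that is most informative
    # at the current ability estimate; across models this can expose more of the pool.
    info = item_information(theta_hat, np.asarray(a), np.asarray(b))
    info[list(already_asked)] = -np.inf
    return int(np.argmax(info))

The fixed version caps the exposed dev items at k; the adaptive one can eventually touch much more of the pool, which is the trade-off above.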
September 18, 2025 at 8:23 PM
Link to paper since I ran out of room:

users.umiacs.umd.edu/~ying/docs/2...
September 18, 2025 at 8:19 PM
In 2021, we proposed using IRT (item response theory) to find bad examples and to create more targeted leaderboards (Evaluation Examples Are Not Equally Informative: How Should That Change NLP Leaderboards?).

From my reading, the big difference seems to be that they're also using the agent's skill, which is super cool!
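(A one-screen caricature of the idea, not the paper's actual fitting procedure: treat each model's standardized total score as its skill, fit a per-item logistic regression of correctness on skill, and flag items whose slope, the stand-in for discrimination, is near zero or negative. Function and threshold names are made up for illustration.)

import numpy as np
from sklearn.linear_model import LogisticRegression

def flag_uninformative_items(responses, min_slope=0.2):
    # responses: binary matrix (n_models x n_items), 1 = model answered correctly.
    # Skill proxy: each model's standardized total score (a crude stand-in for
    # the IRT ability parameter theta).
    scores = responses.mean(axis=1)
    theta = (scores - scores.mean()) / (scores.std() + 1e-9)
    flagged = []
    for j in range(responses.shape[1]):
        y = responses[:, j]
        if y.min() == y.max():
            # Everyone right or everyone wrong: zero information about skill.
            flagged.append(j)
            continue
        # Per-item logistic fit of correctness on skill; the slope plays the
        # role of the item's discrimination.
        slope = LogisticRegression().fit(theta.reshape(-1, 1), y).coef_[0, 0]
        if slope < min_slope:
            flagged.append(j)
    return flagged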
September 18, 2025 at 8:19 PM
We also found that it's helpful for improving uncertainty estimation of models:

arxiv.org/abs/2205.12507
September 18, 2025 at 8:13 PM
If it said that 1990 was "about 10 years ago", I would say that it has reached tenured faculty-level intelligence.
September 2, 2025 at 4:09 PM
A couple of weeks ago I left my family behind at a cable car station to finish climbing to the peak of a mountain because they were too scared to continue. When I reached the top, my phone gave a notification: new podcasts available for download. Apparently LMU has an observatory on Wendelstein.
August 21, 2025 at 2:45 PM
Do you mean salary, physical facilities, work environment, or funding ecosystem?
August 4, 2025 at 2:17 PM
At the risk of picking out one of my favorite children, this was the paper with our best traditional video of this cycle (thanks to Jon May for playing along):

t.co/QQlgwzo6jf
t.co/2G6kwAAPMy
https://youtu.be/L_hcHQep3fc
July 28, 2025 at 8:35 AM
In the second oral paper (14:22, Room 1.62),
@yysung.bsky.social is presenting: GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

x.com/YooYeonSung1...

(Short version: quiz bowl, a dumb trivia game, shows humans' calibration > LLMs'.)
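(If you want a concrete reading of "calibration" here, this is a generic expected-calibration-error sketch, not necessarily the metric GRACE reports; the variable names in the usage note are placeholders.)

import numpy as np

def expected_calibration_error(confidence, correct, n_bins=10):
    # ECE: bin answers by stated confidence, then average |accuracy - confidence|
    # per bin, weighted by bin size. Lower = better calibrated.
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_idx = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap
    return ece

# Usage idea: same questions, two sets of answers + confidences (placeholder arrays),
# then compare expected_calibration_error(human_conf, human_correct)
# with expected_calibration_error(llm_conf, llm_correct).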
Yoo Yeon Sung@ACL2025 on X: "I’ll be presenting this work in Room 1.62 today! If you're curious about how calibration errors in LLMs can be measured through human calibration, come find me and @enfleisig! 📍Oral Session 3 - HC: Human-centered NLP 📅 Monday, July 28 @ 2PM"
July 28, 2025 at 8:35 AM
https://youtu.be/wuEIeydhamA
July 28, 2025 at 8:35 AM
Which makes this:
users.umiacs.umd.edu/~ying/docs/n...

"The Hobbit"
July 14, 2025 at 2:17 PM
And you can sign up for the online mirror (June 21, 12:00 EST) here:
docs.google.com/forms/d/e/1F...

[Signup deadline: June 18 Anywhere on Earth]
2025 QANTA Player Signup
Sign up for the human competition for our 2025 QANTA event. More information: https://sites.google.com/view/qanta/2025-competition/2025-human-teams
June 17, 2025 at 3:35 PM
Sara’s Crias went 4-2 to win the tournament (and $150). Noah Sheidlower’s music packet was the most difficult for computers, and Jame Carlson’s Spatial Reasoning was the fan favorite. We’ll announce writer and computer prizes after our online mirror. (And also post the packets.)
June 17, 2025 at 3:35 PM