I feel like execs of major ML/AI conferences, from ICLR/NeurIPS/ICML/AAAI and ACL/EMNLP to CVPR, should sit together and figure out a whole new strategy moving forward like 👇
If you've changed your name and dealt with updating publications, we want to hear your experience. Any reason counts: transition, marriage, cultural reasons, etc.
forms.cloud.microsoft/e/E0XXBmZdEP
ReSkies much appreciated
"Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps"
by Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, and Yonatan Belinkov
aclanthology.org/2025.emnlp-m...
6/n
"Measuring Chain of Thought Faithfulness by Unlearning Reasoning Steps"
by Martin Tutek, Fateme Hashemi Chaleshtori, Ana Marasovic, and Yonatan Belinkov
aclanthology.org/2025.emnlp-m...
6/n
KSoC: utah.peopleadmin.com/postings/190... (AI broadly)
Education + AI:
- utah.peopleadmin.com/postings/189...
- utah.peopleadmin.com/postings/190...
Computer Vision:
- utah.peopleadmin.com/postings/183...
Starting with the ✨ Best Paper award ✨:
"Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index"
by Hao Xu, Jiacheng Liu, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi
aclanthology.org/2025.emnlp-m...
1/n
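(Context, not from the paper: an FM-index supports exact substring count/locate queries over a compressed representation of a corpus, which is what makes internet-scale exact n-gram search plausible. Below is a toy Python sketch of the same kind of count query using a plain suffix array; the corpus and queries are made up for illustration, and this is not the Infini-gram mini implementation.)

```python
# Toy sketch of exact n-gram counting over a tokenized corpus with a suffix
# array. An FM-index answers the same count queries over a compressed
# (BWT-based) representation; this plain suffix array only illustrates the idea.
# Needs Python 3.10+ for the key= argument to bisect.
import bisect

corpus = "the cat sat on the mat . the cat ran ."  # placeholder corpus
tokens = corpus.split()

# Suffix array: indices of all suffixes, sorted lexicographically by tokens.
suffixes = sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(query: str) -> int:
    """Count exact occurrences of a token n-gram via binary search."""
    q = query.split()
    lo = bisect.bisect_left(suffixes, q, key=lambda i: tokens[i:i + len(q)])
    hi = bisect.bisect_right(suffixes, q, key=lambda i: tokens[i:i + len(q)])
    return hi - lo

print(count_ngram("the cat"))  # -> 2
print(count_ngram("cat sat"))  # -> 1
print(count_ngram("the dog"))  # -> 0
```

An FM-index runs an analogous binary-search-style query (backward search over the Burrows-Wheeler transform), trading some query-time work for a much smaller index.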
This framework and approach to measuring CoT faithfulness have been hugely influential for how I think about reasoning evaluation, and I'm so lucky to have worked with such brilliant collaborators. Huge credit to @mtutek.bsky.social
Huge thanks to my amazing collaborators @fatemehc.bsky.social @anamarasovic.bsky.social @boknilev.bsky.social; this would not have been possible without them!
If you care about CoT faithfulness, you 𝘮𝘶𝘴𝘵 read this paper. It introduces the first method for measuring CoT faithfulness that is not purely behavioral but operates on the model's internals!
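(For readers wondering what "operating on the internals" can look like, here is a toy sketch of the general idea: erase one reasoning step from the model's weights via gradient ascent on its likelihood, then check whether the answer becomes less likely. This is an illustrative sketch only, not the authors' procedure; the model, prompts, and hyperparameters are placeholders.)

```python
# Toy sketch: "unlearn" one chain-of-thought step from the weights, then see
# whether the model's answer becomes less likely. Illustrative only; not the
# paper's procedure. Model, prompts, and hyperparameters are placeholders.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder small model so the sketch runs anywhere
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def answer_logprob(m, question: str, answer: str) -> float:
    """Total log-probability the model assigns to `answer` after `question`."""
    q_ids = tok(question, return_tensors="pt").input_ids
    a_ids = tok(answer, return_tensors="pt").input_ids
    ids = torch.cat([q_ids, a_ids], dim=1)
    with torch.no_grad():
        logits = m(ids).logits
    # Positions q_len-1 .. end-2 predict the answer tokens.
    logprobs = torch.log_softmax(logits[0, q_ids.shape[1] - 1 : -1], dim=-1)
    return logprobs.gather(1, a_ids[0].unsqueeze(1)).sum().item()

def unlearn(m, text: str, lr: float = 5e-4, steps: int = 5):
    """Gradient *ascent* on the NLL of `text` (a crude form of unlearning)."""
    m = copy.deepcopy(m)
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    ids = tok(text, return_tensors="pt").input_ids
    for _ in range(steps):
        loss = m(ids, labels=ids).loss  # NLL of the reasoning step
        (-loss).backward()              # ascend the loss, i.e. forget
        opt.step()
        opt.zero_grad()
    return m

question = "Q: Ann has 3 apples and buys 2 more. How many apples does she have? A:"
answer = " 5"
cot_step = "3 apples plus 2 apples makes 5 apples."

before = answer_logprob(model, question, answer)
after = answer_logprob(unlearn(model, cot_step), question, answer)
# If the answer's log-probability drops a lot, the step's content plausibly
# mattered for the prediction; if it barely moves, the step may be unfaithful.
print(f"log p(answer) before: {before:.3f}  after unlearning: {after:.3f}")
```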
I'll present our parametric CoT faithfulness work (arxiv.org/abs/2502.14829) on Wednesday at the second Interpretability session, 16:30-18:00 local time, room A104-105.
If you're in Suzhou, reach out to talk all things reasoning :)
I'm still so proud of our work (led by @lasha.bsky.social) on CondaQA, so we had to ask what would happen if we tried to create high-quality reasoning-over-text benchmarks now that LLMs are available. Turns out, we'd make an easier benchmark!
📍Findings Session 1 - Hall C
📅 Wed, November 5, 13:00 - 14:00
arxiv.org/abs/2505.22830
In “What Has Been Lost with Synthetic Evaluation”, Ana Marasović (@anamarasovic.bsky.social) and collaborators ask what happens when LLMs start generating the datasets used to test their reasoning. (1/6🧵)
This week on #WiAIRpodcast, we talk with Ana Marasović (@anamarasovic.bsky.social) about her paper “Chain-of-Thought Unfaithfulness as Disguised Accuracy.” (1/6🧵)
📄 Paper: arxiv.org/pdf/2402.14897
New work presented today at the COLM Workshop on Socially Responsible Language Modelling Research, led by Purbid Bambroo in collaboration with @anamarasovic.bsky.social, probing LLM preference test sets for redundancy and inflated scores.
1/8
1️⃣ Purbid's 𝐩𝐨𝐬𝐭𝐞𝐫 at 𝐒𝐨𝐋𝐚𝐑 (𝟏𝟏:𝟏𝟓𝐚𝐦-𝟏:𝟎𝟎𝐩𝐦) on catching redundant preference pairs & how pruning them hurts accuracy; www.anamarasovic.com/publications...
2️⃣ My 𝐭𝐚𝐥𝐤 at 𝐗𝐋𝐋𝐌-𝐑𝐞𝐚𝐬𝐨𝐧-𝐏𝐥𝐚𝐧 (𝟏𝟐𝐩𝐦) on measuring CoT faithfulness by looking at internals, not just behaviorally
1/3
Happy: Well, at least I can mountain bike during the fall break in prime SLC MTB weather.
Sad: Comes down with a cold.
☹️☹️☹️☹️☹️☹️
This time, we sit down with @anamarasovic.bsky.social to unpack some of the toughest questions in AI explainability and trust.
🔗 Watch here → youtu.be/xYb6uokKKOo
@anamarasovic.bsky.social sadly can't make it 😭, but hit me up if you'd like to chat about audio language models, music mixing, or anything else regarding music and audio!