Elias Stengel-Eskin
@esteng.bsky.social
Postdoc @UNC working on NLP, AI, and computational linguistics. Formerly PhD student @JHU and undergrad @McGill
esteng.github.io
Thank you @mohitbansal.bsky.social -- I have learned so much from your mentorship (and benefitted greatly from your job market guidance), and consider myself extremely fortunate to have found such a fantastic lab and postdoc advisor!
May 5, 2025 at 11:27 PM
Thanks @kmahowald.bsky.social, looking forward to collaborating!
May 5, 2025 at 9:49 PM
And of course thank you to the amazing students/collaborators from @unccs.bsky.social and @jhuclsp.bsky.social 🙏
May 5, 2025 at 8:28 PM
A huge shoutout to my mentors who have supported and shaped my research! Esp. grateful to my postdoc advisor @mohitbansal.bsky.social for helping me grow along the whole spectrum of PI skills, and my PhD advisor @vandurme.bsky.social for shaping my trajectory as a researcher
May 5, 2025 at 8:28 PM
Looking forward to continuing to develop AI agents that interact/communicate with people, each other, and the multimodal world. I’ll be recruiting PhD students for Fall 2026 across a range of connected topics (details: esteng.github.io) and plan on recruiting interns for Fall 2025 as well.
May 5, 2025 at 8:28 PM
Links:
1⃣ arxiv.org/abs/2410.14596
2⃣ arxiv.org/abs/2503.15272
3⃣ arxiv.org/abs/2409.07394
With awesome collaborators @mohitbansal.bsky.social, @peterbhase.bsky.social, David Wan, @cyjustinchen.bsky.social, Han Wang, @archiki.bsky.social
Teaching Models to Balance Resisting and Accepting Persuasion
Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT allows us to use data generated from dialogues between smaller 7-8B models for training much larger 70B models. Moreover, PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates across two domains (trivia and commonsense QA). We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.
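For intuition, here is a minimal sketch (my own, not the released PBT code) of how preference pairs might be derived from such dialogue trees: prefer accepting persuasion when the persuading agent is correct, and resisting it when it is not. Names like DialogueTurn and build_preference_pairs are hypothetical.

```python
# Minimal sketch, not the paper's implementation; names are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DialogueTurn:
    prompt: str               # question plus the partner's persuasive message
    accept_response: str      # model switches to the partner's answer
    resist_response: str      # model keeps its original answer
    partner_is_correct: bool  # whether the persuading agent's answer is right

def build_preference_pairs(turns: List[DialogueTurn]) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples that prefer accepting
    positive persuasion and resisting negative persuasion."""
    pairs = []
    for t in turns:
        if t.partner_is_correct:
            pairs.append((t.prompt, t.accept_response, t.resist_response))
        else:
            pairs.append((t.prompt, t.resist_response, t.accept_response))
    return pairs

# These triples could then be fed to a preference-optimization trainer
# (e.g., a DPO-style objective) to fine-tune the target model.
```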
April 29, 2025 at 5:52 PM
📆 04/30 2PM: Teaching Models to Balance Resisting and Accepting Persuasion
📆 05/01 2PM: MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration
📆 05/02 11AM: AdaCAD: Adaptively Decoding to Balance Conflicts between Contextual and Parametric Knowledge
April 29, 2025 at 5:52 PM
Kudos to Atin Pothiraj for leading this project, with @jmincho.bsky.social and @mohitbansal.bsky.social
Code: github.com/atinpothiraj...
@hf.co Dataset: huggingface.co/datasets/ati...
Paper: arxiv.org/abs/2504.15485
April 24, 2025 at 3:14 PM
By testing VLMs’ spatial reasoning under occlusion, CAPTURe highlights an unexpected weakness. We analyze this weakness by providing the model with additional information:
➡️ Providing object coordinates as text improves performance substantially (a rough prompt sketch follows this list).
➡️ Providing diffusion-based inpainting also helps.
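As a rough illustration of the coordinates-as-text probe, here is a hedged sketch of how visible object positions might be serialized into the prompt; the function name, prompt wording, and coordinate format are my assumptions, not the paper's exact setup.

```python
# Hedged sketch of a "coordinates as text" prompt; wording and format are assumptions.
def coords_to_prompt(visible_coords: list[tuple[float, float]], occluder_desc: str) -> str:
    coord_lines = "\n".join(f"- object at (x={x:.2f}, y={y:.2f})" for x, y in visible_coords)
    return (
        "Objects are arranged in a regular pattern, partly hidden by "
        f"{occluder_desc}.\n"
        "Visible object centers (normalized image coordinates):\n"
        f"{coord_lines}\n"
        "Count the total number of objects, including those hidden behind the occluder."
    )

print(coords_to_prompt([(0.1, 0.2), (0.3, 0.2), (0.5, 0.2)], "a gray box"))
```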
April 24, 2025 at 3:14 PM
Interestingly, model error increases with the number of occluded dots, suggesting that task performance is correlated with the level of occlusion.
Additionally, model performance depends on pattern type (the shape in which the objects are arranged).
April 24, 2025 at 3:14 PM
We evaluate 4 strong VLMs (GPT-4o, InternVL2, Molmo, and Qwen2VL) on CAPTURe.
Models generally struggle with multiple aspects of the task (both occluded and unoccluded).
Crucially, every model performs worse in the occluded setting, but we find that humans can perform the task easily even with occlusion.
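For concreteness, here is a toy sketch of one way to compare the two settings, using mean absolute percentage error between predicted and gold counts; the metric choice and the numbers are illustrative, not results or code from the paper.

```python
# Toy comparison of occluded vs. unoccluded counting error; the inputs are made
# up and the metric (MAPE) is my assumption, not necessarily the paper's.
from statistics import mean

def mape(preds: list[int], golds: list[int]) -> float:
    return 100 * mean(abs(p - g) / g for p, g in zip(preds, golds))

occluded_preds, occluded_gold = [18, 22, 9], [20, 25, 12]
unoccluded_preds, unoccluded_gold = [20, 24, 11], [20, 25, 12]
print(f"occluded error:   {mape(occluded_preds, occluded_gold):.1f}%")
print(f"unoccluded error: {mape(unoccluded_preds, unoccluded_gold):.1f}%")
```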
April 24, 2025 at 3:14 PM
We release 2 splits:
➡️ CAPTURe-real contains real-world images and tests the ability of models to perform amodal counting in naturalistic contexts.
➡️ CAPTURe-synthetic allows us to analyze specific factors by controlling different variables like color, shape, and number of objects (an illustrative config sketch follows below).
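To illustrate the kind of control the synthetic split offers, here is a hypothetical per-example configuration; the field names are illustrative and do not reflect the released dataset's schema.

```python
# Hypothetical configuration for controlled synthetic generation;
# field names are illustrative, not the actual dataset schema.
from dataclasses import dataclass

@dataclass
class SyntheticExample:
    pattern_shape: str     # shape the objects are arranged in, e.g. "grid" or "circle"
    object_color: str      # e.g. "red"
    object_shape: str      # e.g. "dot" or "star"
    total_objects: int     # ground-truth count implied by the full pattern
    occluded_objects: int  # how many of those the occluder hides

example = SyntheticExample("grid", "red", "dot", total_objects=24, occluded_objects=6)
print(example)
```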
April 24, 2025 at 3:14 PM
CAPTURe = Counting Amodally Through Unseen Regions, which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object that blocks parts of the scene).
This needs pattern recognition + counting, making it a good testbed for VLMs!
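As a toy illustration of what amodal counting asks for (my own example, not benchmark code), consider objects on a regular grid where an occluder hides some cells: the total must be inferred from the pattern rather than read off the visible objects.

```python
# Toy amodal-counting example; not from the CAPTURe codebase.
def amodal_count(rows: int, cols: int, occluded_cells: set) -> dict:
    """Objects fill a rows x cols grid; cells in occluded_cells are hidden.
    A model sees only the visible objects but must report the pattern's total."""
    visible = rows * cols - len(occluded_cells)
    return {"visible": visible, "occluded": len(occluded_cells), "total": rows * cols}

# A 4x5 grid of objects with a box covering 6 of them:
print(amodal_count(4, 5, {(1, 2), (1, 3), (2, 2), (2, 3), (3, 2), (3, 3)}))
# -> {'visible': 14, 'occluded': 6, 'total': 20}
```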
April 24, 2025 at 3:14 PM