Jonathan Berant
@jonathanberant.bsky.social
NLP at Tel Aviv Uni and Google DeepMind
Reposted by Jonathan Berant
With GDM friends Adam Fisch, @jonathanberant.bsky.social, Alekh Agarwal, and special guest Anastasios Angelopoulos.
June 10, 2025 at 3:24 PM
Reposted by Jonathan Berant
We offer cost-optimal policies for selecting which rater should annotate which examples; these policies link the annotation cost, the annotation noise, and the *uncertainty* of the cheaper rater.
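A toy sketch of what such a routing rule could look like (the threshold form, names, and numbers here are illustrative assumptions, not the paper's actual policy): send an example to the expensive rater only when the cheap rater's uncertainty outweighs the cost ratio.

from dataclasses import dataclass

@dataclass
class Example:
    id: int
    cheap_uncertainty: float  # e.g., predictive variance of the cheap rater

def assign_raters(examples, cost_cheap=1.0, cost_expensive=10.0, noise_expensive=0.05):
    # Illustrative rule: pay for the expensive (less noisy) rater only when the
    # cheap rater's uncertainty exceeds a threshold tied to the cost ratio.
    threshold = noise_expensive * cost_expensive / cost_cheap
    return {
        ex.id: "expensive" if ex.cheap_uncertainty > threshold else "cheap"
        for ex in examples
    }

# Example: only the genuinely uncertain example gets the costly annotation.
print(assign_raters([Example(0, 0.02), Example(1, 0.9)]))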
June 10, 2025 at 3:24 PM
Reposted by Jonathan Berant
An ablation reveals the importance of mechanism design: when the helper identities are known to the asker during training (CSP-DeAnon), calibrated hedging is no longer learned.
March 24, 2025 at 3:39 PM
Reposted by Jonathan Berant
In practice, collaborative self-play + reinforced self-training (ReST) lead to improved task performance, better calibration of confidence markers, and more efficient tool use.
March 24, 2025 at 3:39 PM
Reposted by Jonathan Berant
A bit of game theory can help explain when this can work: we model the setup as a game of public utility provision, where the public utility is the extra information provided by the costly retrieval action. The game has a unique equilibrium when the tools are sufficiently distinct (or both bad).
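For intuition only, here is a toy two-helper version of that game (payoffs and numbers are my own illustrative assumptions): each helper chooses whether to pay retrieval cost c, and both share the value of the best information actually retrieved. With one clearly stronger tool, the only pure equilibrium is for that helper alone to retrieve.

import itertools

def payoffs(a1, a2, b1=0.8, b2=0.2, c=0.3):
    # a_i = 1 if helper i retrieves; the shared "public utility" is the value
    # of the best information actually retrieved; retrieving costs c.
    shared = max(b1 * a1, b2 * a2)
    return shared - c * a1, shared - c * a2

def pure_nash_equilibria():
    eqs = []
    for a1, a2 in itertools.product([0, 1], repeat=2):
        u1, u2 = payoffs(a1, a2)
        if u1 >= payoffs(1 - a1, a2)[0] and u2 >= payoffs(a1, 1 - a2)[1]:
            eqs.append((a1, a2))
    return eqs

print(pure_nash_equilibria())  # [(1, 0)]: only the stronger tool is used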
March 24, 2025 at 3:39 PM
Reposted by Jonathan Berant
Because the identity of each helper is hidden from the asker, it is forced to rely on confidence signals when faced with incompatible answers from the helpers. Maximizing effort-penalized accuracy of the full rollout can teach the LLM to use these confidence markers correctly.
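A minimal sketch of what an effort-penalized objective might look like (the penalty form and weight are assumptions for illustration, not the paper's exact reward):

def effort_penalized_reward(is_correct: bool, num_tool_calls: int,
                            tool_cost: float = 0.1) -> float:
    # Reward the full rollout for answering correctly, minus a small charge
    # for every costly retrieval call made along the way.
    return float(is_correct) - tool_cost * num_tool_calls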
March 24, 2025 at 3:39 PM
Reposted by Jonathan Berant
We focus on two capabilities: knowing when to use a costly retrieval tool, and hedging non-confident answers. To teach these capabilities, we create a small multi-agent society, in which two "helpers" can use specialized retrieval tools to pass information back to an "asker".
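Roughly, a rollout in this setup could be sketched as follows (the interfaces and names are hypothetical, just to make the flow concrete):

def collaborative_rollout(question, asker, helpers, tools):
    # Each anonymous helper decides whether to pay for its retrieval tool,
    # then answers with a confidence marker.
    answers = []
    for helper, tool in zip(helpers, tools):
        evidence = tool(question) if helper.wants_to_retrieve(question) else None
        answers.append(helper.answer(question, evidence))
    # The asker never sees which helper is which, so it must weigh the
    # (possibly conflicting) answers by their stated confidence.
    return asker.aggregate(question, answers)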
March 24, 2025 at 3:39 PM
Reposted by Jonathan Berant
I had a lot of fun working on this with Aya Meltzer-Asscher and @jonathanberant.bsky.social.
We will soon release our materials, human results, LLM results, and all the cool images the models produced for our sentences.
arxiv.org/abs/2502.09307
When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models
Modern Large Language Models (LLMs) have shown human-like abilities in many language tasks, sparking interest in comparing LLMs' and humans' language processing. In this paper, we conduct a detailed c...
March 12, 2025 at 7:12 PM
Reposted by Jonathan Berant
One intriguing follow-up: some component of the sentence-understanding cognitive model fails on GP sentences. Is this component also present in LLMs? If not, why are so many LLMs influenced by our manipulations in the same way humans are?
March 12, 2025 at 7:12 PM
Reposted by Jonathan Berant
There are many more cool insights you can find in our paper.
One takeaway from this paper for the psycholinguistics community: run your reading comprehension experiment on an LLM first. You might get a general idea of the human results.
(Last image I swear)
March 12, 2025 at 7:12 PM
Reposted by Jonathan Berant
These experiments replicated the results of the sentence comprehension experiment: our manipulations had the same effect on the correctness of the paraphrases and drawings as they had on the sentence comprehension task.
In this image: While the teacher taught the puppies looked at the board.
March 12, 2025 at 7:12 PM
Reposted by Jonathan Berant
We also ran two additional experiments with LLMs that would be hard to run on humans:
1. We asked the LLMs to paraphrase our sentences
2. We asked text-to-image models to draw the sentences
In this image: While the horse pulled the submarine moved silently.
March 12, 2025 at 7:12 PM
Reposted by Jonathan Berant
To answer our second question, we ran the sentence comprehension experiment from the human study on over 60 LLMs.
We found that LLMs also struggle with GP sentences and that, interestingly, the manipulations we used to test our hypotheses affected the LLMs in the same way they affected humans.
March 12, 2025 at 7:12 PM
Reposted by Jonathan Berant
In our latest paper with Aya Meltzer-Asscher and @jonathanberant.bsky.social, we try to answer both these questions.
We devise and test hypotheses explaining why GP sentences are harder to process: human subjects answered a reading comprehension question about a sentence they read.
March 12, 2025 at 7:12 PM
With RMs trained on the helpfulness dataset, we saw that outputs tend to become a list of bullet points. In TL;DR summarization, outputs have a similar length but more copying (as others have observed). Using entailment as a RM makes outputs shorter.
arxiv.org/abs/2312.09244
Preference data distribution doesn't explain this btw
Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking
Reward models play a key role in aligning language model applications towards human preferences. However, this setup creates an incentive for the language model to exploit errors in the reward model t...
January 14, 2025 at 1:57 AM
Sure, but the take that "we should allow people to publicly make such comments because otherwise we would never know that the problem exists" seems odd to me, and I assume that is not your claim.
December 14, 2024 at 6:54 PM
I haven't either, and I don't know whether such discussions exist or not.
December 14, 2024 at 6:46 PM