Guilherme Almeida
@almeida2808.bsky.social
That’s a great point! The study also included legal rules, like one prohibiting shooting at deer. The design doesn’t allow us to look at the two sets of rules separately, but that would be a good direction for future research.
April 26, 2025 at 9:29 AM
But the usual caveats still apply. For instance, even with temperature calibration, models still showed diminished diversity of thought compared to humans.

Comments are more than welcome!
March 11, 2025 at 9:23 PM
Overall, this suggests that the models are doing something more than mere memorization and that we could potentially learn about the likely reactions humans would have to novel stimuli by looking at LLM responses. 13/14
March 11, 2025 at 9:23 PM
Interestingly, LLMs diverge here. GPT-4o and Llama 3.2 90b were not affected by the time pressure manipulation, but Claude 3 and Gemini Pro were. Moreover, the latter were similar to humans in that they relied more on text under forced delay than under time pressure. 12/14
March 11, 2025 at 9:23 PM
The cool thing, of course, is that you can't really put LLMs under time pressure or forced delay (at least not with the public APIs). Thus, we're just either telling the model that it should respond within 4 seconds or that it must wait at least 15 seconds. 11/14
March 11, 2025 at 9:23 PM
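For concreteness, the manipulation lives entirely in the prompt text. A minimal sketch of what that could look like (the instruction wording and vignette below are placeholders, not the actual materials from the study):

```python
# Illustrative sketch of a prompt-level time manipulation; the wording and
# vignette are placeholders, not the study's actual materials.

VIGNETTE = (
    "A rule at the park entrance says: 'No vehicles in the park.' ... "
    "Did the person violate the rule? Answer on a 1-7 scale."
)

TIME_PRESSURE = "You must respond within 4 seconds. Give your immediate first impression."
FORCED_DELAY = "You must wait at least 15 seconds and reflect carefully before answering."

def build_prompt(condition: str) -> str:
    """Prepend the (purely verbal) time instruction to the vignette."""
    instruction = TIME_PRESSURE if condition == "pressure" else FORCED_DELAY
    return f"{instruction}\n\n{VIGNETTE}"

print(build_prompt("pressure"))
```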
That depends at least in part on whether you think that the time pressure manipulation is inducing a bias or not. But you could argue either that competent concept application requires sensitivity to time constraints, or that time constraints elicit bias by restricting processing. 10/14
March 11, 2025 at 9:23 PM
For Study 2, we decided to try something different. Among humans, we know that time pressure leads to more purposivism and a forced delay leads to more textualism. This could be read as either a context-sensitive feature of the concept of rule or as a bias. What would competent LLMs do? 9/14
March 11, 2025 at 9:23 PM
Even more surprisingly, the same thing was true for all the models we tested: all of them were less textualist on the new stimuli when compared to the old stimuli. We interpret this to be evidence of conceptual mastery. Even subtle differences between stimuli are tracked by current LLMs. 8/14
March 11, 2025 at 9:23 PM
The human data was surprising in that it revealed a significant difference between old and new vignettes. We didn't expect any difference, but participants relied on the text to a lesser extent for the new vignettes than for the old ones. 7/14
March 11, 2025 at 9:23 PM
To deal with (2), we first collected new data from humans. We then computed the standard deviation in each cell of our 2 (text) x 2 (purpose) x 4 (scenario) x 2 (new vs. old) design (total: 32 cells) and, for each model, selected the temperature that minimized the mean squared error between the model's and the humans' cell SDs. 6/14
March 11, 2025 at 9:23 PM
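A minimal sketch of that calibration step, assuming you already have the 32 human cell SDs and a (hypothetical) helper that returns repeated model ratings per cell at a given temperature:

```python
import numpy as np

def calibrate_temperature(human_sds, sample_model_ratings, temperatures):
    """Pick the temperature whose per-cell SDs best match the human SDs.

    human_sds: array of shape (32,), one SD per cell of the
        2 (text) x 2 (purpose) x 4 (scenario) x 2 (new/old) design.
    sample_model_ratings: callable(temp) -> array of shape (32, n_runs)
        with repeated model ratings per cell (hypothetical helper that
        wraps the API calls).
    temperatures: iterable of candidate temperature values.
    """
    best_temp, best_mse = None, np.inf
    for temp in temperatures:
        model_sds = sample_model_ratings(temp).std(axis=1, ddof=1)
        mse = np.mean((model_sds - np.asarray(human_sds)) ** 2)
        if mse < best_mse:
            best_temp, best_mse = temp, mse
    return best_temp, best_mse

# e.g. calibrate_temperature(human_sds, sampler, np.arange(0.0, 2.01, 0.1))
```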
To address issue (1), we created new vignettes designed to match those in an earlier paper (doi.org/10.1037/lhb0...), changing only the exact words used. If models are just memorizing, they wouldn't be able to generalize to the new stimuli (although that's debatable). 5/14
March 11, 2025 at 9:23 PM
Temperature is a sampling parameter controlling how strongly a model favors its single best answer. Previous research sometimes set temperature to 0, driving models to nearly deterministic results, while other studies varied it in somewhat arbitrary ways. We think there is a better way to do this! 4/14
March 11, 2025 at 9:23 PM
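As a concrete reference point, this is roughly where that parameter enters a chat-completion request (OpenAI's Python SDK is used here as one example; the prompt is a placeholder, and other providers expose an analogous setting):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "A rule says: 'No vehicles in the park.' ... Did the person violate the rule?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=1.0,  # higher values spread probability over more answers; 0 is near-deterministic
)
print(response.choices[0].message.content)
```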
2) Even when the significance patterns are similar, LLMs tend to show diminished diversity of thought (see arxiv.org/abs/2302.07267), that is, different runs of the same model show much less variance in response to a fixed stimulus than a human sample does. But LLM APIs allow us to adjust that. 3/14
March 11, 2025 at 9:23 PM
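A toy illustration of that variance comparison, with entirely made-up numbers just to show what "less variance" means in practice:

```python
import statistics

# Made-up ratings for one fixed vignette, purely to illustrate the point:
human_ratings = [1, 4, 6, 2, 7, 5, 3, 6]   # spread across a human sample
model_ratings = [4, 4, 5, 4, 4, 5, 4, 4]   # repeated runs of one model

print(statistics.stdev(human_ratings))  # larger SD: more diversity of thought
print(statistics.stdev(model_ratings))  # much smaller SD: collapsed variance
```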
Previous work has shown that LLMs respond to stimuli in roughly the same way as humans. Usually, those papers compare the responses generated by LLMs with previously published human results. The issue is that LLMs could achieve this result through memorization. So, we need new stimuli. 2/14
March 11, 2025 at 9:23 PM