Cameron Jones
@camrobjones.bsky.social
Postdoc in the Language and Cognition lab at UC San Diego. I’m interested in persuasion, deception, LLMs, and social intelligence.
There's lots more detail in the paper: arxiv.org/abs/2503.23674. We also release all of the data (including full anonymized transcripts) for further scrutiny and analysis (and to prove this isn't an April Fools' joke).

The paper's under review and any feedback would be very welcome!
Large Language Models Pass the Turing Test
We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations s...
arxiv.org
April 1, 2025 at 3:14 PM
Thanks so much to my co-author Ben Bergen, to Sydney Taylor (a former RA who wrote the persona prompt!), to Open Philanthropy and to 12 donors on Manifund who helped to support this work.
April 1, 2025 at 3:14 PM
One of the most important aspects of the Turing test is that it's not static: it depends on people's assumptions about other humans and technology. We agree with @brianchristian.bsky.social that humans could (and should) come back better next year!
April 1, 2025 at 3:14 PM
More pressingly, I think the results provide more evidence that LLMs could substitute for people in short interactions without anyone being able to tell. This could potentially lead to automation of jobs, improved social engineering attacks, and more general societal disruption.
April 1, 2025 at 3:14 PM
Did LLMs really pass if they needed a prompt? It's a good question. Without any prompt, LLMs would fail for trivial reasons (like admitting to being AI), and they could easily be fine-tuned to behave as they do when prompted. So I do think it's fair to say that LLMs pass.
April 1, 2025 at 3:14 PM
Does this mean LLMs are intelligent? I think that's a very complicated question that's hard to address in a paper (or a tweet). But broadly I think this should be evaluated as one among many other pieces of evidence for the kind of intelligence LLMs display.
April 1, 2025 at 3:14 PM
Turing is quite vague about exactly how the test should be implemented. As such, there are many possible variations (e.g. 2-party, an hour long, or with experts). I think this 3-party, 5-minute version is the most widely accepted "standard" test, but we're planning to explore others in future.
April 1, 2025 at 3:14 PM
So do LLMs pass the Turing test? We think this is pretty strong evidence that they do. People were no better than chance at distinguishing humans from GPT-4.5 and LLaMa (with the persona prompt). And 4.5 was even judged to be human significantly *more* often than actual humans!
April 1, 2025 at 3:14 PM
As in previous work, people focused more on linguistic and socioemotional factors in their strategies & reasons. This might suggest people no longer see "classical" intelligence (e.g. math, knowledge, reasoning) as a good way of discriminating people from machines.
April 1, 2025 at 3:14 PM
We also tried giving a more basic prompt to the models, without detailed instructions on the persona to adopt. Models performed significantly worse in this condition (highlighting the importance of prompting), but were still indistinguishable from humans in the Prolific study.
April 1, 2025 at 3:14 PM
Across 2 studies (on undergrads and Prolific), GPT-4.5 was selected as the human significantly more often than chance (50%). LLaMa was not selected significantly more or less often than the actual humans, suggesting participants couldn't distinguish it from people. The baselines (ELIZA & GPT-4o) were selected significantly less often than chance.
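As a rough illustration of what "significantly more often than chance" means here, this is a minimal sketch of an exact two-sided binomial test against a 50% chance rate. The counts below are hypothetical, purely for illustration — they are not the paper's actual data:

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test: sum the probabilities of all
    outcomes as likely or less likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    return sum(x for x in pmf if x <= observed + 1e-12)

# Hypothetical example (NOT the study's real counts): suppose a model
# were picked as the human in 73 of 100 three-party games.
p_value = binom_two_sided_p(73, 100)
print(p_value < 0.05)  # → True: the 73% win rate differs from chance
```

A result like this is what lets one say a win rate is significantly above (or, for the baselines, below) the 50% chance level rather than noise.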
April 1, 2025 at 3:14 PM
Participants spoke to two "witnesses" at the same time: one human and one AI. Here are some example convos from the study. Can you tell which one is the human? Answers & original interrogator verdicts in the paper...

You can play the game yourself here: turingtest.live
April 1, 2025 at 3:14 PM
In previous work we found GPT-4 was judged to be human ~50% of the time in a 2-party Turing test, where participants speak to *either* a human or a model.

This is probably easier for several reasons. Here we ran a new study with Turing's original 3-party setup.

arxiv.org/abs/2503.23674
April 1, 2025 at 3:14 PM