Jamie Cummins
@jamiecummins.bsky.social
Currently a visiting researcher at Uni of Oxford. Normally at Uni of Bern.
Meta-scientist building tools to help other scientists. NLP, simulation, & LLMs.
Creator and developer of RegCheck (https://regcheck.app).
1/4 of @error.reviews.
🇮🇪
My master thesis file name on my old university's thesis archive site still makes me chuckle.
October 30, 2025 at 12:21 PM
Some of the questions used for evaluation explicitly allude to the "5 Bits" structure, but again, this wasn't included in the prompt. If one were to build LLM-based software to try to create Science articles, it would look very different to this.
September 22, 2025 at 10:20 PM
Science writers, as the white paper elucidates, use an article structure called the "5 Bits" (screenshot 1). The prompts given to the LLM (screenshot 2) do not specify this. They do not provide good examples (as in one- or few-shot prompting). They generally do not follow prompting best practices.
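For contrast, here's a purely hypothetical sketch (not the white paper's actual prompt; the article text and paper summaries are placeholders, since the 5 Bits themselves aren't reproduced here) of what a one-shot prompt that names the structure and supplies a worked example could look like:

```python
# Hypothetical one-shot prompt sketch; all content strings are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXAMPLE_ARTICLE = "<a published Science news article that follows the 5 Bits structure>"
NEW_PAPER = "<summary of the paper the model should cover>"

messages = [
    {"role": "system",
     "content": "You are a Science news writer. Structure every article using "
                "the '5 Bits' format described in the style guide."},
    # One-shot example: source material paired with a finished, well-structured article.
    {"role": "user", "content": "Source paper:\n<example paper summary>"},
    {"role": "assistant", "content": EXAMPLE_ARTICLE},
    # The actual task.
    {"role": "user", "content": f"Source paper:\n{NEW_PAPER}"},
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```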
September 22, 2025 at 10:20 PM
The LLM-based samples also varied substantially in their estimates of the between-scale correlation. The blue line shows the point estimate for the correlation in the human data (r = 0.26).
September 18, 2025 at 7:56 AM
The silicon samples varied a lot in terms of how closely they modeled the response distribution of scales in the human data. But they were generally not good.

See the shaded blue area in the two plots? That covers the 95% interval for where bootstrapped human data falls.
September 18, 2025 at 7:56 AM
All of the silicon sample configurations were, at best, only moderately correlated with the human data when it came to preserving the ranking of participants. And many of them were negatively correlated with the human data.
September 18, 2025 at 7:56 AM
I then mapped out some analytic decisions to look at. Because I had neither infinite time nor infinite money, I looked at just four (sketched as a grid below):
(1) The model used;
(2) The temperature OR reasoning effort hyperparameter setting;
(3) The demographic info provided;
(4) The way items were presented to the model.
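To give a sense of how quickly these decisions multiply, here's a rough sketch of the resulting grid (not the preprint's actual pipeline; the model names, demographic fields, and presentation modes are illustrative only):

```python
# Illustrative grid of silicon-sample configurations over the four decisions above.
from itertools import product

models = ["gpt-4o", "gpt-4o-mini"]                            # (1) model used
temperatures = [0.0, 1.0]                                     # (2) temperature / reasoning effort
demographics = ["none", "age + gender", "full profile"]       # (3) demographic info in the persona
presentation = ["one item per call", "whole scale per call"]  # (4) item presentation

configs = list(product(models, temperatures, demographics, presentation))
print(f"{len(configs)} configurations to simulate")  # 2 x 2 x 3 x 2 = 24

for model, temp, demo, pres in configs:
    # For each configuration, simulate N participants by prompting the model
    # with a persona plus the scale items, then score and store the responses.
    pass
```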
September 18, 2025 at 7:56 AM
But here’s the problem: creating a silicon sample isn’t one method. There are so, so many analytic decisions that need to be made when generating these samples. I list some in this table from the preprint, but this is very much nonexhaustive.
September 18, 2025 at 7:56 AM
Waiting for my preprint to be accepted, so in the meantime a teaser: here's what happens when you try to estimate a between-scale correlation based on LLM-generated datasets of participants, while varying 4 different analytic decisions (blue is the true correlation from human data):
September 17, 2025 at 12:53 PM
OpenAI recently released GPT-5, and its smaller derivatives, GPT-5-mini and GPT-5-nano.

On paper, GPT-5-nano is much cheaper than GPT-5-mini: $0.40 per million output tokens for nano vs. $2 per million for mini.

But behind the scenes, nano is secretly costing the user almost as much as mini. 🧵
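To make the arithmetic concrete, a purely hypothetical back-of-the-envelope (only the per-token prices above are real; the token counts are made up for illustration):

```python
# Hypothetical cost comparison; the token counts below are illustrative assumptions.
PRICE_PER_M_OUTPUT = {"gpt-5-mini": 2.00, "gpt-5-nano": 0.40}  # USD per 1M output tokens

def cost(model: str, visible_tokens: int, reasoning_tokens: int) -> float:
    # Hidden reasoning tokens are billed as output tokens too.
    return (visible_tokens + reasoning_tokens) * PRICE_PER_M_OUTPUT[model] / 1_000_000

# Suppose both models return ~300 visible tokens, but nano burns far more
# hidden reasoning tokens to get there (made-up numbers):
print(f"mini: ${cost('gpt-5-mini', 300, 1_000):.4f}")  # ≈ $0.0026 per response
print(f"nano: ${cost('gpt-5-nano', 300, 5_000):.4f}")  # ≈ $0.0021 per response
```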
August 18, 2025 at 9:30 AM
Hello from Seoul!
July 12, 2025 at 4:10 AM
First title is great but has this vibe:
July 7, 2025 at 10:19 PM
Independent of who authored the essay, the model exhibited effects of basically identical magnitude across the three studies.

Effects were not caused by dissonance between authoring a valenced essay and giving a rating; they were due to the general effects that the context window has on output.
July 7, 2025 at 1:39 PM
Secondly, we note that the effects do not require "dissonance" to explain them. We ran three studies which extended the authors' CD paradigm (using the choice condition only).

The key ingredient: we varied the authorship of the generated essay, tagging it as created by either the model or the user.
July 7, 2025 at 1:39 PM
But if we set the weights of " once", " Once", "once", and "Once" to be very low (see image), the most probable next token is "in" (around 40% on average).

In some runs you'll still see "Once" pop up as the most probable, but it comes up wayyyyy less frequently.
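Roughly how that can be done with the OpenAI API (a sketch, not necessarily the exact script behind the plots; the prompt here is a stand-in): logit_bias takes token IDs, so the surface forms get tokenized first.

```python
# Sketch: suppress "once"/"Once" variants via logit_bias, then inspect the top next-token logprobs.
import math
from openai import OpenAI
import tiktoken

client = OpenAI()  # assumes OPENAI_API_KEY is set
enc = tiktoken.encoding_for_model("gpt-4o")

# Map each surface form to its token ID and push its logit strongly down.
# (These short words each encode to a single token, hence [0].)
bias = {str(enc.encode(t)[0]): -100 for t in [" once", " Once", "once", "Once"]}

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Tell me a short fairy tale."}],
    logit_bias=bias,
    logprobs=True,
    top_logprobs=5,   # show the most probable alternatives for the first token
    max_tokens=1,
    temperature=1,
)

for cand in resp.choices[0].logprobs.content[0].top_logprobs:
    print(f"{cand.token!r}: {100 * math.exp(cand.logprob):.1f}%")
```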
July 3, 2025 at 4:39 PM
You can also ask other questions, like: given the input "Once upon a time there was a magical ", what is the probability that the next token is "garden"?

You can find this out! In GPT-4o with temperature = 1, it averages around 0.7%.
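One way to estimate it (a sketch, not necessarily the script I used; the exact prompt framing and tokenization of "garden" vs. " garden" matter): request the top next-token logprobs and average the probability mass on the "garden" candidate over repeated calls.

```python
# Sketch: estimate P(next token = "garden") by averaging top-logprob mass over repeated calls.
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set
PROMPT = "Once upon a time there was a magical "

probs = []
for _ in range(20):
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Continue this text: {PROMPT}"}],
        max_tokens=1,
        temperature=1,
        logprobs=True,
        top_logprobs=20,  # the API returns at most 20 candidates per position
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    probs.append(next((math.exp(c.logprob) for c in top
                       if c.token.strip().lower() == "garden"), 0.0))

print(f"Mean P(garden): {sum(probs) / len(probs):.4f}")
```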
July 3, 2025 at 4:39 PM
Post an unusual sign if you feel like it.

Many years later and I still think about this sign I saw at a Subway.
June 28, 2025 at 12:11 PM
Reading this paper, you'll see the authors have transcripts in their supplementary materials. Open the transcripts in Word, and you're greeted with this: almost 1,500 pages of individually copy-pasted chats.

Because the entire study was done through the chat interface.
June 23, 2025 at 4:31 PM
I am once again asking: if you or someone you know (i) knows the SAS programming language, (ii) has some degree of familiarity with social psychology, and (iii) wants to be paid to review a paper as part of @error.reviews, please get in touch with me!
April 27, 2025 at 9:35 AM
Busy few days in Leipzig/Berlin giving a couple of talks and meeting many great colleagues. And after many years I met @dingdingpeng.the100.ci in person for the first time!
April 2, 2025 at 5:02 AM
The "pub"? Hey fellas, the "pub"! Well, ooh la di da, Mr. French Man.

Well what do you call it?
February 12, 2025 at 9:27 PM
Arrived in Budapest for a quick work trip. What a great winter city!
February 12, 2025 at 3:30 PM
Wrapping up a successful #PSE7 here in Bern!

A personal highlight was getting a pic of the largest in-person gathering to-date of the 100% CI extended universe.

@ianhussey.bsky.social @malte.the100.ci @ruben.the100.ci @annemscheel.bsky.social @taymalsalti.bsky.social @scientificdiscovery.dev
February 7, 2025 at 9:20 PM
I love the OSF, but this part of the UI is hugely flawed and has caused this type of delay for many colleagues I know.

This design feature really falls at the "make it easy" hurdle.
January 20, 2025 at 2:03 PM