Richard D. Morey
@richarddmorey.bsky.social
Statistics, cognitive modelling, and other sundry things. Mastodon: @richarddmorey@tech.lgbt
[I deleted my twitter account]
my knowledge of Hume has unexpectedly come in handy for my fire safety training test
October 31, 2025 at 10:25 AM
Also - the contrast b/w the response when I advocate teaching R instead of SPSS -- "No hurry, let's not rush into it" (still waiting) -- & the response from others re: use of LLMs -- "It's inevitable, we're behind; need to implement it ASAP!" -- is telling. Learning to code is freeing. Overhyped LLMs create dependency.
October 4, 2025 at 9:01 AM
For me this is a hard red line in psychological science. If you advocate the use of "silicon samples" you do not understand what it is we're supposed to be doing (and likely don't understand LLMs, or are a grifter). Luckily I haven't seen much of this among people I'd consider my peer group.
October 4, 2025 at 8:27 AM
It is this bit in particular that they infer from a small number of simulations. This is basically like saying "the problem isn't variance, it is the presence of big and small observations". But heterogeneity is continuous. If you create a "population" of noncentrality parameters 50% just bigger than >
September 25, 2025 at 1:10 PM
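The truncated point above lends itself to a quick illustration. A minimal R sketch (mine, not from the thread), assuming a hypothetical normal population of noncentrality parameters whose spread we vary: the heterogeneity of the resulting test statistics grows smoothly with that spread, with no natural cut between "homogeneous" and "heterogeneous".

```r
# Sketch (not from the thread): heterogeneity of noncentrality
# parameters is continuous, not a binary property. Vary the spread of
# a population of noncentrality parameters and look at the variance of
# the resulting z statistics; it grows smoothly as 1 + s^2.
set.seed(1)
spreads <- c(0, 0.5, 1, 2)                # SD of the noncentrality population
sapply(spreads, function(s) {
  ncp <- rnorm(1e5, mean = 2, sd = s)     # population of noncentrality pars
  z   <- rnorm(1e5, mean = ncp, sd = 1)   # one z statistic per study
  var(z)                                  # ~ 1 + s^2, increasing smoothly
})
```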
as though he knows how they should affect the test; he chooses a test that gives him what he wants. This, to us (S style), is backward. Back to tests in general: how do you tell what tests are good and what tests are *not good*, where you'd want to look for alternatives (or fix them)? (15/x)
September 25, 2025 at 10:07 AM
of the test, like we did with the P-curve. The W-test is shown in the attached figure. It is a Z test, but modified so that 80% of the alpha in the rejection region has been taken from the tails and moved to a region around Z=0. "Hey, this is bad," you argue to the inventor of the W-test. They respond: (11/x)
September 25, 2025 at 10:07 AM
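A minimal R sketch of how such a test could be built (a reconstruction under stated assumptions, not the thread's code or figure): alpha = .05, with 80% of it spent on a symmetric interval around Z = 0 and the remaining 20% left in the two tails. The names c_mid, c_tail, and w_reject are invented here.

```r
# Reconstructed W-style test: reject when Z lands in a small central
# band around 0 (carrying .04 of the alpha) or in the far tails
# (carrying the remaining .01).
alpha  <- 0.05
c_mid  <- qnorm(0.5 + 0.8 * alpha / 2)   # central cutoff: reject if |Z| < c_mid
c_tail <- qnorm(1 - 0.2 * alpha / 2)     # tail cutoff:    reject if |Z| > c_tail

w_reject <- function(z) abs(z) < c_mid | abs(z) > c_tail

# Type I error check: under H0, Z ~ N(0, 1)
set.seed(1)
mean(w_reject(rnorm(1e6)))   # ~ 0.05, exactly as advertised
```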
"NO" they say. "The power is fine, it's just sub-optimal!" (note that this is a *smaller* power penalty than some of the ones we showed in our P-curve paper). (8/x)
September 25, 2025 at 10:07 AM
"NO" they say. "The power is fine, it's just sub-optimal!" (note that this is a *smaller* power penalty than some of the ones we showed in our P-curve paper). (8/x)
But you happen to know that for this scenario, the Z test is the most powerful test. You decide to compare the W-test to the Z test. The blue (upper) curve shows the Z test power relative to the W-test power (lower; the true power is the solid green line). "Huh," you think; "let's get to the bottom of this." (7/x)
September 25, 2025 at 10:07 AM
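Under the same reconstruction as above, both power curves can be computed exactly rather than read off the (missing) figure; w_power and z_power are hypothetical names.

```r
# Power of the reconstructed W-test vs. the ordinary two-sided Z test
# at alpha = .05; delta is the true mean of the Z statistic.
alpha  <- 0.05
c_mid  <- qnorm(0.5 + 0.8 * alpha / 2)   # central W-test cutoff
c_tail <- qnorm(1 - 0.2 * alpha / 2)     # tail W-test cutoff

w_power <- function(delta) {
  (pnorm(c_mid - delta) - pnorm(-c_mid - delta)) +       # central band
    pnorm(-c_tail - delta) + 1 - pnorm(c_tail - delta)   # two tails
}
z_power <- function(delta) {
  zc <- qnorm(1 - alpha / 2)
  pnorm(-zc - delta) + 1 - pnorm(zc - delta)
}

delta <- seq(0, 4, by = 0.1)
plot(delta, z_power(delta), type = "l", col = "blue", ylim = c(0, 1),
     xlab = "delta", ylab = "power")   # Z test: the upper curve
lines(delta, w_power(delta))           # W-test: hovers near alpha, then lags
```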
via simulation, that it works. First you test the Type I error rate using 100 simulations. Seems close to 0.05. So far, so good. Then you try changing the effect size and look at the power. It seems to increase smoothly. The plot below shows the resulting simulated power curve. Seems ok too! (6/x)
September 25, 2025 at 10:07 AM
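The catch in "seems close to 0.05" is Monte Carlo error. A short sketch using the W-test reconstructed above: with only 100 null simulations, the standard error of the estimated Type I error rate is about 0.022, so a wide range of estimates would look fine.

```r
# Why 100 simulations "seems close to 0.05" proves little: the Monte
# Carlo standard error with n = 100 is sqrt(.05 * .95 / 100) ~ 0.022,
# so estimates anywhere from roughly 0.01 to 0.09 look "close to 0.05".
set.seed(2)
alpha  <- 0.05
c_mid  <- qnorm(0.5 + 0.8 * alpha / 2)
c_tail <- qnorm(1 - 0.2 * alpha / 2)
w_reject <- function(z) abs(z) < c_mid | abs(z) > c_tail

z0 <- rnorm(100)                    # 100 null simulations, as in the post
mean(w_reject(z0))                  # point estimate of the Type I rate
sqrt(alpha * (1 - alpha) / 100)     # ~ 0.022: large MC uncertainty
```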
FWIW, both Clint and I discussed some points in the paper w/Uri (can't tag him b/c he blocked me) in private some time ago. To be clear, neither we nor he is entitled to "first comment" on anything. But if you say you have a policy you're willing to ignore when you feel like it, you don't have that policy.>
September 23, 2025 at 4:45 PM
Where @candicemorey.bsky.social and I are today
August 31, 2025 at 2:40 PM
…and, uh, also Lammy is spending family time with fascists? We think the lack of a fishing license is the issue here?
August 13, 2025 at 6:55 PM
I can’t recall ever seeing “unsure” and “don’t know” as options in the same response scale…
August 13, 2025 at 12:06 PM
No, I didn't sign out, I signed in!
August 13, 2025 at 9:11 AM
Yes, it is indeed a VERY weird suggestion. It is also literally the FIRST paragraph of the interpretation of their test in their app. You don't have to defend it. We critique it because they use it, and because others use it. Did you think we just made it up?
August 10, 2025 at 9:15 PM
This is just untrue. The compound rule was the *definition* of their 2015 evidential value test. They used it in their 2017 paper vs Cuddy. Others use it; I've looked at hundreds of papers. I've added screenshots of literally the first two random papers I pulled from my database.
August 10, 2025 at 4:35 PM
That makes sense to me, since the sum of squares also matters for the t test (e.g. I could blow up the SD by increasing one value). In our case we're combining the test statistics themselves, and we follow Birnbaum (1954) (in the image below, u_i is the p value corresponding to the ith test statistic).
August 10, 2025 at 8:03 AM
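The formula image isn't reproduced here. As one concrete, standard member of the Birnbaum (1954) family of combination rules over the u_i (not necessarily the exact rule in the missing image), Fisher's method in R:

```r
# Fisher's combination of k p values u_i: under the joint null,
# -2 * sum(log(u_i)) is chi-square with 2k degrees of freedom.
# (Illustrative only; the thread's actual formula is in the missing image.)
fisher_combine <- function(u) {
  stat <- -2 * sum(log(u))
  pchisq(stat, df = 2 * length(u), lower.tail = FALSE)
}
fisher_combine(c(0.01, 0.04, 0.20))   # combined p value for three tests
```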
but, counterpoint...I discovered papers like this, which are wonderful and nearly impossible to read; unless you're deep in that literature, the proofs take months to digest ("how did he get from step 2 to step 3...?!"). That bit makes me happy.
August 9, 2025 at 9:11 AM
There's much, much more in the paper. Here are our conclusions: we suggest not using the p curve at all. Most of the tests are fatally flawed. The original 2014 "evidential value" test is ok, but still not much is known theoretically about combining test statistics across families. 14/?
August 8, 2025 at 6:56 PM
It should go without saying, but a significance test should never go from significant to nonsignificant with increasing evidence. But theirs does; in fact, several times! This makes no sense for a test, but it is a result of the fact that...12/?
August 8, 2025 at 6:56 PM
Imagine a set of studies with (significant) chi-square(1) test statistics. We'll fix all the values but 2, then systematically increase 2 of them 5 times (for 6 sets of results). Then we'll look at when their test is significant ("the results show evidential value") 11/?
August 8, 2025 at 6:56 PM
Take the 2014 "lack of evidential value" test (which we call "LEV"). The transformation is given below. If this is significant (large values of the summed test statistic), then we are supposed to infer a "lack of evidential value" or a "flat(ish) p value distribution". But look carefully. 4/?
August 8, 2025 at 6:56 PM
These tests are distinguished by their transformations. The graphs below show the transformations from significant p values to chi-squared distributed values (given a null) in their 2014 paper. These transformed values are then summed and compared to an appropriate null distribution, from which an inference is drawn. 3/?
August 8, 2025 at 6:56 PM
The P-curve works by transforming significant p values in such a way that if the p values were uniform, the transformed values would have a distribution that can easily be summed (in a 2014 paper, chi-square; in a 2015 paper, normals). They develop 5 P-curve tests in total, for different purposes. 2/?
August 8, 2025 at 6:56 PM
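A minimal sketch of the chi-square (2014) version of that transformation, assuming the usual "pp value" construction from the p-curve literature: for a significant p value, pp = p/.05 is uniform under a flat p curve, so -2 * log(pp) is chi-square with 2 df and the sum over k studies is chi-square with 2k df. The name pcurve_chisq is invented here.

```r
# Chi-square p-curve construction (sketch under the assumptions above):
# keep only significant p values, rescale to conditional "pp values",
# apply Fisher's transform, and sum.
pcurve_chisq <- function(p, alpha = 0.05) {
  p  <- p[p < alpha]                  # only significant results enter
  pp <- p / alpha                     # uniform on (0, 1) if the curve is flat
  stat <- sum(-2 * log(pp))           # chi-square with 2 * length(pp) df
  pchisq(stat, df = 2 * length(pp), lower.tail = FALSE)
}
pcurve_chisq(c(0.001, 0.01, 0.02, 0.049))   # combined right-skew test
```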