Phil Swatton
philswatton.bsky.social
Work as a data scientist at the Alan Turing Institute, background in political science. Views my own and not necessarily shared by my employer.

https://philswatton.github.io/
That makes a lot of sense, thank you!
November 12, 2025 at 2:11 PM
Wilcoxon is an approach to it, but I guess my q is:

- If tests on individual datasets fail to reject the null (can't tell which is better on any given dataset)
- & a single test comparing accuracies across datasets rejects the null (A is better than B result across datasets)

What should I infer?
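For concreteness, a minimal sketch of the across-dataset test with made-up accuracies (not real results), using `scipy.stats.wilcoxon` on paired per-dataset accuracies:

```python
from scipy.stats import wilcoxon

# Hypothetical accuracies for classifiers A and B on 15 test sets
# (illustrative numbers only -- A is slightly better on 13 of 15)
acc_a = [0.81, 0.79, 0.84, 0.77, 0.82, 0.80, 0.78, 0.83,
         0.76, 0.85, 0.79, 0.81, 0.80, 0.82, 0.78]
acc_b = [0.79, 0.77, 0.83, 0.76, 0.80, 0.78, 0.79, 0.81,
         0.77, 0.83, 0.78, 0.80, 0.78, 0.80, 0.77]

# Paired signed-rank test on the per-dataset accuracy differences:
# the test uses only the signs and ranks of the differences, so
# consistent small wins can be significant even when each individual
# difference is well within per-dataset noise.
stat, p = wilcoxon(acc_a, acc_b)
print(f"Wilcoxon statistic={stat:.1f}, p={p:.4f}")
```

The point of the sketch: the signed-rank test pools evidence about the *direction* of the difference across datasets, which is a different question from whether any single dataset's difference clears its own sampling noise.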
November 12, 2025 at 2:03 PM
Paper in the brackets I forgot to link is www.jmlr.org/papers/volum...
November 12, 2025 at 1:42 PM
Do you:

1) interpret the results as being inconclusive on which classifier is better
2) interpret classifier A as being better than B

Obviously there is

3) Find some extra datasets w/ larger test sets, but I'm curious how people would approach the initial problem

(3/3)
November 12, 2025 at 1:40 PM
Classifier A has consistently better accuracy than classifier B on most test sets (say, 13/15). This is significant in a Wilcoxon signed-rank test (approach advocated by ).

However, on most _individual_ points (say, 14/15), the 95% CIs on the accuracy on each dataset overlap. (2/3)
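A quick sketch of why the per-dataset CIs can overlap, with hypothetical numbers: assuming test sets of around 500 points and a normal-approximation (Wald) binomial CI on each accuracy, a 2-point accuracy gap sits well inside the interval widths.

```python
import math

def wald_ci(acc, n, z=1.96):
    """Normal-approximation 95% CI for an accuracy estimated on n test points."""
    se = math.sqrt(acc * (1 - acc) / n)
    return acc - z * se, acc + z * se

n = 500                      # hypothetical test-set size
acc_a, acc_b = 0.82, 0.80    # illustrative accuracies on one dataset

lo_a, hi_a = wald_ci(acc_a, n)
lo_b, hi_b = wald_ci(acc_b, n)
print(f"A: [{lo_a:.3f}, {hi_a:.3f}]  B: [{lo_b:.3f}, {hi_b:.3f}]")
# With n=500 the standard error is ~0.017, so a 0.02 gap
# leaves the two intervals overlapping.
```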
November 12, 2025 at 1:40 PM
Thank you!
November 7, 2025 at 6:51 PM
Hi Ben, I sent an email about the role but fear it may have ended up in your spam
November 7, 2025 at 6:18 PM
And good shout on DKs/would not votes - will try that out later today if I remember!
November 7, 2025 at 5:32 PM
If I squint hard enough, I might interpret it as established party vs not, in that it explains both the gap between Con & Ref on the one hand, and why Green voters are on the same side of the dimension as Ref on the other. But I'm not fully convinced by it - e.g. why are Lab & LD more middling on the dimension?
November 7, 2025 at 5:32 PM
Thank you - the paper looks fascinating, will add to my reading pile, thanks for sharing!
November 7, 2025 at 12:08 PM
Based on our recent discussion, maybe or maybe not of interest to

@mariosrichards.bsky.social
@ralphscott.bsky.social
@jack-bailey.co.uk
@heinzbrandenburg.bsky.social
November 7, 2025 at 11:57 AM
I've made the code for the ANES output available at: gist.github.com/philswatton/...
November 7, 2025 at 11:55 AM
(this is for most but not all post-election feeling thermometers)
November 4, 2025 at 9:49 PM
Here are the equivalent distributions for ANES 2024. They look much spikier (but possibly you still get more out of it: if e.g. people use values ending 0 or 5, that's still 21 meaningful values vs 11 in 0-10 or 7 in 1-7)
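The "21 meaningful values" arithmetic, spelled out: even if respondents heap on multiples of 5, a 0-100 thermometer still offers more distinct response options than a 0-10 or 1-7 scale.

```python
# Count the response options under heaping on multiples of 5
heaped_0_100 = [v for v in range(0, 101) if v % 5 == 0]
points_0_10 = list(range(0, 11))
points_1_7 = list(range(1, 8))
print(len(heaped_0_100), len(points_0_10), len(points_1_7))  # 21 11 7
```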
November 4, 2025 at 9:49 PM
Made this example w/ the 2024 data, will upload a blog post or gist over the next couple of days:
November 4, 2025 at 9:42 PM
That's really interesting - what dataset is this from?
November 4, 2025 at 6:11 PM
Agreed - I'd be interested in seeing comparisons of 1-7, 0-10, and 0-100 disaggregated across ideological self placements vs warmth ratings too
November 4, 2025 at 5:43 PM
1-7 scales are on issues

0-100 are warmth, but you can recover a liberal-conservative dimension from them (I think I have a better plot somewhere):
November 4, 2025 at 5:33 PM
I don't know of any dataset that would enable comparison of different scales for warmth ratings though
November 4, 2025 at 5:31 PM
It's been a while and I don't have my laptop with me to check atm, but while I imagine a lot of respondents do use '50', '25', etc, I think I recall correctly that there was enough variation across the different stimuli being rated
November 4, 2025 at 5:31 PM
I'm not sure actually - I wouldn't use 0-100 for issue scales. When I wish for them it's more about having warmth ratings towards lots of different stimuli. I helped make a presentation to the RSS on the 2020 presidential election w/ them; they were really interesting for nonmetric unfolding:
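As a very rough sketch of how a left-right dimension can fall out of warmth ratings: here's synthetic data and plain PCA, standing in for the nonmetric unfolding actually used (all names and numbers below are made up for illustration, not from the ANES analysis).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 0-100 warmth ratings: 200 respondents with a latent
# left-right position rate 4 stimuli with known positions.
ideology = rng.uniform(-1, 1, size=200)            # latent left-right
stimulus_pos = np.array([-0.8, -0.3, 0.4, 0.9])    # hypothetical stimuli
warmth = 50 + 40 * np.outer(ideology, stimulus_pos)
warmth += rng.normal(0, 5, size=warmth.shape)
warmth = warmth.clip(0, 100)

# First principal component of the centred rating matrix
centred = warmth - warmth.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
scores = centred @ vt[0]

# The recovered component should track the latent ideology
# (up to sign, which PCA leaves arbitrary)
r = np.corrcoef(scores, ideology)[0, 1]
print(f"|correlation with latent ideology| = {abs(r):.2f}")
```

PCA is a cruder tool than unfolding here (it treats ratings as linear in the latent dimension rather than single-peaked), but it shows the basic idea that a spatial dimension is recoverable from thermometer matrices.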
November 4, 2025 at 5:31 PM