Lightnews — Scholar-powered news

Xan Gregg

@xangregg.bsky.social

I'm not sure there is a good use of this beeswarm clamping treatment, but it's surely overdone here. Looking deeper, there are a large number of zeros which stress any distribution view. I tried a few in a #dataviz blog post. rawdatastudies.com/2025/11/22/b...

Six beeswarm charts, all with the same width

Six box plots with very truncated lower parts (only one has a lower whisker)

Six packed density dot plots for non-zero response values, and six square grids of dots for response values at zero.

November 23, 2025 at 3:09 PM

Xan Gregg

@xangregg.bsky.social

Trying the same thing with JMP 19's constrained p-splines. Nicely similar to the R GAMs, but JMP's CIs are bootstrapped instead of analytical, which is why they're not symmetric. #dataviz

Scatter plot with two smoothers for a 32-row cars data set. X is HP and Y is displacement. One smoother is constrained to be monotonic and flattens out at the end where the other one dips down.

November 13, 2025 at 1:30 PM

Xan Gregg

@xangregg.bsky.social

My "discovery" was discussed 30 years earlier in a research note by Martin Mächler (see lowess.ps in his unpublished manuscripts folder) people.math.ethz.ch/~maechler/

screenshot of research note with first paragraph. Text reads:

Robustifying a Local Nonparametric Regression Estimator
Martin Maechler
May 1989
April 1992
This technical note is basically section 5.5 from my PhD thesis (Machler). The
goal is to make the following findings more widely available. I assume here that the
reader is familiar with Cleveland, the first journal paper to introduce LOWESS.
Note that since Cleveland and several coworkers have continued to research and improve
the local regression methodology, notably allowing for multiple carriers and also working
out methods for inference; see Cleveland and Devlin or chapter of Chambers
and Hastie. Also, the name of LOWESS has been changed to loess. However,
nothing has been changed in the algorithm used to robustify LOWESS/loess, and here I
am only considering these robustness properties. Therefore, all the subsequent references
to LOWESS equally apply to loess.

November 6, 2025 at 5:57 PM

Xan Gregg

@xangregg.bsky.social

Did you know (robust) loess fitting can fail if the data is already smooth? I made a notebook that shows the flawed fit (red) along with a possible improvement (blue), using Cleveland's original demonstration curve. #stats
observablehq.com/@xangregg/lo...

Screenshot of an interactive Observable notebook. Includes a graph containing data points and curves. A gray curve is the function used to generate the data by adding a small amount of noise. A red curve is a robust loess fit. A blue curve is a robust loess fit using a new outlier rule.

November 6, 2025 at 5:50 PM
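
One way a robust loess fit can go wrong on smooth data can be sketched with the standard robustness weights from Cleveland's LOWESS: each residual is divided by six times the median absolute residual and passed through a bisquare function. When most residuals are tiny, that scale is tiny too, so a point with only a modest residual can lose all of its weight. This is an illustrative Python sketch, not the notebook's code. The function name and sample residuals are assumptions, and the notebook's actual failure case and proposed outlier rule may differ.

```python
from statistics import median


def bisquare_robustness_weights(residuals):
    """Robustness weights B(e / (6 * s)), where s is the median absolute residual."""
    s = median(abs(e) for e in residuals)
    if s == 0:
        # Degenerate scale: only points that are fitted exactly keep any weight.
        return [1.0 if e == 0 else 0.0 for e in residuals]
    weights = []
    for e in residuals:
        u = e / (6.0 * s)
        # Bisquare: smooth down-weighting inside |u| < 1, zero weight outside.
        weights.append((1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0)
    return weights


# Hypothetical residuals from a nearly perfect first-pass fit:
# five tiny ones and one that is only modestly larger.
residuals = [0.001, -0.002, 0.001, -0.001, 0.002, 0.05]
weights = bisquare_robustness_weights(residuals)
print(weights)
```

With these made-up residuals the last point is dropped entirely even though 0.05 may be unremarkable on the scale of the data, which is the kind of sensitivity a different outlier rule would need to address.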

Xan Gregg

@xangregg.bsky.social

New blog post looking at some recently-shared NCAA football player data. The scatterplot is percent drafted to NFL against average player high school rating by college. Also trying out inward-jittered, smoothed dot plots.
rawdatastudies.com/2025/10/26/n...

Scatter plot with a fitted regression spline of "average play rating" on X and "NFL draft rate" on the Y.

Dot chart of player rating on the X axis and Draft status (true or false) on the Y axis. Values are jittered inward in the Y direction, resulting in two smoothed Wilkinson-like hexagonal-grid dot plots with the top one inverted.

October 27, 2025 at 2:14 PM

Xan Gregg

@xangregg.bsky.social

Dot plot #dataviz comparison: ratings of FIDE chess Grand Masters via Tidy Tuesday.
1 Nearest stacks (Wilkinson)
2 Smoothed stacks
3 Smoothed hexagonal grid
4 Exact position (beeswarm)
Smoothing trades delta-x for spikiness (deviation from kernel density estimate).

Dot plot of FIDE Grand Master chess ELO ratings. Most are between 2400 and 2600. Each dot is placed in the nearest stack.
Source data: https://github.com/rfordatascience/tidytuesday/blob/main/data/2025/2025-09-23/readme.md

Dot plot of FIDE Grand Master chess ELO ratings. Most are between 2400 and 2600. Each dot is placed in the nearest stack or the next nearest if it helps avoid spikes.

Dot plot of FIDE Grand Master chess ELO ratings. Most are between 2400 and 2600. Each dot is placed at its true x position, adjusted vertically to avoid overlap.

September 27, 2025 at 12:17 PM
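
The "nearest stacks" placement in item 1 can be sketched as rounding each rating to the closest column on a fixed grid one dot wide, then stacking upward. This is a simplified fixed-grid version in Python for illustration, not Wilkinson's exact algorithm and not the code behind these charts. `nearest_stack_dot_plot`, `dot_width`, and the sample ratings are hypothetical.

```python
from collections import Counter


def nearest_stack_dot_plot(values, dot_width):
    """Return (drawn_x, stack_level, true_value) for each dot."""
    heights = Counter()  # current height of each column
    dots = []
    for v in sorted(values):
        column = round(v / dot_width)  # index of the nearest stack
        dots.append((column * dot_width, heights[column], v))
        heights[column] += 1
    return dots


ratings = [2498, 2500, 2503, 2520, 2541]  # made-up ratings
dots = nearest_stack_dot_plot(ratings, dot_width=10)

# Delta-x: the largest horizontal shift any dot took to join a stack.
max_shift = max(abs(x - v) for x, _, v in dots)
print(dots, max_shift)
```

The smoothed variants in items 2 and 3 would let a dot move to a neighboring column when that lowers a spike, which increases `max_shift` in exchange for a profile closer to a density estimate.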

Xan Gregg

@xangregg.bsky.social

JMP 19 is out (free trial available), and I wrote a blog post about the main things I worked on. Constrained smoothers, jitter options, easier arrows, parallel y axes, ... #dataviz
community.jmp.com/t5/JMPer-Cab...

Image from linked blog post showing hexagonal jitter example. Penguin body mass colored by sex.

September 19, 2025 at 12:53 PM

Xan Gregg

@xangregg.bsky.social

Yay, I was able to reproduce the lines in this chart precisely from the raw data. The original shows summary dots where mine shows raw data dots, and at a couple zoom levels. The power of statistics; signal and noise. www.nature.com/articles/s41...

Figure 2a from Nature paper https://www.nature.com/articles/s41586-025-09321-3. Shows a line chart over a bubble chart.

Recreation of previous chart but showing raw data dots instead of summarized bubbles.

Same as previous chart, but with the Y axis zoomed out a bit.

Same as previous chart, but with the Y axis zoomed out to show all the data (and fitted lines look almost flat).

August 22, 2025 at 7:26 PM

Xan Gregg

@xangregg.bsky.social

What to make of a paper that shares a ton of well-organized data and code for its charts, but not enough detail for analysis? PII concerns, maybe.
Curiously, these line charts are random data, suggesting steadier step counts. www.nature.com/articles/s41...

Excerpt from Figure 1 of https://www.nature.com/articles/s41586-025-09321-3, showing part of a US map with two inset line charts indicating the daily step counts before and after a move from Seattle to San Francisco. The lines are relatively flat, at around 6000 steps for Seattle and 6700 steps for San Francisco.

Screenshot from a Python notebook shared with the paper. Code part reads: with plt.rc_context({'figure.autolayout': True}): fig, ax = plt.subplots(figsize=(4, 2)); pre_x = range(-35, -5); y = np.random.normal(loc=from_df.loc[from_df['from_loc'] == 'Seattle, WA', 'pre_avg'], scale=50., size=(len(pre_x), )); plt.plot(pre_x, y, lw=5., c='#aa3939'); plt.ylim(5800, 7000); ax.grid(False); for item in ([ax.xaxis.label, ax.yaxis.label] + ax.get_xticklabels() + ax.get_yticklabels()): item.set_fontsize(axis_fontsize); ax.set(xlabel=r'Days from Move $\left(t - t_{move}\right)$', ylabel='Daily Steps', xticks=range(-35, -4, 10)); fig.tight_layout(); plt.savefig('../output/fig1b_subplots/seattle_from.png', dpi=600);

August 20, 2025 at 12:48 AM

Xan Gregg

@xangregg.bsky.social

Here's the smoothed grid with dots colored by their value's ones digit (walkScore % 10), and a superposition attempt, with smoothed in gray. (I didn't quite get the walk score per dot width to be an exact number of pixels.) Hope these capture the diagnostic you're looking for.

Smoothed dot plot with dots colored according to their true values.

Overlaid smoothed and unsmoothed dot plots

August 17, 2025 at 8:59 PM

Xan Gregg

@xangregg.bsky.social

Quick dot plot #dataviz study with 2500 US city Walk Scores. Plain dot plot (exact because scores are integers), with smoothing (±1), and with hexagonal placement (±0.75). Data from www.walkscore.com

Dot plot of Walk Scores (0-100 scale) of 2500 US cities. The bulk is centered around 25-45 range. Occasional spikes with one big one at 37.

Dot plot of Walk Scores (0-100 scale) of 2500 US cities. The bulk is centered around 25-45 range. Smoothed pile heights.

Dot plot of Walk Scores (0-100 scale) of 2500 US cities. The bulk is centered around 25-45 range. Smoothed pile heights on a hexagonal grid.

August 17, 2025 at 4:57 PM

Xan Gregg

@xangregg.bsky.social

It can't be a ratio of the changes since the denominator could be very small, even 0. However, using (total + first)/(total+latest) is no good since base is so much bigger. It seems like some smoothing/annualizing is happening. Closest I could get was a 12-month cumulative error versus the total.

August 13, 2025 at 2:32 PM

Xan Gregg

@xangregg.bsky.social

Rare sighting of letter-values plots in the wild. Nicely described in the caption as "plots which first identify the median, then extend boxes outward, each covering half of the remaining data." n=2.9M, so regular box plots would be swamped with outliers. #dataviz
arxiv.org/pdf/2402.14583

Chart with 5 letter-values plots, a variant of box plots.

August 1, 2025 at 12:13 PM
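
The caption's recipe (find the median, then extend boxes outward, each covering half of the remaining data) can be sketched as a sequence of paired quantiles at 1/4 and 3/4, 1/8 and 7/8, and so on. This Python sketch uses simple interpolated quantiles and a fixed number of levels. Both are assumptions for illustration, and published letter-value plots use their own depth rules and stopping criteria.

```python
def quantile(sorted_values, p):
    """Linearly interpolated quantile of pre-sorted data, 0 <= p <= 1."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def letter_values(values, levels=4):
    """Return the median and a list of (lower, upper) box edges, innermost first."""
    data = sorted(values)
    boxes = []
    tail = 0.25  # fraction of the data left outside the box on each side
    for _ in range(levels):
        boxes.append((quantile(data, tail), quantile(data, 1 - tail)))
        tail /= 2  # the next box covers half of what remains in each tail
    return quantile(data, 0.5), boxes


center, boxes = letter_values(range(1, 18), levels=3)
print(center, boxes)
```

With millions of rows, many more levels can be drawn before the outermost boxes run out of data, which is why this display copes better than a regular box plot at that scale.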

Xan Gregg

@xangregg.bsky.social

The originals could serve as fodder for some #dataviz guides. When the zero-origin rule breaks down or when to use dots/lines instead of bars.

Bar charts from https://www.dailymail.co.uk/sciencetech/article-13739705/london-underground-hottest-line.html, showing average temperature for 10 years using bar charts with origins at 0°C. All bars in the 25 to 30 range, showing little variation at the scale.

July 23, 2025 at 12:25 PM

Xan Gregg

@xangregg.bsky.social

Great improvement sequence, but for me, it's harder to verify which categories are changing after putting their bars in separate groups. I see it's a trade-off with simplifying the coloring. Here's a try at sticking with the original ordering, at the cost of an imperfect time legend.

bar chart with 15 bars for 5 categories across 3 time periods each.

July 16, 2025 at 12:11 PM

Xan Gregg

@xangregg.bsky.social

This article by Don Wheeler has a good discussion of Grubbs' test and others. www.qualitydigest.com/inside/stati... [free reg reqd]
He's a control charts expert, which explains the sequence-based context and small data sizes.

Chart excerpted from https://www.qualitydigest.com/inside/statistics-column/some-outlier-tests-part-2-011121.html by Donald Wheeler showing several overlaid curves. Each shows the probability that outliers found are real versus data set size for several outlier tests.

June 30, 2025 at 7:22 PM

Xan Gregg

@xangregg.bsky.social

I need to do a full write-up. The green ones are my experimental inventions. One idea was that "shortest half" and related intervals would make good data-driven density intervals. They seem better for very skewed distributions like exponential, but maybe not so great in general.

14 one dimensional views of an exponential distribution of 5000 values.

June 30, 2025 at 7:11 PM

Xan Gregg

@xangregg.bsky.social

Round 2 of my 1D #dataviz experiment at xangregg.github.io/data-strips/.
I realized my adaptive outlier idea was already done as Grubbs' test, which I've adapted for non-gaussian moments.
Added a couple thirds-based views. Here's 5000 random normal samples plus 2 outliers. The green ones use Grubbs.

14 one-dimensional views of a data set of 5000 random normal values and two larger outliers. Traditional HDR and Box Plot have outlier tests that don't scale this high, but Grubbs' test does.

June 30, 2025 at 6:55 PM

Xan Gregg

@xangregg.bsky.social

Thanks! Overcoming HDR drawbacks was an initial motivation.
1) many false outliers for big data
2) no outliers for small data
3) regions can extend beyond the data
Still tuning, but the code is in JavaScript at github.com/xangregg/dat....

13 1-D views of a small data set having 10 values, 9 pulled from a random normal and one outlier. Classic HDR doesn't identify the outlier and overshoots the other values.

June 24, 2025 at 7:44 PM

Xan Gregg

@xangregg.bsky.social

Different data sets:
Old Faithful eruption intervals bimodal data,
Random bimodal
Random exponential
Random lognormal

Same panel of 1D displays as in parent post but for common Old Faithful eruption times data set.

Same panel of 1D displays as in parent post but for bimodal random normal data set.

Same panel of 1D displays as in parent post but for random exponential data set.

Same panel of 1D displays as in parent post but for random lognormal data set.

June 23, 2025 at 5:09 PM

Xan Gregg

@xangregg.bsky.social

6 old and 7 new 1D #dataviz. Trying Shortest Half with a twist: one break is allowed. Also half-sample mode & count-adaptive outlier thresholds. Here's 1000 random normal points plus two outliers. The green ones are new. Try it at xangregg.github.io/data-strips/

Screenshot of a web app displaying 13 one-dimensional horizontal views of the same data set. Summary of each:
1. Heatmap of fixed-sized intervals colored by relative count.
2. Kernel Density Estimation using a gaussian kernel and multiple bandwidths.
3. Rug plot with a vertical line at each data point, up to 2000 points.
4. Density Strip: KDE using color and a bandwidth multiplier of 50%.
5. Highest Density Regions plot with cut-off points at 0.0, 0.5, 0.95, and 0.99.
6. Highest Density Regions plot with cut-off points at 0.0, 0.5, and an adaptive outlier threshold based on a gaussian extrapolation of the middle region.
7. The shortest contiguous half that contains at least 50% of the data, applied iteratively to show 1/2, 1/4, 1/8, 1/16 and the shortest half mode.
8. Quantile regions at percentiles that correspond to equal intervals if the data is Gaussian. Each region would be one standard deviation wide.
9. Density Rug: Shortest regions of 20%, 50%, and 80% of the data, allowing for split regions penalized according to the Split Penalty. Other values are shown as a rug plot, except touching values are connected as a single region to avoid looking more dense than the shortest regions.
10. Shorth and Mode: Shortest half of the data as one or two contiguous intervals; any split is penalized according to the Split Penalty. A vertical line shows the "half sample mode" which is the iteratively applied shortest half.
11. Inner region shows Interquartile Range (IQR) of the data with a line at the median.
12. Inner region shows Interquartile Range (IQR) of the data with a line at the median. When the outer region extends beyond the common box plot whiskers (1.5 IQR), the endcaps are shown as arcs.
13. Box Plot: Inner region shows Interquartile Range (IQR) of the data with a line at the median.

June 23, 2025 at 5:02 PM
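
The shortest half in view 7 and the "half sample mode" in view 10 can be sketched as follows: slide a window holding half of the sorted points, keep the narrowest one, and repeat on that subset until only one or two points remain. This is a minimal Python sketch of the plain contiguous version, assuming ties go to the leftmost window. It does not attempt the split regions or Split Penalty described above, and it is not the app's JavaScript code.

```python
import math


def shortest_half(sorted_values):
    """Narrowest contiguous window holding at least half of the sorted points."""
    n = len(sorted_values)
    k = math.ceil(n / 2)  # number of points the window must contain
    start = min(
        range(n - k + 1),
        key=lambda i: sorted_values[i + k - 1] - sorted_values[i],
    )
    return sorted_values[start:start + k]


def half_sample_mode(values):
    """Apply the shortest half repeatedly; return the midpoint of what is left."""
    data = sorted(values)
    while len(data) > 2:
        data = shortest_half(data)
    return sum(data) / len(data)


sample = [1, 2, 10, 11, 11.5, 12, 30]  # made-up values with a cluster near 11
half = shortest_half(sorted(sample))
mode = half_sample_mode(sample)
print(half, mode)
```

Stopping after one, two, three, or four rounds of `shortest_half` gives nested regions like the 1/2, 1/4, 1/8, and 1/16 bands in view 7.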

Xan Gregg

@xangregg.bsky.social

This bilinear fit prompted me to try out a p-spline idea. P-splines are additive models that minimize adjacent slope differences. What if we relax the weight on the difference at the knot with highest curvature and re-optimize? ... Not bad.

Time series chart of day of peak cherry blossoming each year over time for over 1000 years. A p-spline smoother shows the overall trend, which sharply declines after 1890. Data from the parent post. A reference line at 1890 corresponds to the change point from the parent post's paper.

June 11, 2025 at 6:59 PM

Xan Gregg

@xangregg.bsky.social

Adding to my collection of bold Data Availability statements. Malicious compliance, or am I missing something? alzres.biomedcentral.com/articles/10....

Picture of text from a paper reading:
"Data handling and statistical analysis
Data were analyzed blindly. Data are presented as the mean ± SEM. Statistical analysis was performed with SPSS 18.0 software. The test of Kolmogorov–Smirnov with the correction of Lilliefors was used to evaluate normal distribution and the test of Levene to evaluate the homogeneity of variance. Significance was analyzed by one-way ANOVA and Bonferroni’s multiple comparison post hoc test. Significance was considered when p < 0.05.
Data availability
No datasets were generated or analysed during the current study."

June 9, 2025 at 3:30 PM

Xan Gregg

@xangregg.bsky.social

Thinking a little more, using bars allows coloring each response instead of using lines in the penumbra region. And since the yellow values were not part of the survey (they're just US demographics), they should have a different encoding, esp. to not obscure the survey response values.

Stacked bar chart of data in previous post (paper Social penumbras predict political attitudes by Gelman and Margalit). Using shades of gray for each response and yellow lines for population demographics.

May 31, 2025 at 4:35 PM

Xan Gregg

@xangregg.bsky.social

That chart is mostly just from the survey responses, not the model, so their provided data works pretty well. Here's a quick stacked bar version, not trying to mimic their weighting and imputation. My data in ALT text if you want to experiment.

Stacked bar that remakes the circle area chart in Figure 2 of https://sites.stat.columbia.edu/gelman/research/published/penumbra.pdf.

Each bar corresponds to a survey topic and the lengths are the number of people in the category and the number of people who know people in that category, for friends, family and other acquaintances.

Raw data for this chart (which may be slightly different from original due to imputation and weighting steps).
Question,Group,Cumulative Pct
am,us,0.6%
am,fam,20.3%
am,ff,31.6%
am,acq,44.5%
un,us,4.7%
un,fam,31.9%
un,ff,44.0%
un,acq,52.7%
gl,us,3.6%
gl,fam,32.5%
gl,ff,53.4%
gl,acq,76.8%
ab,us,2.0%
ab,fam,3.8%
ab,ff,6.1%
ab,acq,9.9%
lost,us,4.2%
lost,fam,26.4%
lost,ff,37.6%
lost,acq,46.2%
mus,us,3.4%
mus,fam,3.0%
mus,ff,10.8%
mus,acq,29.2%
nra,us,2.0%
nra,fam,27.5%
nra,ff,36.4%
nra,acq,42.2%
gun,us,24.0%
gun,fam,63.2%
gun,ff,74.2%
gun,acq,79.0%
im,us,1.9%
im,fam,2.7%
im,ff,8.1%
im,acq,17.6%
welf,us,21.0%
welf,fam,30.0%
welf,ff,39.4%
welf,acq,51.8%
serh,us,25.0%
serh,fam,58.8%
serh,ff,69.5%
serh,acq,75.6%
ca,us,17.0%
ca,fam,30.2%
ca,ff,40.5%
ca,acq,47.0%
mort,us,6.6%
mort,fam,19.7%
mort,ff,27.5%
mort,acq,33.1%
noh,us,16.0%
noh,fam,33.0%
noh,ff,41.5%
noh,acq,47.6%

May 31, 2025 at 2:58 PM
