Ashton Six
@ashtonsix.com
Research Engineer (software), with interests in superoptimisation, fast integer compression, and indexing for OLAP
I want bare metal instances that can launch within 2-3 seconds, for a better (local dev <-> remote execution) REPL workflow

vs. fast-launching containers (e.g., Cloud Run), bare metal gives me more reliable benchmark measurements and the ability to bring tools like Nsight
January 17, 2026 at 6:28 PM
i've found some success sticking to SIMD-friendly scalar patterns

i get the loop order right (polyhedral analysis), add hints like `#pragma omp simd`, and compile at -O3: that's _usually_ enough. you can check the output with -S (gives readable ASM)

or use SIMDe, that works too
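
something like this is what i mean (a toy sketch, my own names, not from any real project):

// unit-stride, no loop-carried dependency, restrict to rule out aliasing,
// plus the pragma so the compiler knows i expect it to vectorise
#include <stddef.h>

void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n) {
  #pragma omp simd
  for (size_t i = 0; i < n; i++)
    out[i] = a[i] * k + b[i];
}

// cc -O3 -fopenmp-simd -S scale_add.c && less scale_add.s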
January 17, 2026 at 2:34 AM
This feels like a continuation of the reduction operators introduced with Hopper's TMA (cp.reduce.async.bulk). Fun fact! Data movement often dominates power usage vs compute because of physics: thicker, longer wires = more energy needed to transmit each bit. Makes a lot of sense to optimise here.
1. Introduction — PTX ISA 9.1 documentation
docs.nvidia.com
January 17, 2026 at 2:02 AM
Mmm! Nice corollary: software optimisations for prefix sums (re-parenthesizing) generalise across associative ops: +, ^, prefix-of-prefix.

I made a thread about it: bsky.app/profile/asht...
I got SOTA (L1-hot, SIMD) on prefix sum by ADDING instructions (7.7 GB/s → 19.8 GB/s). Consider:

for i = 0..n: out[i] = out[i-1] + in[i]

This SUCKS, because out[i] must wait on out[i-1]. There's an unbroken dependency chain which disrupts Instruction Level Parallelism (ILP). 1/
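
A minimal C sketch of that recurrence (mine, not the repo's code), with OP standing in for any associative operator per the corollary above:

/* OP can be +, ^, min, ...: the scan recurrence only needs associativity */
#define OP(a, b) ((a) + (b))

/* assumes n >= 1 */
void prefix_serial(const unsigned *in, unsigned *out, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; i++) {
        /* out[i] can't start until out[i-1] is done: an n-long chain of
           dependent ops, so the loop runs at ~1 element per op latency
           no matter how wide the core is */
        out[i] = OP(out[i - 1], in[i]);
    }
}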
January 17, 2026 at 1:31 AM
Full write-up, implementation (NEON) and benchmark results (Graviton4) here: github.com/ashtonsix/pe...

I love solving these kinds of performance puzzles—and I'm currently available for hire! Reach out if interested 😊. 3/3
perf-portfolio/delta at main · ashtonsix/perf-portfolio
HPC research and demonstrations. Contribute to ashtonsix/perf-portfolio development by creating an account on GitHub.
github.com
January 17, 2026 at 12:55 AM
The ILP trick:

# Local prefix sums
out[0..3] = prefix(in[0..3])
out[4..7] = prefix(in[4..7])
...

# Late carry broadcast (redundant compute)
out[4..7] += out[3];
out[8..11] += out[7];
...

By delaying the carry we allow the CPU to compute all local prefix sums in parallel, >doubling throughput. 2/
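
A portable C sketch of the same re-parenthesization (the repo's version uses NEON intrinsics; here the block width of 4 matches a 128-bit vector, and n is assumed to be a multiple of 4):

#include <stddef.h>
#include <stdint.h>

void prefix_sum_blocked(const uint32_t *in, uint32_t *out, size_t n) {
    /* 1. local prefix sums: each block of 4 depends only on its own inputs,
       so the CPU (or the vectoriser) can overlap all of them */
    for (size_t b = 0; b < n; b += 4) {
        out[b]     = in[b];
        out[b + 1] = out[b]     + in[b + 1];
        out[b + 2] = out[b + 1] + in[b + 2];
        out[b + 3] = out[b + 2] + in[b + 3];
    }
    /* 2. late carry broadcast: one dependent add per block instead of one
       per element, so the serial chain is 4x shorter */
    for (size_t b = 4; b < n; b += 4) {
        uint32_t carry = out[b - 1];   /* full prefix through the previous block */
        out[b]     += carry;
        out[b + 1] += carry;
        out[b + 2] += carry;
        out[b + 3] += carry;
    }
}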
January 17, 2026 at 12:55 AM