Ashton Six
@ashtonsix.com
Research Engineer (software), with interests in superoptimisation, fast integer compression, and indexing for OLAP
I want bare metal instances that can launch within 2-3 seconds, for a better (local dev <-> remote execution) REPL workflow

vs. fast-launching containers (e.g., Cloud Run), bare metal gives me more reliable benchmark measurements and the ability to bring tools like Nsight
January 17, 2026 at 6:28 PM
i've found some success sticking to SIMD-friendly scalar patterns

i get the loop order right (polyhedral analysis), add hints like `#pragma omp simd`, and compile at -O3: that's _usually_ enough. you can check the output with -S (gives readable ASM)

or use SIMDe, that works too
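
something like this is what i mean (a toy sketch, my own names, not from any real project):

// unit-stride, no loop-carried dependency, restrict to rule out aliasing,
// plus the pragma so the compiler knows i expect it to vectorise
#include <stddef.h>

void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n) {
  #pragma omp simd
  for (size_t i = 0; i < n; i++)
    out[i] = a[i] * k + b[i];
}

// cc -O3 -fopenmp-simd -S scale_add.c && less scale_add.s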
January 17, 2026 at 2:34 AM
This feels like a continuation of the reduction operators introduced with Hopper's TMA (cp.reduce.async.bulk). Fun fact! Data movement often dominates power usage vs compute because of physics: thicker, longer wires = more energy needed to transmit each bit. Makes a lot of sense to optimise here.
1. Introduction — PTX ISA 9.1 documentation
docs.nvidia.com
January 17, 2026 at 2:02 AM
Mmm! Nice corollary: software optimisations for prefix sums (re-parenthesizing) generalise across associative ops: +, ^, prefix-of-prefix.

I made a thread about it: bsky.app/profile/asht...
I got SOTA (L1-hot, SIMD) on prefix sum by ADDING instructions (7.7 GB/s → 19.8 GB/s). Consider:

for i = 0..n: out[i] = out[i-1] + in[i]

This SUCKS, because out[i] must wait on out[i-1]. There's an unbroken dependency chain which disrupts Instruction Level Parallelism (ILP). 1/
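
A minimal C sketch of that recurrence (mine, not the repo's code), with OP standing in for any associative operator per the corollary above:

/* OP can be +, ^, min, ...: the scan recurrence only needs associativity */
#define OP(a, b) ((a) + (b))

/* assumes n >= 1 */
void prefix_serial(const unsigned *in, unsigned *out, int n) {
    out[0] = in[0];
    for (int i = 1; i < n; i++) {
        /* out[i] can't start until out[i-1] is done: an n-long chain of
           dependent ops, so the loop runs at ~1 element per op latency
           no matter how wide the core is */
        out[i] = OP(out[i - 1], in[i]);
    }
}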
January 17, 2026 at 1:31 AM
Full write-up, implementation (NEON) and benchmark results (Graviton4) here: github.com/ashtonsix/pe...

I love solving these kinds of performance puzzles—and I'm currently available for hire! Reach out if interested 😊. 3/3
perf-portfolio/delta at main · ashtonsix/perf-portfolio
HPC research and demonstrations. Contribute to ashtonsix/perf-portfolio development by creating an account on GitHub.
github.com
January 17, 2026 at 12:55 AM
The ILP trick:

# Local prefix sums
out[0..3] = prefix(in[0..3])
out[4..7] = prefix(in[4..7])
...

# Late carry broadcast (redundant compute)
out[4..7] += out[3];
out[8..11] += out[7];
...

By delaying the carry we allow the CPU to compute all local prefix sums in parallel, >doubling throughput. 2/
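
A portable C sketch of the same re-parenthesization (the repo's version uses NEON intrinsics; here the block width of 4 matches a 128-bit vector, and n is assumed to be a multiple of 4):

#include <stddef.h>
#include <stdint.h>

void prefix_sum_blocked(const uint32_t *in, uint32_t *out, size_t n) {
    /* 1. local prefix sums: each block of 4 depends only on its own inputs,
       so the CPU (or the vectoriser) can overlap all of them */
    for (size_t b = 0; b < n; b += 4) {
        out[b]     = in[b];
        out[b + 1] = out[b]     + in[b + 1];
        out[b + 2] = out[b + 1] + in[b + 2];
        out[b + 3] = out[b + 2] + in[b + 3];
    }
    /* 2. late carry broadcast: one dependent add per block instead of one
       per element, so the serial chain is 4x shorter */
    for (size_t b = 4; b < n; b += 4) {
        uint32_t carry = out[b - 1];   /* full prefix through the previous block */
        out[b]     += carry;
        out[b + 1] += carry;
        out[b + 2] += carry;
        out[b + 3] += carry;
    }
}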
January 17, 2026 at 12:55 AM