vs. fast-launching containers (e.g., Cloud Run), bare metal gives me more reliable benchmark measurements and the ability to bring in tools like Nsight
I get the loop order right (polyhedral analysis), add hints like `#pragma omp simd`, and build at `-O3`: that's _usually_ enough. You can check the output with `-S` (gives readable assembly).
Or use SIMDe, that works too.
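Roughly the kind of loop I have in mind (a made-up `scale_add` kernel, just for illustration, not code from the original posts):

```c
#include <stddef.h>

/* Simple unit-stride kernel: with the loop order right and the SIMD
 * hint, -O3 usually auto-vectorizes this without any intrinsics. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float s, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        out[i] = s * a[i] + b[i];
    }
}
```

Build with something like `gcc -O3 -fopenmp-simd -S scale_add.c` (Clang accepts the same flags) and look for packed instructions like `vmulps`/`vfmadd231ps` in the emitted `.s` file; if you only see scalar `ss` instructions, the loop didn't vectorize.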
I made a thread about it: bsky.app/profile/asht...
out[0] = in[0]
for i = 1..n-1: out[i] = out[i-1] + in[i]
This SUCKS, because out[i] must wait on out[i-1]. There's an unbroken dependency chain, which disrupts Instruction-Level Parallelism (ILP). 1/
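Spelled out in C (my own sketch of the loop above, not the benchmark code itself):

```c
#include <stddef.h>

/* Naive inclusive prefix sum: every iteration reads the value the
 * previous iteration just wrote, so the adds form one long serial
 * chain and the rest of the core's ALU ports sit idle. */
void prefix_sum_serial(int *restrict out, const int *restrict in, size_t n) {
    if (n == 0) return;
    out[0] = in[0];
    for (size_t i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];   /* must wait on out[i-1] */
}
```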
# Local prefix sums
out[0..3] = prefix(in[0..3])
out[4..7] = prefix(in[4..7])
...
# Late carry broadcast (redundant compute)
out[4..7] += out[3];
out[8..11] += out[7];
...
By delaying the carry, we let the CPU compute all the local prefix sums in parallel, more than doubling throughput. 2/
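A rough C sketch of that idea, with a block size of 4 to match the pseudocode (the names and the scalar tail handling are my own illustration, not the exact benchmark code):

```c
#include <stddef.h>

#define BLOCK 4

/* Blocked prefix sum: first compute independent local prefix sums
 * (no cross-block dependency, so the CPU can overlap them), then
 * fix each block up by adding the running carry from the blocks
 * before it. Only the short per-block carry chain stays serial. */
void prefix_sum_blocked(int *restrict out, const int *restrict in, size_t n) {
    /* Local prefix sums: each block of 4 is independent. */
    for (size_t b = 0; b + BLOCK <= n; b += BLOCK) {
        out[b] = in[b];
        for (size_t i = 1; i < BLOCK; i++)
            out[b + i] = out[b + i - 1] + in[b + i];
    }

    /* Late carry broadcast: add the previous block's last element
     * (which already includes all earlier carries) to every element
     * of the current block. */
    for (size_t b = BLOCK; b + BLOCK <= n; b += BLOCK) {
        int carry = out[b - 1];
        for (size_t i = 0; i < BLOCK; i++)
            out[b + i] += carry;
    }

    /* Tail for n not a multiple of BLOCK (kept simple and serial). */
    size_t done = (n / BLOCK) * BLOCK;
    for (size_t i = done; i < n; i++)
        out[i] = (i ? out[i - 1] : 0) + in[i];
}
```

In a real SIMD version each block would typically live in a vector register and the carry add would be a single broadcast plus one vector add, but the dependency structure is the same.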
I love solving these kinds of performance puzzles—and I'm currently available for hire! Reach out if interested 😊. 3/3