Gegell
gegellibu.bsky.social
All of the previous implementations and an exhaustive test suite that verifies all 2^16 inputs are available on Shadertoy: www.shadertoy.com/view/wfVyz3
December 6, 2025 at 10:12 PM
Also, it turns out that for the 3D variant the additional space between the active bits can be exploited even better than in the 2D version, reducing the combine step for Morton values from a 512^3 grid to only 5 operations!

And for Part1By2 there is enough space between the bits to use the multiplications there as well.
December 6, 2025 at 10:12 PM
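To illustrate the Part1By2 remark, here is a minimal sketch in Python rather than GLSL. It assumes the widely used 10-bit Part1By2 mask sequence, which is not necessarily the code on Shadertoy, and the name `part_1_by_2_mul` is made up for this example. At every step the value and its shifted copy occupy disjoint bits, so each shift-and-combine pair can be written as one multiplication.

```python
def part_1_by_2_mul(x: int) -> int:
    """Spread the low 10 bits of x so that bit i lands on bit 3*i.

    Assumed layout: the common Part1By2 masks. Each `x * (1 + 2**n)`
    equals `x | (x << n)` here because the two copies never overlap.
    """
    x &= 0x000003FF                   # keep 10 input bits
    x = (x * 0x10001) & 0xFF0000FF    # x + (x << 16)
    x = (x * 0x101) & 0x0300F00F      # x + (x << 8)
    x = (x * 0x11) & 0x030C30C3       # x + (x << 4)
    x = (x * 0x5) & 0x09249249        # x + (x << 2)
    return x
```

Each mask is at most 32 bits wide, so the unbounded Python integers give the same result a 32-bit uint would.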
Additionally, we can make use of the word boundaries to remove the last 0xffff_0000 mask. Shifting left by 1 discards any bits generated past the 16th bit. Bits 15-8, which still retain some values, get cleared by the final shift back towards the LSB.

The shift can be combined into any of the multiplications.
December 6, 2025 at 2:59 AM
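A sketch of the word-boundary idea, in Python rather than GLSL. The masks and the name `compact_1_by_1_nomask` are assumptions chosen for this illustration, not the code from Shadertoy. The extra left shift by 1 is folded into the last multiplication (514 = 2 * 257), so the result lands in bits 31-16, anything pushed past bit 31 falls off the word, and the leftover bits 15-8 vanish in the final right shift.

```python
MASK32 = 0xFFFFFFFF  # Python ints are unbounded; this only emulates uint wraparound


def compact_1_by_1_nomask(x: int) -> int:
    """Gather the 16 even-position bits of a 32-bit value into the low 16 bits."""
    x &= 0x55555555                 # keep the even bits
    x = (x * 3) & 0x66666666        # x + (x << 1): pairs at bits 1-2, 5-6, ...
    x = (x * 5) & 0x78787878        # x + (x << 2): nibbles at bits 3-6, 11-14, ...
    x = (x * 17) & 0x7F807F80       # x + (x << 4): bytes at bits 7-14 and 23-30
    x = (x * 514) & MASK32          # (x << 1) + (x << 9): no final mask needed
    return x >> 16                  # drops the stale bits 8-15
```

On a real 32-bit uint the `& MASK32` costs nothing, which leaves 9 operations in this arrangement.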
The same thing can be done with the original code (although this time we collect the bits towards the MSB instead of the LSB, flipping >> to <<). Doing this reduces the number of operations to only 10.
December 6, 2025 at 2:59 AM
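One way such a 10-operation compaction could look, sketched in Python rather than GLSL. The masks and the name `compact_1_by_1_mul` are assumptions worked out for this example and may differ from the Shadertoy code. The bits are gathered towards the MSB, each `x | (x << n)` becomes a multiplication by `1 + 2**n`, and a single right shift at the end brings the result down.

```python
def compact_1_by_1_mul(x: int) -> int:
    """Gather the 16 even-position bits of a 32-bit value into the low 16 bits."""
    x &= 0x55555555                 # op 1: keep the even bits
    x = (x * 3) & 0x66666666        # ops 2-3: x + (x << 1), pairs at bits 1-2, 5-6, ...
    x = (x * 5) & 0x78787878        # ops 4-5: x + (x << 2), nibbles at bits 3-6, 11-14, ...
    x = (x * 17) & 0x7F807F80       # ops 6-7: x + (x << 4), bytes at bits 7-14 and 23-30
    x = (x * 257) & 0x7FFF8000      # ops 8-9: x + (x << 8), 16 bits at bits 15-30
    return x >> 15                  # op 10: move the result to the LSB
```

Every mask fits in 32 bits, so the unbounded Python integers behave like a 32-bit uint here.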
For the above example, if x, (x << a) and (x << b) all have bits in non-overlapping regions, we can safely replace the | with a +. Remembering that x << n = x * (2**n), we can then write x + x * (2**a) + x * (2**b) = x * (1 + 2**a + 2**b), effectively factoring it into a single multiplication.
December 6, 2025 at 2:59 AM
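A small Python check of the factoring argument. The function names and the shift amounts are picked for illustration only. With an 8-bit x and shifts of 8 and 16 the three copies never overlap, so OR and multiply agree; with overlapping copies the additions produce carries and the identity breaks.

```python
def combine_with_or(x: int, a: int, b: int) -> int:
    """Three shifted copies of x merged with bitwise OR."""
    return x | (x << a) | (x << b)


def combine_with_mul(x: int, a: int, b: int) -> int:
    """The same three copies summed via one multiplication."""
    return x * (1 + (1 << a) + (1 << b))


# Non-overlapping: an 8-bit x with a=8, b=16 occupies bits 0-7, 8-15, 16-23.
agree = all(combine_with_or(x, 8, 16) == combine_with_mul(x, 8, 16) for x in range(256))

# Overlapping: x=3 with a=1, b=2 makes the copies collide, so + carries.
differ = combine_with_or(3, 1, 2) != combine_with_mul(3, 1, 2)
```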
Note that your `decode_morton2_65536x65536` should use `_compact_1_by_1(uvec2(x, x >> 1))` instead of `uvec2(x, x<<1)`. Otherwise you're discarding the highest bit and unintentionally multiplying one value by 2.
December 5, 2025 at 8:40 PM
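The effect of the wrong shift direction can be reproduced with a short Python sketch. It assumes the textbook mask-based spread and compact routines and the usual layout with x on the even bits and y on the odd bits. The names mirror the ones in the post, but the bodies are not the original shader code.

```python
def part_1_by_1(x: int) -> int:
    """Spread the low 16 bits of x onto the even bit positions."""
    x &= 0x0000FFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x


def compact_1_by_1(x: int) -> int:
    """Inverse of part_1_by_1: gather the even bits into the low 16 bits."""
    x &= 0x55555555
    x = (x | (x >> 1)) & 0x33333333
    x = (x | (x >> 2)) & 0x0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF
    return x


def encode_morton2(x: int, y: int) -> int:
    return part_1_by_1(x) | (part_1_by_1(y) << 1)


def decode_morton2(code: int) -> tuple[int, int]:
    # Correct: shift right so the odd (y) bits land on even positions.
    return compact_1_by_1(code), compact_1_by_1(code >> 1)


def decode_morton2_wrong_shift(code: int) -> tuple[int, int]:
    # Shifting left instead (emulated on a 32-bit word) yields (2 * y) & 0xFFFF.
    return compact_1_by_1(code), compact_1_by_1((code << 1) & 0xFFFFFFFF)
```

For example, with y = 0x8001 the left-shift variant returns 0x0002 for the second component: the top bit is gone and the rest is doubled.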