camel-cdr.bsky.social
@camel-cdr.bsky.social
🐘 @camelcdr@tech.lgbt
I only slightly disagree with using a segmented load/store transpose. If you need to transpose from memory, fine, but for register-to-register, going through memory isn't the best. I'd use vslide1up/down or, in the future, vpaire/vpairo: github.com/ved-rivos/ri...
riscv-isa-manual/src/zvzip.adoc at zvzip · ved-rivos/riscv-isa-manual
RISC-V Instruction Set Manual. Contribute to ved-rivos/riscv-isa-manual development by creating an account on GitHub.
github.com
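A toy model of the register-to-register idea, with plain Python lists standing in for vector registers (the helper names are mine, not real RVV intrinsics): each pair of lanes (2i, 2i+1) across two registers is treated as a 2x2 block and transposed using only slide1up/slide1down plus a masked merge, no memory round trip.

```python
def slide1up(v, scalar=None):
    # vslide1up.vx: elements move up one lane, scalar enters lane 0
    return [scalar] + v[:-1]

def slide1down(v, scalar=None):
    # vslide1down.vx: elements move down one lane, scalar enters the top lane
    return v[1:] + [scalar]

def merge(mask, vtrue, vfalse):
    # vmerge.vvm: take vtrue where the mask bit is set, vfalse elsewhere
    return [t if m else f for m, t, f in zip(mask, vtrue, vfalse)]

def transpose_pairs(a, b):
    # [a0,a1,a2,a3], [b0,b1,b2,b3] -> [a0,b0,a2,b2], [a1,b1,a3,b3]
    even = [i % 2 == 0 for i in range(len(a))]
    ra = merge(even, a, slide1up(b))
    rb = merge(even, slide1down(a), b)
    return ra, rb
```

This is only a sketch of the dataflow; a real RVV version would use vslide1up.vx/vslide1down.vx with an alternating mask in v0.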
October 23, 2025 at 10:51 AM
*correction: 0.5/0.5/2/4 for vector-scalar/immediate compares (0.5/2/4/8 for vector-vector)
September 25, 2025 at 5:59 PM
For the scalar instructions:
* 6-issue: add/sub/lui/xor/sll/shNadd/zext/clz/cpop/min/rotl/rev8/bext/...
* 3-issue: load/store
* 2-issue: fadd/fmul/fmacc/fmin/fcvt
* 1-issue: mul/mulh/feq/flt
* pipelined: fsqrt/fdiv: ~8.5, div/rem: 12-16
September 25, 2025 at 5:50 PM
My takeaway so far is to not be scared to use the segmented load/stores, and LMUL>1 permutes are good, but you probably want to avoid LMUL=8 ones when possible. I'll continue manually unrolling non-lane-crossing permutes. For LMUL>1 comparisons, it's better to use .vx/.vi over .vv when possible.
September 25, 2025 at 5:50 PM
The vslide1up/vslide1down do scale perfectly, though, with 0.5/1/2/4. It's not in the benchmark, but I hope vslideup/vslidedown with immediate "1" also do.

We'll have to wait for the other microbenchmarks to get a more complete picture.
September 25, 2025 at 5:50 PM
* Ovlt behavior isn't supported, but I don't really care much about it

The only bigger negative thing I've seen so far is that the vslideup/vslidedown instructions don't scale linearly or close to linearly with LMUL, even for a small immediate shift amount like "3".
September 25, 2025 at 5:50 PM
* dual-issue vrgather, with good scaling: 0.5/1/8/30
* dual-issue vcompress, with OK scaling: 0.5/3/6/17 (I still think this could get close to linear)
* Fault-only-first loads seem to have no overhead
* Segmented load/stores look quite fast, even the more exotic ones like seg7
September 25, 2025 at 5:50 PM
* Most instructions have an inverse throughput of 0.5/1/2/4 for LMUL=1/2/4/8, even vslide1up/down, 64-bit vmulh, viota, vpopc and integer reductions
* 0.5/0.5/1/2 for vector-scalar/immediate compares and 0.5/1/2/- for narrowing instructions (see "Microarchitecture speculations" section)
September 25, 2025 at 5:49 PM
Third Way is an unfortunate name: en.wikipedia.org/wiki/Third_W...
August 24, 2025 at 3:25 PM
Reposted
So if you are currently involved with ISA-level decisions about inclusion of any pext/pdep-like instructions:

Please consider including SAG/inverse-SAG with bit-reversal of the goats.

No matter which of the two implementation methods you are using: All you need to do is not mask the goat bits.
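A hypothetical Python reference for what a SAG with bit-reversed goats computes (the function name and the exact goat ordering convention are my assumptions; see the edu-sag repo linked below for an actual Verilog reference): sheep (mask=1 bits) pack to the low end in order, goats (mask=0 bits) fill the high end in reversed order, which is the order they can fall out of a butterfly network when you simply don't mask them off.

```python
def sag_rev_goats(x, mask, width):
    # sheep: bits of x where the mask is 1, low to high
    sheep = [(x >> i) & 1 for i in range(width) if (mask >> i) & 1]
    # goats: bits of x where the mask is 0, low to high
    goats = [(x >> i) & 1 for i in range(width) if not (mask >> i) & 1]
    # sheep packed low in order, goats packed high bit-reversed
    bits = sheep + goats[::-1]
    return sum(b << i for i, b in enumerate(bits))
```

With an all-ones mask this is the identity; with an all-zeros mask it degenerates to a full bit reversal.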
July 25, 2025 at 11:30 PM
It looks like the patent expires at the end of 2028. The earliest I could see an RVI extension ratified at this point is 2027, so it's definitely worth evaluating.

Also, the new diagrams are cool.
July 24, 2025 at 6:03 PM
I've only watched the last hour so far, but I quite liked your take on null-terminated strings.
C really has to be understood with its history in mind.
July 20, 2025 at 8:12 PM
Their V2 slides say that they have a macro-op cache equivalent in size to a regular 32 KiB icache.
It can store variable-length entries of up to 48 macro-ops, which can be fused from non-sequential instruction runs by collapsing taken branches.
July 11, 2025 at 8:59 PM
Ohh, the talk recordings are on YouTube: www.youtube.com/watch?v=1lwz...
CBP2025 - Opening Remarks - Rami Sheikh
YouTube video by Rami Sheikh
www.youtube.com
June 28, 2025 at 9:22 AM
Reposted
I wrote a reference implementation for a SAG without bit reflection: github.com/clairexen/ed..., and I wrote a parametric SAG core for any bit width: github.com/clairexen/ed...
edu-sag/param.v at main · clairexen/edu-sag
Educational 8-Bit Sheep-And-Goats (SAG) Verilog Reference IP - clairexen/edu-sag
github.com
June 20, 2025 at 4:04 PM
>>> import numpy as np
>>> lut=np.array([ord('a'),0,ord('e'),0,ord('i'),0,0,ord('o'),0,0,ord('u'),0,0,0,0,0], dtype=np.uint8)
>>> inp=np.frombuffer(b"test128aeiou72761xjs",dtype=np.uint8)
>>> lut[(inp&31)>>1]==inp
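The trick above works because (c & 31) >> 1 happens to be a perfect hash for the lowercase ASCII vowels: it maps a,e,i,o,u to the distinct indices 0,2,4,7,10, all below 16, so a single 16-entry table lookup (e.g. one vrgather on RISC-V V) can test a whole vector of bytes. A scalar sketch of the same idea (names are mine):

```python
# Build the 16-entry table: slot (v & 31) >> 1 holds the vowel itself,
# every other slot stays 0, so non-vowels never compare equal.
LUT = bytearray(16)
for v in b"aeiou":
    LUT[(v & 31) >> 1] = v

def is_vowel(c):
    # (c & 31) >> 1 is always < 16, so the lookup can't go out of range
    return LUT[(c & 31) >> 1] == c
```

Note this is case-sensitive, matching the original snippet: c & 31 folds case, but the final equality check only passes for the lowercase byte stored in the table.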
June 16, 2025 at 2:19 PM
4x 16-bit: 120 µm² 63% utilized, 5GHz met (49 slack)

2x 32-bit: 120 µm² 65% utilized, 5GHz met (52 slack)

1x 64-bit: 153 µm² 64% utilized, 5GHz met (14 slack)

So subsetting on SEW really doesn't make much sense compared to a .vx subset.
June 12, 2025 at 11:46 AM
I got OpenROAD working and tested the bfly part of your implementation (so without decode) in a SIMD setup.

asap7, targeting 5GHz, 75% placement density and 50% utilization:
June 12, 2025 at 11:46 AM