camel-cdr.bsky.social
@camel-cdr.bsky.social
🐘 @camelcdr@tech.lgbt
I only slightly disagree with using a segmented load/store transpose. If you need to transpose from memory, fine, but for register-to-register, going through memory isn't the best. I'd use vslide1up/down or, in the future, vpaire/vpairo: github.com/ved-rivos/ri...
riscv-isa-manual/src/zvzip.adoc at zvzip · ved-rivos/riscv-isa-manual
RISC-V Instruction Set Manual. Contribute to ved-rivos/riscv-isa-manual development by creating an account on GitHub.
github.com
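A toy model of the register-to-register idea, with plain Python lists standing in for vector registers (the helper names are mine, not real RVV intrinsics): each pair of lanes (2i, 2i+1) across two registers is treated as a 2x2 block and transposed using only slide1up/slide1down plus a masked merge, no memory round trip.

```python
def slide1up(v, scalar=None):
    # vslide1up.vx: elements move up one lane, scalar enters lane 0
    return [scalar] + v[:-1]

def slide1down(v, scalar=None):
    # vslide1down.vx: elements move down one lane, scalar enters the top lane
    return v[1:] + [scalar]

def merge(mask, vtrue, vfalse):
    # vmerge.vvm: take vtrue where the mask bit is set, vfalse elsewhere
    return [t if m else f for m, t, f in zip(mask, vtrue, vfalse)]

def transpose_pairs(a, b):
    # [a0,a1,a2,a3], [b0,b1,b2,b3] -> [a0,b0,a2,b2], [a1,b1,a3,b3]
    even = [i % 2 == 0 for i in range(len(a))]
    ra = merge(even, a, slide1up(b))
    rb = merge(even, slide1down(a), b)
    return ra, rb
```

This is only a sketch of the dataflow; a real RVV version would use vslide1up.vx/vslide1down.vx with an alternating mask in v0.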
October 23, 2025 at 10:51 AM
*correction: 0.5/0.5/2/4 for vector-scalar/immediate compares (0.5/2/4/8 for vector-vector)
September 25, 2025 at 5:59 PM
For the scalar instructions:
* 6-issue: add/sub/lui/xor/sll/shNadd/zext/clz/cpop/min/rotl/rev8/bext/...
* 3-issue: load/store
* 2-issue: fadd/fmul/fmacc/fmin/fcvt
* 1-issue: mul/mulh/feq/flt
* pipelined: fsqrt/fdiv: ~8.5, div/rem: 12-16
September 25, 2025 at 5:50 PM
My takeaway so far is to not be scared to use the segmented load/stores, and LMUL>1 permutes are good, but you probably want to avoid LMUL=8 ones when possible. I'll continue manually unrolling non-lane-crossing permutes. For LMUL>1 comparisons, it's better to use .vx/.vi over .vv when possible.
September 25, 2025 at 5:50 PM
The vslide1up/vslide1down do scale perfectly, though, with 0.5/1/2/4. It's not in the benchmark, but I hope vslideup/vslidedown with immediate "1" also do.

We'll have to wait for the other microbenchmarks to get a more complete picture.
September 25, 2025 at 5:50 PM
* Ovlt behavior isn't supported, but I don't really care much about it

The only bigger negative thing I've seen so far is that the vslideup/vslidedown instructions don't scale linearly or close to linearly with LMUL, even for a small immediate shift amount like "3".
September 25, 2025 at 5:50 PM
* dual-issue vrgather, with good scaling: 0.5/1/8/30
* dual-issue vcompress, with OK scaling: 0.5/3/6/17 (I still think this could get close to linear)
* Fault-only-first loads seem to have no overhead
* Segmented load/stores look quite fast, even the more exotic ones like seg7
September 25, 2025 at 5:50 PM
* Most instructions have an inverse throughput of 0.5/1/2/4 for LMUL=1/2/4/8, even vslide1up/down, 64-bit vmulh, viota, vpopc and integer reductions
* 0.5/0.5/1/2 for vector-scalar/immediate compares and 0.5/1/2/- for narrowing instructions (see "Microarchitecture speculations" section)
September 25, 2025 at 5:49 PM
Third Way is an unfortunate name: en.wikipedia.org/wiki/Third_W...
August 24, 2025 at 3:25 PM
Reposted
So if you are currently involved with ISA-level decisions about inclusion of any pext/pdep-like instructions:

Please consider including SAG/inverse-SAG with bit-reversal of the goats.

No matter which of the two implementation methods you are using: All you need to do is not mask the goat bits.
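A hypothetical Python reference for what a SAG with bit-reversed goats computes (the function name and the exact goat ordering convention are my assumptions; see the edu-sag repo linked below for an actual Verilog reference): sheep (mask=1 bits) pack to the low end in order, goats (mask=0 bits) fill the high end in reversed order, which is the order they can fall out of a butterfly network when you simply don't mask them off.

```python
def sag_rev_goats(x, mask, width):
    # sheep: bits of x where the mask is 1, low to high
    sheep = [(x >> i) & 1 for i in range(width) if (mask >> i) & 1]
    # goats: bits of x where the mask is 0, low to high
    goats = [(x >> i) & 1 for i in range(width) if not (mask >> i) & 1]
    # sheep packed low in order, goats packed high bit-reversed
    bits = sheep + goats[::-1]
    return sum(b << i for i, b in enumerate(bits))
```

With an all-ones mask this is the identity; with an all-zeros mask it degenerates to a full bit reversal.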
July 25, 2025 at 11:30 PM
It looks like the patent expires at the end of 2028. The earliest I could see an RVI extension ratified at this point is 2027, so it's definitely worth evaluating.

Also, the new diagrams are cool.
July 24, 2025 at 6:03 PM
I've only watched the last hour so far, but I quite liked your take on null-terminated strings.
C really has to be understood with its history in mind.
July 20, 2025 at 8:12 PM
Their V2 slides say that they have a macro-op cache equivalent in size to a regular 32 KiB icache.
It can store variable-length entries of up to 48 macro-ops, which can be fused from non-sequential instruction runs by collapsing taken branches.
July 11, 2025 at 8:59 PM
Ohh, the talk recordings are on YouTube: www.youtube.com/watch?v=1lwz...
CBP2025 - Opening Remarks - Rami Sheikh
YouTube video by Rami Sheikh
www.youtube.com
June 28, 2025 at 9:22 AM
Reposted
I wrote a reference implementation for a SAG without bit reflection: github.com/clairexen/ed..., and I wrote a parametric SAG core for any bit width: github.com/clairexen/ed...
edu-sag/param.v at main · clairexen/edu-sag
Educational 8-Bit Sheep-And-Goats (SAG) Verilog Reference IP - clairexen/edu-sag
github.com
June 20, 2025 at 4:04 PM
>>> import numpy as np
>>> lut=np.array([ord('a'),0,ord('e'),0,ord('i'),0,0,ord('o'),0,0,ord('u'),0,0,0,0,0], dtype=np.uint8)
>>> inp=np.frombuffer(b"test128aeiou72761xjs",dtype=np.uint8)
>>> lut[(inp&31)>>1]==inp
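The trick above works because (c & 31) >> 1 happens to be a perfect hash for the lowercase ASCII vowels: it maps a,e,i,o,u to the distinct indices 0,2,4,7,10, all below 16, so a single 16-entry table lookup (e.g. one vrgather on RISC-V V) can test a whole vector of bytes. A scalar sketch of the same idea (names are mine):

```python
# Build the 16-entry table: slot (v & 31) >> 1 holds the vowel itself,
# every other slot stays 0, so non-vowels never compare equal.
LUT = bytearray(16)
for v in b"aeiou":
    LUT[(v & 31) >> 1] = v

def is_vowel(c):
    # (c & 31) >> 1 is always < 16, so the lookup can't go out of range
    return LUT[(c & 31) >> 1] == c
```

Note this is case-sensitive, matching the original snippet: c & 31 folds case, but the final equality check only passes for the lowercase byte stored in the table.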
June 16, 2025 at 2:19 PM
4x 16-bit: 120 µm² 63% utilized, 5GHz met (49 slack)

2x 32-bit: 120 µm² 65% utilized, 5GHz met (52 slack)

1x 64-bit: 153 µm² 64% utilized, 5GHz met (14 slack)

So subsetting on SEW really doesn't make much sense compared to a .vx subset.
June 12, 2025 at 11:46 AM
I got OpenROAD working and tested the bfly part of your implementation (so without decode) in a SIMD setup.

asap7, targeting 5GHz, 75% placement density and 50% utilization:
June 12, 2025 at 11:46 AM