* 6-issue: add/sub/lui/xor/sll/shNadd/zext/clz/cpop/min/rotl/rev8/bext/...
* 3-issue: load/store
* 2-issue: fadd/fmul/fmacc/fmin/fcvt
* 1-issue: mul/mulh/feq/flt
* pipelined: fsqrt/fdiv: ~8.5, div/rem: 12-16
We'll have to wait for the other microbenchmarks to get a more complete picture.
The only significant negative I've seen so far is that the vslideup/vslidedown instructions don't scale linearly, or even close to linearly, with LMUL, even for a small immediate shift amount like 3.
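For reference, vslidedown's semantics are simple enough that the poor LMUL scaling is surprising. A simplified Python model (my own sketch; it ignores masking and the tail policy) shows that each destination element just reads from a fixed offset in the source:

```python
def vslidedown(vs2, offset, vl):
    """Simplified model of RVV vslidedown.vi: result element i is source
    element i + offset; elements past the end are modeled as 0 here
    (the real tail policy depends on vta and is more subtle)."""
    return [vs2[i + offset] if i + offset < len(vs2) else 0 for i in range(vl)]

# With offset 3, each lane reads a fixed nearby source lane,
# which is why one might expect close-to-linear LMUL scaling.
result = vslidedown(list(range(8)), 3, 8)
```

Since every destination element depends on exactly one source element at a constant distance, there is no obvious serial dependency that would prevent near-linear scaling across LMUL groups.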
* dual-issue vcompress, with OK scaling: 0.5/3/6/17 (I still think this could get close to linear)
* Fault-only-first loads seem to have no overhead
* Segmented load/stores look quite fast, even the more exotic ones like seg7
* 0.5/0.5/1/2 for vector-scalar/immediate compares and 0.5/1/2/- for narrowing instructions (see "Microarchitecture speculations" section)
Please consider including SAG/inverse-SAG with bit-reversal of the goats.
Whichever of the two implementation methods you use, all you need to do is not mask out the goat bits.
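To make the suggestion concrete, here is a plain Python reference model of a SAG (sheep-and-goats) operation with bit-reversed goats. The function name and the convention of sheep at the low end / reversed goats at the high end are my own assumptions for illustration, not taken from any particular implementation:

```python
def sag_rev_goats(x: int, mask: int, width: int = 32) -> int:
    """Reference model: pack the 'sheep' bits (mask=1) in original order
    at the low end, and place the 'goat' bits (mask=0) bit-reversed at
    the high end (the ordering a non-masking implementation can yield)."""
    sheep, goats = [], []
    for i in range(width):  # walk bits from LSB to MSB
        bit = (x >> i) & 1
        (sheep if (mask >> i) & 1 else goats).append(bit)
    out = 0
    for pos, bit in enumerate(sheep):            # sheep: in-order, low end
        out |= bit << pos
    for pos, bit in enumerate(reversed(goats)):  # goats: reversed, high end
        out |= bit << (len(sheep) + pos)
    return out
```

With an all-ones mask every bit is a sheep and the input passes through unchanged, which is a handy sanity check for any implementation.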
Also, the new diagrams are cool.
C really has to be understood with its history in mind.
It can store variable-length entries of up to 48 macro-ops, which can be fused from non-sequential instruction runs by collapsing taken branches.
>>> inp=np.frombuffer(b"test128aeiou72761xjs",dtype=np.uint8)
>>> lut[(inp&31)>>1]==inp
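For completeness, here is one plausible construction of the `lut` that the snippet elides (this particular table is my reconstruction): `(c & 31) >> 1` maps the lowercase vowels to the distinct indices 0, 2, 4, 7, and 10, so a 16-entry table with each vowel stored at its own slot makes `lut[(inp & 31) >> 1] == inp` a vowel test:

```python
import numpy as np

# (c & 31) >> 1 sends 'a','e','i','o','u' to the distinct indices
# 0, 2, 4, 7, 10. Store each vowel at its own slot (0 elsewhere);
# then lut[(c & 31) >> 1] == c holds exactly for lowercase vowels,
# since any non-vowel byte indexes either a 0 slot or another
# vowel's slot, neither of which can equal it.
lut = np.zeros(16, dtype=np.uint8)
for v in b"aeiou":
    lut[(v & 31) >> 1] = v

inp = np.frombuffer(b"test128aeiou72761xjs", dtype=np.uint8)
is_vowel = lut[(inp & 31) >> 1] == inp
```

Note this construction only recognizes lowercase vowels; uppercase letters map to the same indices but fail the equality check against the stored lowercase bytes.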
2x 32-bit: 120 u^2 65% utilized, 5GHz met (52 slack)
1x 64-bit: 153 u^2 64% utilized, 5GHz met (14 slack)
So subsetting on SEW really doesn't make much sense compared to a .vx subset.
asap7, targeting 5GHz, 75% placement density and 50% utilization: