an Out-of-order Vector
Processor" slides are public.
static.sched.com/hosted_files...
1. With fixed-size buffers, ASan won't catch everything.
2. VLAs are faster than malloc; in my case I get 15% faster fuzzing.
If VLAs aren't portable enough, just check __STDC_NO_VLA__ and select between the other options.
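A minimal sketch of that selection, assuming a fuzz harness that copies each input into an exactly-sized scratch buffer before running the code under test. The names `sum_bytes` and `run_one_input` are hypothetical and only stand in for the real target and harness; the sketch also assumes inputs are small enough to fit on the stack.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical function under test: sums the bytes of an input. */
static unsigned sum_bytes(const uint8_t *p, size_t n)
{
    unsigned s = 0;
    for (size_t i = 0; i < n; ++i)
        s += p[i];
    return s;
}

/* Hypothetical harness step: the scratch buffer has exactly `len` bytes
   (at least 1), so an out-of-bounds access falls outside the buffer
   instead of into unused slack of a larger fixed-size array. */
unsigned run_one_input(const uint8_t *data, size_t len)
{
    size_t n = len ? len : 1; /* avoid a zero-length VLA or malloc(0) */
#ifndef __STDC_NO_VLA__
    uint8_t buf[n];           /* exact-size stack buffer (VLA) */
    if (len)
        memcpy(buf, data, len);
    return sum_bytes(buf, len);
#else
    uint8_t *buf = malloc(n); /* fallback: exact-size heap buffer */
    if (!buf)
        abort();
    if (len)
        memcpy(buf, data, len);
    unsigned r = sum_bytes(buf, len);
    free(buf);
    return r;
#endif
}
```

Both branches give the same result; only the allocation strategy differs, so the harness can be built either way without touching the code under test.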
camel-cdr.github.io/rvv-bench-re...
Overall, the results look really good so far:
Please consider including SAG/inverse-SAG with bit-reversal of the goats.
No matter which of the two implementation methods you are using: All you need to do is not mask the goat bits.
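For readers unfamiliar with the operation, here is a naive bit-by-bit reference sketch of what such a variant could compute. The convention is an assumption, not taken from the post or any spec: bits of `x` selected by the mask (the "sheep") are packed into the low end in order, and the remaining bits (the "goats") fill the high end, either in order or bit-reversed. The function name `sag_ref` is hypothetical, and this is not one of the two implementation methods the post refers to.

```c
#include <stdint.h>

/* Naive reference, assumed convention:
   - bits of x where m is 1 ("sheep") go to the low end, order preserved
   - bits of x where m is 0 ("goats") fill the remaining high bits,
     either order preserved or bit-reversed (reverse_goats != 0). */
uint32_t sag_ref(uint32_t x, uint32_t m, int reverse_goats)
{
    int nsheep = 0;
    for (int i = 0; i < 32; ++i)
        nsheep += (m >> i) & 1;

    uint32_t r = 0;
    int sheep_pos = 0;      /* next sheep slot, counting up from bit 0 */
    int goat_up = nsheep;   /* next goat slot when goats keep their order */
    int goat_down = 31;     /* next goat slot when goats are reversed */

    for (int i = 0; i < 32; ++i) {
        uint32_t bit = (x >> i) & 1;
        if ((m >> i) & 1)
            r |= bit << sheep_pos++;
        else if (reverse_goats)
            r |= bit << goat_down--;
        else
            r |= bit << goat_up++;
    }
    return r;
}
```

Under this assumed convention, `sag_ref(0xA5, 0x0F, 0)` leaves the value unchanged at `0xA5`, while `sag_ref(0xA5, 0x0F, 1)` yields `0x50000005`, because the goats end up mirrored into the top bits.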
Ventana's Veyron V2/V3 seem to also use something like a trace cache.
Civil was so nice to run my RVV benchmark on the SiFive X280 cores on the Tenstorrent Blackhole.
If you use svc with inline assembly, you have to explicitly clobber SVE registers.
Good luck doing this back in 2015 when you wrote
Since you were deeply involved in the development of the bitmanip spec, I was wondering if you could answer some questions about your bextdep implementation.
> When source and destination registers overlap and have different EEW, the instruction is mask- and tail-agnostic, regardless of the setting of the vta and vma bits in vtype.
"Efficient Architecture for RISC-V Vector Memory Access" -- arxiv.org/abs/2504.08334
I love how these two were released so close to each other.
Here is an SVE vs NEON benchmark with 1/2/4x unrolled for each: godbolt.org/z/47T9oaf97
SVE is faster across the board, with the fastest SVE version being about 30% faster than the fastest NEON version.
Today's article is on Alibaba/T-Head's Xuantie C910 core, which has in part been open-sourced and is T-Head's first out-of-order core.
Hope y'all enjoy!
open.substack.com/pub/chipsand...
old.chipsandcheese.com/2025/02/03/a...
gist.github.com/camel-cdr/d1...
I got inspired yesterday, after I saw the article "When Greedy Algorithms Can Be Faster" (16bpp.net/blog/post/wh...)
It ended up about 2x faster than the simple rejection sampling.
(code on master branch also handles quotes)
The segmented scan is done in scalar/SWP to avoid inter-lane overheads.
I've tested it, but not benchmarked it yet.
If anyone has a good AVX512 solution, please share.
Inspired by this reddit post: www.reddit.com/r/simd/comme...
#RVV #RISC-V #SIMD
www.youtube.com/watch?v=r_pP... (55:00)
It's probably too early for a definite answer. But as I've designed SIMD image processing algorithms, I'll share a few results.
New articles feed: camel-cdr.github.io/rvv-bench-re...
Benchmark updates feed: camel-cdr.github.io/rvv-bench-re...