camel-cdr's comments | Hacker News


The 1024-bit RVV cores in the K3 are mostly that size to feed a matmul engine. While the vector registers are 1024-bit, the two execution units are only 256-bit wide.

The main cores in the K3 have 256-bit vectors with two 128-bit wide execution units, and two separate 128-bit wide vector load/store units.

See also: https://forum.spacemit.com/uploads/short-url/60aJ8cYNmrFWqHn...

But yes, RVV already has more diverse vector width hardware than SVE.


It's a low-clocked (2.1GHz) dual-issue in-order core, so obviously nowhere near the real-world performance of e.g. Zen5, which can retire multiple 256-bit or even 512-bit vector instructions per cycle at 5+ GHz.

But I find the RVV ISA just really fascinating. Grouping 8 1024-bit registers together gives us 8192-bit or 1-kilobyte registers! That's a tremendous amount of work that can be done using a single instruction.
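
For illustration (a minimal sketch; the function and variable names are mine), here is roughly what an LMUL=8 loop looks like in the C intrinsics. With VLEN=1024 and e8m8, vsetvl can return up to 1024 byte elements, so each iteration moves up to a kilobyte:

    #include <stdint.h>
    #include <riscv_vector.h>

    // memcpy-style loop over a group of 8 vector registers (LMUL=8).
    void copy_m8(uint8_t *dst, const uint8_t *src, size_t n) {
        for (size_t vl; n > 0; n -= vl, src += vl, dst += vl) {
            vl = __riscv_vsetvl_e8m8(n);  // up to 8*VLEN/8 = 1024 elements
            vuint8m8_t v = __riscv_vle8_v_u8m8(src, vl);
            __riscv_vse8_v_u8m8(dst, v, vl);
        }
    }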

Feels like the Lanz Bulldog of CPUs. Not sure how practical it will be after all, but it's certainly interesting.


The problem with SVE is that ARM vendors need to make NEON as fast as possible to stay competitive, so there is little incentive to implement SVE with wider vectors.

Graviton3 has 256-bit SVE vector registers but only four 128-bit SIMD execution units, because NEON needs to be fast.

Intel previously was in such a dominant market position that they could require all performance-critical software to be rewritten thrice.


> SVE was supposed to be the next step for ARM SIMD, but they went all-in on runtime variable width vectors and that paradigm is still really struggling to get any traction on the software side.

You can treat both SVE and RVV as a regular fixed-width SIMD ISA.

"runtime variable width vectors" doesn't capture well how SVE and RVV work. An RVV and SVE implementation has 32 SIMD registers of a single fixed power-of-two size >=128. They also have good predication support (like AVX-512), which allows them to masked of elements after certain point.

If you want to emulate AVX2 with SVE or RVV, you can require that the hardware has a native vector length >=256 and then always mask off everything beyond the first 256 bits, so the same code works on any native vector length >=256.
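
A minimal sketch of that with the RVV C intrinsics (the function name is mine): requesting vl=8 for 32-bit elements pins the loop to 256 bits on any implementation with VLEN>=256:

    #include <stdint.h>
    #include <riscv_vector.h>

    // Fixed 256-bit behavior on scalable hardware: vsetvl returns 8
    // (8 x 32 bits = 256 bits) as long as VLEN >= 256.
    void add_fixed256(uint32_t *dst, const uint32_t *a,
                      const uint32_t *b, size_t n) {
        const size_t vl = __riscv_vsetvl_e32m1(8);
        for (size_t i = 0; i + 8 <= n; i += 8) {
            vuint32m1_t va = __riscv_vle32_v_u32m1(a + i, vl);
            vuint32m1_t vb = __riscv_vle32_v_u32m1(b + i, vl);
            __riscv_vse32_v_u32m1(dst + i, __riscv_vadd_vv_u32m1(va, vb, vl), vl);
        }
    }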


> You can treat both SVE and RVV as a regular fixed-width SIMD ISA.

Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time.

ARM seems to be proposing a C language extension which does require compilers to support variably sized types, but it's not clear to me how the implementation of that is going, and equivalent support in other languages like Rust seems basically non-existent for now.


> Kind of, but the part which looks particularly annoying is that you can't put variable-width vectors on the stack or pass them around as values in most languages, because they aren't equipped to handle types with unknown size at compile time

Yes, you can't, which is annoying, but you can if you compile for a specific vector length.
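
For SVE the toolchains already support this: with -msve-vector-bits=N, the ACLE arm_sve_vector_bits attribute turns the sizeless types into ordinary fixed-size ones (a sketch; the typedef and struct names are made up):

    #include <arm_sve.h>

    // Compile with e.g. -msve-vector-bits=256; the attribute gives the
    // type a known sizeof(), so it can go in structs and arrays.
    typedef svfloat32_t vec256 __attribute__((arm_sve_vector_bits(256)));

    struct particle {
        vec256 pos, vel;  // legal now, unlike plain svfloat32_t
    };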

This is mostly a library structure problem. E.g. simdjson has a generic backend that assumes a fixed vector length. I've written fixed-width RVV support for it. A vector-length-agnostic backend is also possible, but requires writing a full new backend. I'm planning to write it in the future (I already have a few json::minify implementations), but it will be more work. If the generic backend used a SIMD abstraction that supports scalable vectors, like highway, this wouldn't be a problem.

Toolchain support should also be improved, e.g. you could make all vregs take 512 bits on the stack, but have the codegen only utilize the lower 128 bits if you have 128-bit vregs, 256 bits if you have 256-bit vregs, and 512 bits if you have >=512-bit vregs.


> Toolchain support should also be improved, e.g. you could make all vregs take 512 bits on the stack, but have the codegen only utilize the lower 128 bits if you have 128-bit vregs, 256 bits if you have 256-bit vregs, and 512 bits if you have >=512-bit vregs.

SVE theoretically supports hardware up to 2048-bit, so conservatively reserving the worst-case size at compile time would be pretty wasteful. That's 16x overhead in the base case of 128-bit hardware.


Surely you could have compiler types for 128, 256, 512, etc, and then choose the correct codepath with a simple if statement at runtime?

You can definitely put SVE vectors on the stack, there are special instructions to load and store with variable offsets. What you can't do is put them into structs, which need concretely sized types (i.e. each subsequent element needs a known byte offset).
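
To illustrate (a minimal sketch; the names are made up): locals work because the compiler spills them with the VL-scaled load/store instructions, but a struct member would need a compile-time constant offset:

    #include <arm_sve.h>

    float sum(const float *p) {
        svbool_t pg = svptrue_b32();
        svfloat32_t v = svld1_f32(pg, p);  // fine: a local, spilled to the stack as needed
        // struct S { svfloat32_t v; };    // error: sizeless type as member
        return svaddv_f32(pg, v);
    }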

I like this document, but it seems to be written with a very specific implementation in mind.

You can implement both regular SIMD ISAs and scalable SIMD/Vector ISAs in a "Vector processor" style and both in a regular SIMD style.


It _is_ RISC-V Vector extensions, so a very specific ISA in mind at the very least. There's another extension (not ratified I think) called Packed SIMD for RISC-V, but this isn't about that.

The name, yes, but going by name is a bad idea as the V in AVX also stands for Vector. BTW, you'll be disappointed if you think of the P extension as something like SSE/AVX. The target for it is way lower power/perf, like a stripped-down MMX.

My point was about the underlying hardware implementation, specifically:

> "As shown in Figure 1-3, array processors scale performance spatially by replicating processing elements, while vector processors scale performance temporally by streaming data through pipelined functional units"

Applies to the hardware implementation, not the ISA, which is not made clear by the text.

You can implement AVX-512 with a smaller data path than the register width and "scale performance temporally by streaming data through pipelined functional units". Zen4 is a simple example of this, but there is nothing stopping you from implementing AVX-512 on top of heavily temporally pipelined 64-bit wide execution units.

Similarly, you can implement RVV with a smaller data path than VLEN, but you can also implement it as a bog-standard SIMD processor. The only thing that slightly complicates the comparison is LMUL, but it is fundamentally equivalent to unrolling.

The substantial difference between Vector and SIMD ISAs is imo only the existence of a vl-based predication mechanism. Whether a SIMD ISA has a fixed register width or not, i.e. whether it allows you to write vector-length-agnostic code, is an independent dimension of the ISA design.

E.g. the Cray-1 was without a doubt a Vector processor, but the vector registers on all compatible platforms had the exact same length. It did, however, have the mentioned vl-based predication mechanism. You could take AVX10/128, AVX10/256 and AVX10/512, overlap their instruction encodings, and end up with a scalable SIMD ISA, for which you can write vector-length-agnostic code, but that doesn't make it a Vector ISA any more than it was before.


> The name, yes, but going by name is a bad idea as the V in AVX also stands for Vector.

Now I get your point after reading more of the linked page. Yes. It is very implementation specific.

One of the things about RVV (and in general any vector ISA) is that the data path can be different enough between different implementations that specific rules of thumb for hand tuning most probably won't carry over. As you say, this is true of even sufficiently advanced SIMD architectures like AVX.


Stripped down MMX? What's left then I wonder? :-D

That was a bit overblown, due to my lack of knowledge about MMX. It has a lot more things than MMX. But the core idea behind the P extension was to reuse the GPRs to do SIMD operations with little additional implementation cost.

The spec is currently all over the place; the best reference is probably the WIP intrinsics documentation: https://github.com/topperc/p-ext-intrinsics/blob/main/source...

P is not meant to compete/be an alternative for RVV. It's meant for hardware targets you can't scale RVV down to.
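
As a rough illustration of the idea (hand-rolled SWAR in plain C, a stand-in rather than actual P-extension code): pack several lanes into one 64-bit GPR and keep carries from crossing lane boundaries, which P would do in hardware:

    #include <stdint.h>

    // Four 16-bit lanes in one 64-bit GPR, added without carries
    // leaking between lanes.
    uint64_t add16x4(uint64_t a, uint64_t b) {
        uint64_t lo = (a & 0x7FFF7FFF7FFF7FFFull) + (b & 0x7FFF7FFF7FFF7FFFull);
        return lo ^ ((a ^ b) & 0x8000800080008000ull);
    }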


> But the core idea behind the P extension was to reuse the GPRs to do SIMD operations with little additional implementation cost.

I think ARMv6 had something similar, before they went with proper SIMD in v7.


As sibling said, stripped down in the sense it doesn’t have dedicated registers. In terms of supported functions it’s somewhere close to MMX.

I don’t personally like it because it still ends up with all the headache of building most of a vector subsystem (data path, functional units, …) while _only_ really saving the one dedicated vector register file.


No, the 2.5GHz are for SF4X. Atlantis is on TSMC 12nm and (as I learned yesterday) will run at about 1.5GHz: https://cdn.discordapp.com/attachments/1061659786023813170/1...

So Ascalon should have M1 IPC, at half the frequency.


It really doesn't matter much. The Titan and K3 are Core 2 performance, the K1 and JH7110 are more like Pentium III.

A 1.5 GHz Ascalon is still going to be ... I don't know ... Skylake level? More than enough for a usable modern desktop machine and a huge leap over even machines we'll start to have delivered 3 or 4 months from now.

Hopefully it will be affordable. As in Megrez or Titan prices, not Pioneer.


The K3 is launched now.

Single core performance is about what you say. But multi-core performance is much better. The K3 scores higher than a 2017 MacBook Air for multi-core on Geekbench 6.

And the K3 can take 32 GB of DDR5 and run a decent-sized LLM, which is not something you are doing on a 5-10 year old laptop. In addition to the vector instructions, the built-in video codec acceleration and hypervisor stuff make for quite a modern feature-set.

The K3 is still too slow to be a desktop system for most people but there are some of us who would already be ok with it.

As for pricing, it is hard to find info. But it seems like around $200 may be possible for the Jupiter2.

https://milkv.io/jupiter2

The Framework 13 K3 mainboard will be more:

https://deepcomputing.io/dc-roma-risc-v-mainboard-iii-unveil...


Yes, I've been using a K3 for a few weeks now. It's quite pleasant, and if I use all 16 cores (8x X100 and 8x A100) then it builds a Linux kernel almost 3x faster than my one year old Milk-V Megrez and almost 5x faster than K1.

    14m25.56s  SpacemiT K3, 8 X100 cores + 8 A100 cores
    16m55.637s SpacemiT K3, 8 X100 cores @2.4 GHz
    19m12.787s i9-13900HX, 24C/32T @5.4 GHz, riscv64/ubuntu docker
    39m23.187s SpacemiT K3, 8 A100 cores @2.0 GHz
    42m12.414s Milk-V Megrez, 4 P550 cores @1.8 GHz
    67m35.189s VisionFive 2, 4 U74 cores @1.5 GHz
    70m57.001s LicheePi 3A, 8 X60 cores @1.6 GHz
It's also great that it's now faster than a recent high-end x86 with a lot of cores running QEMU.

Note that the all-cores K3 result is running a distccd on each cluster, which adds quite a bit of overhead compared to a simple `make` on local cores. All the same it shaves 2.5 minutes off. In theory, doing the Amdahl calculation on the X100 and A100 times, it might be possible to get close to 11m50s with a more efficient means of using heterogeneous cores, but distcc was easy to do.
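
(For reference, that estimate treats the two clusters as independent workers running in parallel: 1/(1/16.93 + 1/39.39) ≈ 11.84 min, i.e. about 11m50s.)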

RISC-V SBC single-core performance has been better than x86+QEMU since the VisionFive 2 (or HiFive Unmatched) but we didn't have enough cores unless you spent $2500 for a Pioneer.


>BXM-4-64

Is that among the few known to work with open pvr drivers?


can you share the code?



> The Tenstorrent Ascalon is supposed to be as fast as AMD Ryzen 5

Don't set yourself up for disappointment. Ascalon is supposed to match Zen5 performance per clock, but at 2.5GHz, so it will still be at a minimum 2x slower.

Additionally, the announced Ascalon devboard is supposed to be on an older node and have an even lower frequency because of that. (The 2.5GHz were on SF4X; the devboard may be on something like 12nm.)


You are right to be cautious.

Ryzen 5 is not Zen 5. So I am not predicting Zen 5 level performance for Ascalon.

I am expecting Zen 3 level performance which is to say about as fast as laptops from 2017 to 2020 or so. That is better than what I am typing on now.

So, not crushing Apple Silicon just yet but "usable" for the first time. Instead of "there are no RISC-V chips as fast as a Raspberry Pi", it will be "Intel is still faster". It may not even be that ARM is faster anymore. It will be more of a chip by chip comparison. At least people will have to admit that it is a race.

I am not looking for RISC-V to be "best in the world" in 2026. Rather, I want to stop hearing that it will never get there. After Ascalon, you will not be able to make the blanket statement that RISC-V is not good enough. It will be good enough in some markets and not in others. It will have a seat at the table.

And I want to be able to use RISC-V. Ascalon brings RISC-V into "good enough for me" territory.

And RISC-V will only get better. It is getting better faster than other chips are. My thesis is that this will continue (though that is certainly a bold prediction).

Even just looking at Tenstorrent, Babylon is not far behind Ascalon. And there is SiFive. And there is Andes. And there is SpaceMIT. And there is Alibaba. And there is Qualcomm. And there are companies I do not know about yet. And there are nation-states. There is a pretty big tidal wave headed for ARM (and maybe even AMD/Intel).

First they laugh at you. And then you win.


The SpacemiT K3 with 8 SpacemiT X100 RVA23 cores, which are faster than Pi4 but slower than Pi5, should be available in a couple of months:

geekbench: https://browser.geekbench.com/v6/cpu/16145076

rvv-bench: https://camel-cdr.github.io/rvv-bench-results/spacemit_x100/...

There are also 8 additional SpacemiT-A100 cores with 1024-bit wide vectors, which are more like an additional accelerator for number crunching.

The Milk-V Titan has slightly faster scalar performance than the K3.


> faster than Pi4 but slower than Pi5

It may actually be faster than a Pi5.

The benchmark is well tuned for ARM64 but not so well adapted to RISC-V, especially the vector extensions.

You may still be right of course. The SpaceMIT K3 is exciting because it may still be the first RVA23 hardware, but it is not exactly going to launch a RISC-V laptop industry.


There isn't much to tune in some of them, e.g. the clang benchmark. We know that many of the benchmarks already have RVV support (compare BPI-F3 results between versions) and three are still missing RVV support. I think the optimized score would be in the 500s, but that's still a lot lower than the Pi5.


> The Milk-V Titan has slightly faster scalar performance, than the K3.

So the main difference between this Milk-V Titan and the upcoming SpacemiT K3 is that the latter has better vector performance?


The Titan has no SIMD/Vector support at all, so it doesn't support RVA23.


The K3 is able to run RVA23 code, the Titan is not; it lacks V.

It matters, as the ecosystem settled on RVA23 as the baseline for application processors.


Well, today it is only Ubuntu 25.10 and newer that require RVA23. Almost everything else will run on plain old RV64GC which this board handles no problem.

But you are correct that once RVA23 chips begin to appear, everybody will move to it quite quickly.

RVA23 provides essentially the same feature-set as ARM64 or x86-64v4 including both virtualization and vector capabilities. In other words, RVA23 is the first RISC-V profile to match what modern applications and workflows require.

The good news is that I expect this to remain the minimum profile for quite a long time. Even once RVA30 and future profiles appear, there may not be much pressure for things like Linux distributions to drop support for RVA23. This is a lot like the modern x86-64 space where almost all Linux distributions work just fine on x86-64 v1 even though there are now v2, v3, and v4 available as well. You can run the latest edition of Arch Linux on hardware from 2005. It is hard to predict the future but it would not surprise me if Ubuntu 30.04 LTS ran just fine on RISC-V hardware released later this year.

But ya, anything before RVA23, like the RVA22 Titan we are discussing here, will be stuck forever on older distros or custom builds (like Ubuntu 25.04).

