> Sure, if you use X3D chip Ah, sorry, lscpu shows: L3: 64 MiB (2 instances) I o...

zamadatix · on Aug 15, 2024

Say it were a 6 disk pool and you add an object to a database (with the goal of doing many of these as fast as possible with fsync to the disks):

- Receive the new data

- Read the multiple disks to get the current stripe(s) associated with it.

- Calculate the new parity

- Issue the multiple writes

- Wait for completion, clear that from RAM

Looking at a single write it doesn't seem so bad. You take something like ~128k in from the disks per stripe (which will arrive it ever so slightly different times and be held as that thread stalls before the calc), issue a bunch of writes, wait for that to clear while the result remains in memory (cache or RAM), then you're good to clear it out and that thread/coroutine task can process the next one. "Just" 3 GB/s is ~23,000/s of that - doing those multiple reads into RAM, parity writes into RAM (well, unless you can stick it all in massive L3 by keeping queue depths low), and caching until spat out on to the drive. On a normal non-parity setup you just have your data to be written sit and go to disk, no intermediate reads/writes.

This may not make sense on a home box but consider the approach more an alternative to solutions like https://www.graidtech.com/product/sr-1000/ which are single cards that can get a million RAIDed IOPS written at near 100 GB/s in a single PCIe slot alone with no additional load to the CPU. Just writing 100 GB/s takes a CPU core and most of the RAM bandwidth from a raw data creation/parsing perspective before talking about writing it to disk at all, it's a different problem than e.g. what the bandwidth looks like on a home NAS pool. This type of approach tries to do something similar without the extra device in-between the cards and the server.

Sometimes you also want to take the above approach and scale it out over many 100G/400G ethernet ports so your flash storage pools are reachable over network separate from compute nodes. Here the goal is to make that storage solution as dense, fast, and efficient as possible where you might want to load as much possible storage as you can on a single node until it saturates the bandwidth to the CPU. If you can do that without doubling back data to the CPU you can scale it that much better.