
I like this writeup, as it mirrors my own journey optimizing some CUDA code I wrote for an LHC experiment trigger. But I have a few comments on some details.

There are 65536 registers per SM, not per thread block. You can indirectly control how many your block gets by making the block occupy the whole SM, but this presents its own problems.

NVIDIA hardware limits the maximum number of threads to 1024 per block (2048 per SM) and shared memory to 48 KB (64 KB) per SM. So if you consume all of that, or close to it, in one thread block, you are running one thread block per SM. You usually don't want to do that, because it lowers your occupancy.

Additionally, if the kernel you're running is not compute-bound and does not need all the registers or shared memory allocated to it, having fewer blocks on the SM can leave some compute resources idle. GPUs are designed to thrive on parallelism, and limiting the number of active blocks can cause underutilization of the SM's cores, leading to poor performance.

Finally, if each thread block occupies an entire SM, you limit the scalability of your kernel to the number of SMs on the GPU. For example, if your GPU has 60 SMs and each block uses one SM, you can only run 60 blocks in parallel, even if the problem you're solving could benefit from more parallelism. This can reduce the efficiency of the GPU for very large problem sizes.
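As a rough illustration (a minimal sketch; the kernel and block size here are made up), the occupancy API will tell you how many blocks of a given size can actually be resident on one SM:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data) { /* ... */ }

    int main() {
        int blocksPerSM = 0;
        int blockSize = 1024;  // a maximal block
        // How many blocks of this size fit on one SM, given the
        // kernel's register and shared memory footprint?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, myKernel, blockSize, /*dynamicSMemSize=*/0);
        printf("resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }

A result of 1 here is exactly the situation described above: the whole SM is tied up by a single block.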



For devices with compute capability 7.0 or greater (anything from the Volta series on), a single thread block can address up to the entire shared memory size of the SM; the 48 kB limit that older hardware had is no more (though allocations beyond 48 kB require an explicit per-kernel opt-in). Most contemporary applications are going to be running on hardware that doesn't have the shared memory limit you mentioned.
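For reference, the opt-in looks something like this (a sketch; the kernel name and the 96 KB figure are illustrative, and the exact per-SM capacity varies by architecture):

    __global__ void bigSmemKernel(float *data) {
        extern __shared__ float tile[];  // dynamic shared memory
        // ...
    }

    void launch(float *data) {
        int smemBytes = 96 * 1024;  // beyond the default 48 KB ceiling
        // Opt in to the larger dynamic shared memory carve-out (cc 7.0+).
        cudaFuncSetAttribute(bigSmemKernel,
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             smemBytes);
        bigSmemKernel<<<60, 256, smemBytes>>>(data);
    }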

The claim at the end of your post, suggesting that >1 block per SM is always better than 1 block per SM, isn’t strictly true either. In the example you gave, you’re limited to 60 blocks because the thread count of each block is too high. You could, for example, cut the blocks in half to yield 120 blocks. But each block has half as many threads in it, so you don’t automatically get any occupancy benefit by doing so.

When planning out the geometry of a CUDA thread grid, there are inherent tradeoffs between SM thread and/or warp scheduler limits, shared memory usage, register usage, and overall SM count, and those tradeoffs can be counterintuitive if you follow (admittedly, NVIDIA’s official) guidance that maximizing the thread count leads to optimal performance.
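One concrete knob for those tradeoffs is __launch_bounds__, which tells the compiler what geometry to budget registers for (a sketch; the numbers are arbitrary):

    // Promise at most 256 threads per block and request at least 2
    // resident blocks per SM; ptxas then caps per-thread register use,
    // trading potential spills for higher occupancy.
    __global__ void __launch_bounds__(256, 2)
    tunedKernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }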


Good points, though I agree with sibling that higher occupancy is not the goal; higher performance is the goal. Since registers are such a precious resource, you often want to set your block size and occupancy to whatever is best for keeping active state in registers. If you push the occupancy higher, the compiler might be forced to spill registers to VRAM, and that will just slow everything down even though the occupancy goes up.

Another thing to maybe mention, re: "if your GPU has 60 SMs, and each block uses one SM, you can only run 60 blocks in parallel"... CUDA tends to want at least 3 or 4 blocks per SM so it can round-robin them as soon as one stalls on a memory load or a sync or something else. You might only make forward progress on 60 separate blocks in any given cycle, but it's quite important that you have, say, 240 blocks resident "in parallel", so you can benefit from latency hiding. This is where a lot of additional performance comes from: doing work on one block while another is momentarily stuck.
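A common way to set that up (a sketch; the 4x factor is a heuristic, not a rule) is to size the grid off the SM count and let a grid-stride loop cover the data:

    __global__ void scale(float *x, int n) {
        // Grid-stride loop: each block walks many tiles, so the scheduler
        // always has other resident blocks to swap in when one stalls.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += gridDim.x * blockDim.x)
            x[i] *= 2.0f;
    }

    void launch(float *x, int n) {
        int dev = 0, numSMs = 0;
        cudaGetDevice(&dev);
        cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
        scale<<<4 * numSMs, 256>>>(x, n);  // ~4 blocks per SM for latency hiding
    }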


Is this really true in general? I'd expect it to be true for highly homogeneous blocks, but I'd also expect kernels where the warps are "desynced" in memory operations to do just fine without having 3-4 blocks per SM.


Oh I think so, but I’m certainly not the most expert of CUDA users there is. ;) Still, you will often see CUDA try to alloc local and smem space for at least 3 blocks per SM when you configure a kernel. That can’t possibly always be true, but is for kernels that are using modest amounts of smem, lmem, and registers. In general I’d say desynced mem ops are harder to make performant than highly homogeneous workloads, since those are more likely to be uncoalesced as well as cache misses. Think about it this way: a kernel can stall for many many reasons (which Nsight Compute can show you), especially memory IO, but even for compute bound work, the math pipes can fill, the instruction cache can miss, some instructions have higher latency than others, etc. etc. Even a cache hit load can take dozens of cycles to actually fill. Because stalls are everywhere, these machines are specifically designed to juggle multiple blocks and always look for ways to make forward progress on something without having to sit idle, that is how to get higher throughput and hide latency.


Well, yes, but "desynced" warps don't use shared memory - because writes to it require some synchronization for other warps to be able to read the information.


Why would that be true? Certainly there are algorithms (or portions of them) in which warps can just read whichever values exist in shared mem at the time, no need to sync. And I think we were mostly talking about global memory?


I don’t think it’s possible to use shared memory without syncing, and I don’t think there are any algorithms for that. I think shared memory generally doesn’t have values that exist before the warps in a block get there. If you want to use it, you usually (always?) have to write to smem during the same kernel you read from smem, and use synchronization primitives to ensure correct order.

There might be such a thing as cooperative kernels that communicate through smem, but you’d definitely need syncs for that. I don’t know if pre-populating smem is a thing that exists, but if it does then you’ll need kernel level or device level sync, and furthermore you’d be limited to 1 thread per CUDA core. I’m not sure either of those things actually exist, I’m just hedging, but if so they sound complicated and rare. Anyway, the point is that I think if we’re talking about shared memory, it’s safe to assume there must be some synchronizing.
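(For the record, cooperative kernels are a real thing: cooperative groups, launched via cudaLaunchCooperativeKernel, give you a grid-wide barrier. A minimal sketch, assuming a device that supports cooperative launch and compilation with -rdc=true:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void twoPhase(float *data, float *out, int n) {
        cg::grid_group grid = cg::this_grid();
        unsigned long long i = grid.thread_rank();
        if (i < (unsigned long long)n)
            data[i] += 1.0f;  // phase 1: every block writes
        grid.sync();          // device-wide barrier; needs cooperative launch
        if (i < (unsigned long long)n)
            out[i] = data[i] + data[(i + 1) % n];  // phase 2: may read another block's write
    }

And as you say, the whole grid has to be co-resident on the device for this to work, which is exactly the kind of constraint being discussed.)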

I also assumed by "desynced" you meant threads would be doing scattered random access memory reads, since the alternative offered was homogeneous workloads. That's why I assumed memory perf might be low or limiting due to low cache hit rates and/or low coalescing. In the case of shared memory, even with syncs, random access reads might lead to heavy bank conflicts. If what you meant is a workload with a very ordered access pattern that simply doesn't need any synchronization, then there's no problem and perf can be quite good. In any case, it's a good idea to minimize memory access and strive to be compute bound instead of memory bound. Memory tends to be the bottleneck most of the time. I've only seen truly optimized and compute bound kernels a small handful of times.


There is no guarantee of the order in which actions take effect. E.g., warp 1 writes to some shared memory address and warp 2 reads from that address: how can you guarantee the write happens before the read?
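The standard answer is a block-level barrier (a minimal sketch; assumes a 256-thread block):

    __global__ void reverseInBlock(const float *in, float *out) {
        __shared__ float tile[256];
        int t = threadIdx.x;
        int base = blockIdx.x * blockDim.x;
        tile[t] = in[base + t];  // each warp writes its own slot
        __syncthreads();         // all writes visible before any cross-warp read
        out[base + t] = tile[blockDim.x - 1 - t];  // read a slot another warp wrote
    }

Without the __syncthreads(), warp 2 could read tile entries that warp 1 hasn't written yet.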


Aiming for higher occupancy is not always the right solution; what frequently matters more is avoiding global memory latencies by retaining more data in registers and/or shared memory. This was first noted in 2010 and is still true today:

https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pd...

I would also think in terms of latency hiding rather than just work parallelism (though latency hiding on GPUs comes largely from parallelism). This is the reason GPUs have massive register files: unlike modern multi-core CPUs, they omit latency-reducing hardware (e.g., speculative execution, large caches, out-of-order execution and register renaming), so to fill the pipelines we need many instructions outstanding, which means the operands for those pending instructions need to stay around for a lot longer, hence the massive register file.
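In kernel form, the idea looks something like this (a sketch of Volkov-style ILP; each thread keeps several independent accumulators in registers so several loads can be in flight at once):

    __global__ void sum4(const float *in, float *out, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        // Four independent accumulators -> four outstanding loads per
        // thread, hiding latency without needing more warps.
        float a0 = 0, a1 = 0, a2 = 0, a3 = 0;
        int i = tid;
        for (; i + 3 * stride < n; i += 4 * stride) {
            a0 += in[i];
            a1 += in[i + stride];
            a2 += in[i + 2 * stride];
            a3 += in[i + 3 * stride];
        }
        for (; i < n; i += stride) a0 += in[i];  // remainder
        out[tid] = a0 + a1 + a2 + a3;  // per-thread partial sum
    }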


I agree that optimizing for lower occupancy can yield significant performance gains in specific cases, especially when memory latencies are the primary bottleneck. Leveraging ILP and keeping more data in registers can indeed reduce the need for higher occupancy and lead to more efficient kernels; the examples in the GTC 2010 talk highlighted that quite well. However, I would argue that occupancy still plays an important role, especially for scalability and general-purpose optimization. Over-relying on low occupancy and fewer threads, while beneficial in certain contexts, has its limits.

The first thing to consider is register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted, which drastically reduces performance. This becomes more pronounced as problem sizes scale up (the talk's examples avoid that problem). Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU's resources. Focusing too much on minimizing thread counts can lead to underutilization of the SM's parallel execution units. A standard example is inference engines.
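To make the spilling concrete (an illustrative kernel; whether it actually spills depends on the architecture and compiler, so check the ptxas output):

    // Compile with: nvcc -Xptxas -v spill.cu
    // ptxas then reports per-kernel register counts and, when state no
    // longer fits, lines like "N bytes spill stores, M bytes spill loads".
    __global__ void spillProne(const float *in, float *out, int n) {
        float acc[64];  // large per-thread array; may be demoted to local memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        for (int k = 0; k < 64; ++k)
            acc[k] = (i + k < n) ? in[i + k] : 0.0f;
        float s = 0.0f;
        for (int k = 0; k < 64; ++k)
            s += acc[k] * acc[k];
        if (i < n) out[i] = s;
    }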

Also, while low-occupancy optimizations can be effective for specific workloads (e.g., memory-bound kernels), designing code that depends on such strategies as a general practice can result in less adaptable and less robust solutions across a wide variety of applications.

I believe there is a balance to strike here. Low occupancy can work for specific cases; higher occupancy often provides better scalability and overall performance for more general use cases. But you have to test for that while you are optimizing your code. There is no general rule of thumb to follow here.


> The first thing to consider is the register pressure. Increasing the number of registers per thread to optimize for ILP can lead to register spilling when the register file is exhausted

Kernels should almost never use local memory (except in arcane cases where you are using recursion and thus a call stack that will spill where an alternative non-recursive formulation would not really work).

> Many real-world applications, especially compute-bound kernels, need high occupancy to fully utilize the GPU’s resources

> while low-occupancy optimizations can be effective for specific workloads (e.g, memory-bound kernels)

I think this is almost exactly backwards: performant high-compute-intensity kernels (on a (fl)op/byte of memory traffic basis) tend to uniformly have low occupancy; look at an ncu trace of many kernels in cuBLAS or cuDNN, for instance. You need a large working set of arguments in registers or in smem to feed the scalar arithmetic or especially the MMA units quickly enough, as gmem/L2 bandwidth alone is not sufficient to achieve peak performance in many cases. The only thing you need to do is ensure that you are using all SMs (and thus all available scalar arithmetic or MMA units), which does not by itself imply high occupancy (e.g., a kernel that has 1 CTA per SM).

The simplest way to write a memory-bound kernel is to simply spawn a bunch of threads and perform loads/stores from them, and it isn't too hard to achieve close to peak this way. But even then, depending upon the warp scheduler to rotate other warps in to issue more loads/stores is inferior to unrolling loops, and you can also get close to peak memory bandwidth using not that many SMs through such unrolling, so even these kernels need not have high occupancy.
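Something like this is the kind of unrolling being described (a sketch; float4 loads move 16 bytes per instruction, and the unroll keeps several independent loads in flight per thread):

    __global__ void copyUnrolled(const float4 *in, float4 *out, int n4) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        // Each unrolled iteration issues an independent 16-byte load, so
        // one warp alone can keep many memory transactions outstanding.
        #pragma unroll 4
        for (; i < n4; i += stride)
            out[i] = in[i];
    }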

(I've been Nvidia GPU programming for around 11 years and wrote the original pytorch GPU backend/tensor library, the Faiss GPU library, and contributed some stuff to cuDNN in its early days such as FFT convolution.)



