I was expecting something like TensorRT or Triton, but found "Vibe Coding" The p...

germanjoey · 2025-05-14T22:17:12 1747261032

TBH, the 2x-4x improvement over a naive implementation that they're bragging about sounded kinda pathetic to me! I mean, it depends greatly on the kernel itself and the target arch, but I'm also assuming that the 2x-4x number is their best case scenario. Whereas the best case for hand-optimized could be in the tens or even hundreds of X.

godelski · 2025-05-15T03:58:28 1747281508

I'm a bit confused. It sounds like you are disagreeing ("TBH") but the content seems like a summary of my comment. So, I agree.

Fwiw, they did say they got up to 20x improvement but given the issues we both mention this may not be surprising given that this seems to be an outlier by their own omission.

jaberjaber23 · 2025-05-15T08:36:44 1747298204

absolutely. it really depends on the kernel type, target architecture, and what you're optimizing for. the 2x-4x isn’t the limit, it's just what users often see out of the box. we do real-time profiling on actual GPUs, so you get results based on real performance on a specific arch, not guesses. when the baseline is rough, we’ve seen well over 10x

jaberjaber23 · 2025-05-15T06:52:22 1747291942

totally agree. we're not trying to replace deep CUDA knowledge:) just wanted to skip the constant guess and check.

every time we generate a kernel, we profile it on real GPUs (serverless) so you see how it runs on specific architectures. not just "trust the code" we show you what it does. still early, but it’s helping people move faster

godelski · 2025-05-15T07:25:18 1747293918

Btw, I'm not talking deep CUDA knowledge. That takes years. I'm specifically talking about novices. The knowledge you get from a few weeks. I'd be quite hesitant to call someone an expert in a topic when they have less than a few years of experience. There's exceptions but expertise isn't quickly gained. Hell, you could have years of experience but if all you did is read medium blogs and stack overflow you'd probably still be a novice.

I get that you profile. I liked that part. But even as the other commenter says, it's unclear how to evaluate given the examples. Showing some real examples would be critical to sell people on this. Idk, maybe people blindly buy too but personally I'd be worried about integrating significant tech debt. It's easy to do that with kernels or anytime you're close to the metal. The nuances dominate these domains

jaberjaber23 · 2025-05-15T15:59:32 1747324772

Do you have a place where we can chat? Linkedin,....

godelski · 2025-05-15T20:04:38 1747339478

Sorry, I'm not the CUDA expert you should be looking for. My efforts are in ML and I only dabble in CUDA and am friends with systems people. I'd suggest reaching out to system people.

I'd suggest you use that NVIDIA connection and reach out to the HPC teams there. Anyone working on CUTLASS, TensorRT, cuTensor, or maybe event the CuPy team could give you a lot better advice than me.

jaberjaber23 · 2025-05-15T21:23:52 1747344232

I really appreciate that!! thanks:D