
I don't think this is correct. For inference, the bottleneck is memory bandwidth, so if you can hook up an FPGA with better memory, it has an outside shot at beating GPUs, at least in the short term.
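Rough back-of-envelope for why single-stream decode ends up bandwidth-bound (the model size and bandwidth numbers below are illustrative assumptions, not measurements from this thread):

    # Roofline-style estimate for batch-1 decode.
    weights_gb = 70          # assumed: ~70B params stored at ~1 byte/param (fp8)
    hbm_bw_gb_s = 4800       # assumed: H200-class HBM bandwidth, ~4.8 TB/s

    # Every generated token has to stream essentially all of the weights
    # from memory, so bandwidth caps tokens/sec no matter how much compute idles.
    max_tokens_per_s = hbm_bw_gb_s / weights_gb
    print(f"decode upper bound: ~{max_tokens_per_s:.0f} tok/s per stream")
    # -> ~69 tok/s at batch 1; batching amortizes the weight reads and shifts the bottleneck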

I mean, I have worked with FPGAs that outperform H200s in Llama3-class models a while and a half ago.



Show me a single FPGA that can outperform a B200 at matrix multiplication (or even come close) at any usable precision.

A B200 can do roughly 10 peta-ops at fp8, theoretically.

I do agree memory bandwidth is also a problem for most FPGA setups, but Xilinx ships HBM on some SKUs and those parts are not competitive at inference as far as I know.


Said GPUs spend half the time just waiting for memory.


Yep, but they are still 50x faster than any FPGA.


probably not B200 level but better than you might expect:

https://www.positron.ai/

i believe a B200 is ~3x the H200 at llama-3, so that puts the FPGAs at around 60% the speed of B200s?
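rough math behind that 60% guess (the FPGA-vs-H200 ratio here is my own hypothetical reading of the published comparison, not a stated number):

    fpga_vs_h200 = 1.8   # hypothetical: what the vendor's H200 comparison would imply
    b200_vs_h200 = 3.0   # the ~3x figure above
    print(f"FPGA at ~{fpga_vs_h200 / b200_vs_h200:.0%} of a B200")   # -> ~60%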


I wouldn't trust any benchmarks on the vendor's site. Microsoft went down this path for years with FPGAs and wrote off the entire effort.


ok? i worked on those devices, those numbers are real. there's a reason why they compare to h200 and not b200.

> I have worked with FPGAs that outperform H200s in Llama3-class models a while and a half ago


I'd like to know more. I expect these systems are 8x VH1782. Is that true? What's the theoretical math throughput? My expectation is that it isn't very high per chip. And how is performance in the prefill stage, when inference actually is math-limited?
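To spell out why I'm asking about prefill: that's the phase where arithmetic intensity is high enough to be compute-bound rather than bandwidth-bound. An idealized sketch (assumed 70B fp8 model, ignoring KV-cache traffic and attention FLOPs):

    params = 70e9            # assumed model size
    bytes_per_param = 1      # fp8 weights

    def flops_per_byte(tokens_per_pass):
        # ~2 FLOPs per parameter per token; weights are read once per pass
        return (2 * params * tokens_per_pass) / (params * bytes_per_param)

    print(f"decode  (1 token/pass):     {flops_per_byte(1):.0f} FLOPs/byte")
    print(f"prefill (2048 tokens/pass): {flops_per_byte(2048):.0f} FLOPs/byte")
    # A modern GPU needs on the order of 10^3 FLOPs/byte to stay compute-bound,
    # so prefill saturates the math units while batch-1 decode starves them.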


i was a software guy, sorry, but those token rates are correct; that's what was flowing through my software.

i believe there was a special deal on super special FPGAs. there were DSPs involved.



