The problem isn’t just developing your own processor. Nvidia has a deep stack of cutting-edge technology, including everything acquired from Mellanox and an enormous OSS toolchain around CUDA, that anyone seeking to make comparable products has to overcome.
Are you suggesting Meta or Google - who stand to save billions - won't be able to get top performance from their custom chips because their tooling/hardware won't support CUDA?
No. I’m suggesting they won’t, because IP like the Mellanox treasure chest Nvidia acquired is ridiculously difficult to develop, and Nvidia has aggressively exploited it along with its other already-advanced IP in the space of its -core business-.
I understand that, especially amongst Googlers, there’s a belief that no one is smarter than a Googler. But it’s simply not the case. Nvidia is excellent at its core competency and business, which is making absurdly parallel compute platforms with absurdly powerful interconnects. I’m saying Google or Meta won’t beat Nvidia at hardware. I’d also point out that Nvidia’s ability to raise capital is now the best on earth, so even money isn’t a barrier.
The advantage CUDA gives is in the toolchains, libraries, research, and everything else that tens of thousands of people contribute to as part of their jobs, research, and hobbies. This is almost -more valuable- than getting top performance. Getting top techniques, top software, top everything by having everyone everywhere working to build the ecosystem around your product is invaluable. Google won’t have that. They will just have the hubris of Googlers who believe they’re smarter.
I would also note that at this phase of a tech cycle, trying to save billions takes your eye off the prize. Cost optimization comes much later, after the market has been fully explored, directions are clear, and diminishing returns on R&D kick in. Any company that doesn’t recognize that is run by CPAs and deserves the ignominy it will face.
The bar for success for Google and Meta is much lower than for Nvidia - at least for internal usage. Any dollar Google saves on CapEx or OpEx by using custom silicon instead of buying from Nvidia brings down its cost of revenue. They don't have to match Nvidia on raw performance; they can aim to be better on performance per watt or performance per dollar (TCO) for large workloads - and IIRC, Google is already doing this for some internal inference tasks.
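To make the "performance per TCO dollar" argument concrete, here is a minimal sketch. All the numbers (prices, wattage, the 1.5x performance gap) are made-up illustrations, not real Nvidia or Google figures; the point is only that a chip can lose on raw performance and still win once purchase price and power are folded in.

```python
# Hypothetical TCO comparison: custom accelerator vs. off-the-shelf GPU.
# Every figure below is an illustrative assumption, not a real vendor number.

def tco(unit_price, watts, utilization, years=3, kwh_cost=0.08):
    """Total cost of ownership per device: purchase price plus energy."""
    hours = years * 365 * 24 * utilization
    energy_cost = (watts / 1000) * hours * kwh_cost
    return unit_price + energy_cost

def perf_per_tco_dollar(perf, unit_price, watts, utilization):
    """Throughput (arbitrary units) delivered per TCO dollar."""
    return perf / tco(unit_price, watts, utilization)

# Assumed: the GPU is 1.5x faster per chip but costs 4x as much
# and draws more power.
gpu = perf_per_tco_dollar(perf=1.5, unit_price=30000, watts=700, utilization=0.7)
custom = perf_per_tco_dollar(perf=1.0, unit_price=7500, watts=400, utilization=0.7)

print(f"custom/gpu perf-per-TCO-dollar ratio: {custom / gpu:.2f}")
```

Under these assumed numbers the slower custom chip delivers roughly 2.5x the work per dollar of total cost, which is the kind of math that makes internal silicon worthwhile even without matching Nvidia's peak performance.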
> I would also note that at this phase of a cycle in tech trying to save billions takes your eye off the prize
Big Tech companies are conglomerate-ish and can multitask. The search-engine folk aren't pushing work back onto the backlog to fight fires that delay chip tape-out, and I bet the respective CEOs aren't burning brain cycles micromanaging silicon development either; directors 2-3 rungs below the C-suite can motivate and execute such an undertaking. The answer to "I need a budget of $300M in order to save the company $5-15B over 3 years" is "How soon can you start?"
>No. I’m suggesting they won’t because IP like the mellanox treasure chest they acquired is ridiculously difficult to develop and Nvidia has aggressively exploited it, along with their other already advanced IP in the space of their -core business-.
For training Llama 3, Meta set up two clusters, one using fancy InfiniBand and one just using RoCE over Arista switches: https://engineering.fb.com/2024/03/12/data-center-engineerin... . The latter ended up doing fine, suggesting that all that Mellanox gear isn't necessary for large-scale training (apparently, at a large enough scale, Ethernet scales better than InfiniBand).