
Who is building the 100,000 GPU clusters for xAI?

There are not enough GPUs to satisfy the market's ambitions.

Companies controlled by Elon Musk (SpaceX, Tesla, xAI, and X, formerly Twitter) each require significant numbers of GPUs for their own AI and HPC projects, and there are not enough GPUs to meet all of those ambitions. Musk therefore has to prioritize where the GPUs he can obtain should go.

Musk co-founded OpenAI back in 2015. After a power struggle in 2018 (which we believe was closely tied to the enormous investment needed to train AI models and to the governance of those models), Musk left OpenAI, paving the way for Microsoft to step in with substantial funding. Seeing OpenAI become the dominant force in production-grade generative AI, Musk quickly founded xAI in March 2023, and the startup has been scrambling ever since to raise money and find GPU allocations to build infrastructure that can compete with the likes of OpenAI/Microsoft, Google, Amazon Web Services, and Anthropic.


Raising funds is the easier part.

At the end of May, Andreessen Horowitz, Sequoia Capital, Fidelity Management, Lightspeed Venture Partners, Tribe Capital, Valor Equity Partners, Vy Capital, and Kingdom Holding (a Saudi royal holding company) together put $6 billion into xAI's Series B round, bringing its total funding to $6.4 billion. That is a good start, and conveniently, Musk received a $45 billion compensation package from Tesla, so he can top up xAI's GPU funding at any time. (He may wisely set aside some of that money for GPU allocations for Tesla, X, and SpaceX.)

To some extent, Tesla is paying Musk, in one shot, the $44 billion he needed to acquire X in April 2022, plus an extra $1 billion, which is roughly the cost of a 24,000-GPU cluster but is, by these standards, pocket change. To be fair, Tesla has shaken up the automotive industry, with $96.8 billion in sales in 2023, $15 billion in net income, and $29.1 billion in cash. But even in this newly gilded age, that is a staggering compensation package. Musk has big things to do, however, and his board is willing to burn Tesla's cash, and then some, to keep him happy. By the same logic, we would be willing to acquire JPMorgan Chase for $650 billion, with funds from Bank of America, Abu Dhabi, the Federal Reserve, and anywhere else we could find them, so long as next year's salary comes in just a little higher than the acquisition cost, say $675 billion. We could then rename it TPMorgan Chase and, after paying off the loans, still have $25 billion left to play with.

Now, let's get back to the main topic.

This brings us to the immense compute, storage, and networking needs of xAI. The Grok-0 large language model, with 33 billion parameters, was trained in August 2023, just a few weeks after xAI was founded. Grok-1, which added conversational prompting and has 314 billion parameters, was launched in November 2023. That model was open-sourced in March 2024, just before the introduction of Grok-1.5, which has a larger context window and better average scores on cognitive benchmarks than Grok-1.

As you can see, Grok-1.5 is slightly inferior in intelligence compared to its competitors from Google, OpenAI, and Anthropic.

The Grok-2 model is due in August and was originally slated to be trained on 24,000 Nvidia H100 GPUs; it is reportedly being trained on Oracle's cloud infrastructure. (Oracle has reportedly signed an agreement with OpenAI to absorb any GPU capacity that xAI does not use.)

Musk has said in multiple tweets that Grok 3 will be released by the end of this year, will require a 100,000-GPU Nvidia H100 cluster for training, and will be on par with the future GPT-5 model being developed by OpenAI and Microsoft. Oracle and xAI had been working toward an agreement on GPU capacity, but when the rumored $10 billion cluster deal fell through three weeks ago, Musk quickly changed course and is building a "computing super factory" in an old Electrolux factory south of Memphis, Tennessee, to house his own 100,000-GPU cluster. If you live in Memphis, things are about to get a little crazy: xAI hopes to secure 150 megawatts of electricity.

According to Bloomberg, the factory's current power feed is 8 megawatts, which may rise to 50 megawatts in the coming months. Going beyond that would require a great deal of paperwork with the Tennessee Valley Authority.
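For a sense of where that 150 megawatt figure might come from, here is a back-of-the-envelope estimate. The GPU power rating, host overhead factor, and PUE below are our illustrative assumptions, not disclosed figures for the Memphis site:

```python
# Back-of-the-envelope power estimate for a 100,000-GPU H100 cluster.
# TDP, host overhead, and PUE are illustrative assumptions, not
# disclosed figures for the Memphis site.

GPUS = 100_000
GPU_TDP_KW = 0.7      # H100 SXM is rated at roughly 700 W
HOST_OVERHEAD = 1.5   # assumed multiplier for CPUs, NICs, fans, storage
PUE = 1.3             # assumed power usage effectiveness (cooling, losses)

it_load_mw = GPUS * GPU_TDP_KW * HOST_OVERHEAD / 1000
facility_mw = it_load_mw * PUE
print(f"IT load: {it_load_mw:.0f} MW, facility draw: {facility_mw:.0f} MW")
```

Under these assumptions the facility draw lands in the same neighborhood as the 150 megawatts xAI reportedly wants, which is why an 8 megawatt feed, or even a 50 megawatt one, is nowhere near enough.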

By the way, if you have a large supercomputer in Memphis, you absolutely cannot give it any nickname unless it is related to Elvis Presley. In the coming years, you can name consecutive machines after different stages of Elvis. You might want to name this machine "Hound Dog," which is a product of Elvis's early rock 'n' roll phase. However, if Musk cannot secure the full quota of 100,000 H100s before December (which seems unlikely unless Nvidia is willing to help), it might be called "Heartbreak Hotel."

Last week, while we were out on a family medical emergency, Musk made remarks suggesting the machine might be referred to as the SuperCluster, the same term Meta Platforms uses for the AI training machines it purchases rather than builds itself. (We prefer the name "Beagle" ourselves.)

We believe that the number of 100,000 GPUs is just a vision; perhaps by December, xAI will only have 25,000 GPUs, in which case it can still train very large models. Some reports we have seen suggest that the Memphis SuperCluster will not be fully scaled until late 2025, which we think is possible.

We can infer from the xitts of Charles Liang, the founder and CEO of Supermicro, that Supermicro is building a water-cooled machine for xAI to deploy in the Memphis data center:

The specifics of the server infrastructure are unclear, but we strongly suspect the machine will be based on eight-way HGX GPU boards in rack-mounted systems from Supermicro, inspired by Nvidia's SuperPOD design but with Supermicro's own engineering tweaks and a lower price. With eight GPUs per HGX board, 100,000 GPUs works out to 12,500 nodes, giving 100,000 endpoints on the backend network and 12,500 endpoints on the frontend network used to access data and manage the nodes in the cluster.
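The node and endpoint counts above follow directly from the eight-way board assumption. A minimal sketch of the arithmetic; the one-backend-NIC-per-GPU and one-frontend-NIC-per-node mapping is our assumption about the design, not a confirmed detail:

```python
# Node and network-endpoint arithmetic for an eight-way HGX cluster.
# The NIC-per-GPU and NIC-per-node mapping is assumed, not confirmed.

GPUS_TOTAL = 100_000
GPUS_PER_NODE = 8                    # one eight-way HGX board per node

nodes = GPUS_TOTAL // GPUS_PER_NODE  # servers in the cluster
backend_endpoints = GPUS_TOTAL       # GPU-to-GPU (east-west) traffic
frontend_endpoints = nodes           # data access and node management

print(nodes, backend_endpoints, frontend_endpoints)  # 12500 100000 12500
```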

Juniper Networks CEO Rami Rahim also spoke about being involved in the Memphis SuperCluster:

If you have just seen these tweets, you might conclude that Juniper has won the network deal for the Memphis SuperCluster in some way, which is indeed surprising considering the efforts of Arista Networks and Nvidia itself in AI cluster networking. We have not seen any news from Arista about this system, but on May 22, when Nvidia was discussing its financial performance for the first quarter of fiscal year 2025, CFO Colette Kress said:

"In the first quarter, we began shipping our all-new Spectrum-X Ethernet networking solution, which is optimized from the ground up for AI. It includes our Spectrum-4 switch, BlueField-3 DPU, and new software technologies to overcome the challenges of AI over Ethernet, delivering 1.6 times the networking performance for AI processing compared to traditional Ethernet. Spectrum-X sales are ramping with multiple customers, including one with a large 100,000-GPU cluster. Spectrum-X opens a brand-new market for Nvidia networking and enables Ethernet-only data centers to accommodate large-scale AI. We expect Spectrum-X to grow into a multibillion-dollar product line within a year."

Let's face it: there are not many 100,000-GPU deals going on in the world right now, and given what we are now seeing of Musk's plans for the system, we are quite certain that Nvidia's May statement referred to the Memphis SuperCluster. We therefore believe Nvidia has the backend (or east-west) network with its Spectrum-X gear, while Juniper has the frontend (or north-south) network, and Arista, it seems, has none of it.

We have not yet seen any word on what storage the Memphis supercluster will use. It could be Supermicro-built primary arrays mixing flash and disk and running any number of file systems, or an all-flash array from Vast Data or Pure Storage. With a gun to our heads, we would guess that Vast Data is in the deal for a large chunk of the storage, but that is only a guess based on the company's momentum building large storage arrays in the HPC and AI space over the past two years.
