Deconstructing Nvidia’s Vera Rubin — The Successor To Blackwell That’s 10x More Efficient

By CNBC


Vera Rubin: A Deep Dive into Nvidia’s Next-Gen AI Data Center System

Key Concepts:

  • Vera Rubin: Nvidia’s next-generation rack-scale AI data center system, designed to address energy efficiency bottlenecks.
  • Grace Blackwell: The current generation rack-scale system, featuring 72 GPUs and serving as a foundation for Vera Rubin.
  • NVLink: Nvidia’s high-speed interconnect technology, crucial for GPU-to-GPU communication within the rack.
  • DPU (Data Processing Unit): Processors designed for data-centric tasks like storage and security, exemplified by Nvidia’s BlueField.
  • HBM4 (High Bandwidth Memory 4): The latest generation of high-performance memory used in Rubin GPUs.
  • Rack Scale Design: A holistic approach to data center infrastructure, integrating compute, networking, and memory as a single, optimized unit.
  • AI Factory: A large-scale deployment of AI infrastructure, comprising numerous interconnected racks.
  • SoCAMM: A removable memory module format used on the Vera Rubin board, allowing memory to be installed and swapped rather than soldered in place.

1. Introduction: The Demand for AI and the Need for Efficiency

The video details Nvidia’s latest advancements in AI infrastructure, focusing on the Vera Rubin system. Driven by soaring demand from major players like Microsoft, Google, Amazon, and Meta, Nvidia is addressing the critical bottleneck of energy consumption in AI buildouts. Vera Rubin is projected to deliver ten times the performance per watt compared to the Blackwell system, representing a significant leap in efficiency. Nvidia’s stock has risen over 100% since the announcement of Blackwell, demonstrating the market’s response to the rack-scale design approach.

2. The Complexity of Vera Rubin: A Massive Ecosystem

Vera Rubin is not simply a collection of GPUs; it is a complex system comprising 1.3 million components sourced from over 80 suppliers across more than 20 countries. The system is currently in volume production, with shipments planned for later this year. Key components and suppliers include:

  • Silicon & Core Chips: TSMC
  • Rack Assembly: Foxconn
  • Liquid Cooling: Delta Electronics
  • Connectors & Copper: Amphenol
  • Cooling Distribution: Vertiv
  • Power Shelves: MegMeet, LiteOn, Flex
  • Power Semiconductors: Monolithic Power Systems, Infineon, Analog Devices, STMicroelectronics
  • Chassis: Foxconn, Interplex
  • Busbars: Bizlink
  • Rack Manifolds: Pinda
  • Cold Plates: Auras, AVC, Boyd, Coolermaster
  • Power Whips: JPC, Recodeal

Nvidia created a standard reference design to facilitate this complex supply chain, opening it up to a wider ecosystem of manufacturers.

3. Grace Blackwell vs. Vera Rubin: A Comparative Analysis

Grace Blackwell, the current generation, features 72 GPUs and approximately 1.2 million components. Vera Rubin builds on this foundation with roughly 100,000 additional components; it draws about twice the power but delivers a far larger increase in compute. Specifically:

  • Vera Rubin Pod: Contains 1,152 GPUs across 16 racks.
  • Compute Density: Vera Rubin generates a dramatically higher token throughput than previous generations.
  • Vera CPU: Delivers two times the performance per watt compared to the previous generation Grace CPU.
  • Rubin GPU: Capable of delivering approximately 50 petaflops of AI performance – 2.5x the performance of its predecessor.
  • Superchip: Each Vera Rubin superchip contains 17,000 components.
  • Memory: Vera Rubin utilizes removable SoCAMM memory units, unlike the soldered-in memory of Grace Blackwell. It also features eight stacks of HBM4 memory from SK Hynix and Samsung.
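The pod-level figures above are mutually consistent, which a quick back-of-envelope check confirms (the per-pod exaflop total below is derived from the quoted per-GPU number, not stated in the video):

```python
# Back-of-envelope check of the Vera Rubin pod figures quoted above.
GPUS_PER_RACK = 72      # rack design carried over from Grace Blackwell
RACKS_PER_POD = 16
PFLOPS_PER_GPU = 50     # approximate AI performance per Rubin GPU

gpus_per_pod = GPUS_PER_RACK * RACKS_PER_POD
pod_pflops = gpus_per_pod * PFLOPS_PER_GPU

print(gpus_per_pod)        # 1152, matching the quoted pod size
print(pod_pflops / 1000)   # 57.6 exaflops of AI compute per pod (derived)
```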

4. Addressing Key Challenges: Supply Chain, Cooling, and Reliability

Nvidia is proactively managing several risks associated with Vera Rubin’s deployment:

  • Supply Chain: Nvidia is providing detailed forecasts to suppliers and aligning with them to ensure component availability, particularly for scarce resources like HBM4.
  • Overheating: Early Blackwell deployments experienced overheating due to improper liquid cooling valve seating and other implementation errors. These issues have largely been resolved.
  • Liquid Cooling Infrastructure: Vera Rubin is 100% liquid cooled, requiring data centers to adopt robust liquid cooling loops. Liquid cooling nevertheless reduces overall water consumption by minimizing the need for evaporative cooling. Vera Rubin racks consume approximately 220 kW of power.
  • Tariffs: The complex supply chain is susceptible to tariff fluctuations, but Nvidia is leveraging demand to secure necessary components.
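The 220 kW rack figure implies a substantial facility-level draw; the pod-level total below is a derived estimate, not a number stated in the source:

```python
# Derived estimate: power draw of a full 16-rack Vera Rubin pod.
RACK_POWER_KW = 220   # quoted per-rack consumption
RACKS_PER_POD = 16

pod_power_mw = RACK_POWER_KW * RACKS_PER_POD / 1000
print(pod_power_mw)   # 3.52 MW per pod (derived estimate)
```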

5. The Role of NVLink and Networking

Efficient data transfer is paramount. Nvidia addresses this with:

  • NVLink: The NVLink Switch chip doubles the line rate from 1.8 TB/s to 3.6 TB/s, connecting all GPUs and CPUs within the rack. Nine NVLink Switch trays connect the 72 GPUs, achieving a combined data transfer rate of 260 TB/s. The system uses 5,000 copper cables, totaling two miles in length, to provide this connectivity.
  • BlueField DPUs: Handle storage and security tasks.
  • ConnectX-9 Networking Controllers: Originally developed by Mellanox (acquired by Nvidia for nearly $7 billion), these controllers provide high-speed networking capabilities.
  • Spectrum-X Switches: Dedicated networking racks filled with Nvidia’s latest Spectrum-X switches connect multiple Vera Rubin racks to form an “AI factory.”
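The quoted 260 TB/s aggregate lines up with the per-GPU line rate; a minimal sanity check:

```python
# Aggregate NVLink bandwidth implied by the per-GPU line rate.
GPUS = 72
LINE_RATE_TBPS = 3.6   # TB/s per GPU after the NVLink line-rate doubling

aggregate = GPUS * LINE_RATE_TBPS
print(round(aggregate, 1))   # 259.2 TB/s, consistent with the ~260 TB/s quoted
```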

6. Future Outlook: Kyber and Beyond

Nvidia is already looking ahead with the Kyber architecture, a future rack-scale system featuring 288 GPUs. Kyber aims to further increase compute density and efficiency, with a 50% increase in weight despite quadrupling the GPU count. The design focuses on reducing cabling and connection points to improve reliability and reduce total cost of ownership. Nvidia’s long-term strategy involves continuous architectural leaps, encouraging customers to adopt new generations rather than phasing out older systems.

7. Competitive Landscape and Industry Trends

While Nvidia dominates the AI infrastructure market, competition is emerging:

  • AMD Helios: AMD’s first rack-scale system is expected to ship later this year, providing a second source for customers.
  • In-House Silicon: Major cloud providers like AWS (Trainium 2), Google (TPUs), and Microsoft/Meta are developing their own AI chips, but continue to rely on Nvidia’s platforms.

Quote: “I don’t know if we’re the only company that can do it, but I can definitely say that there’s a lot of, you know, sort of growing pains that went into understanding the complexity of delivering this type of system that has never been designed before at this scale.” – Nvidia Representative.

Conclusion:

Nvidia’s Vera Rubin represents a significant advancement in AI data center infrastructure, prioritizing energy efficiency and scalability. The system’s success hinges on a complex global supply chain, innovative cooling solutions, and high-speed interconnect technologies like NVLink. While challenges remain, Nvidia’s proactive approach to supply chain management and continuous innovation positions it as a leader in the rapidly evolving AI landscape. The focus on performance per watt and the continuous development of new architectures like Kyber demonstrate Nvidia’s commitment to pushing the boundaries of AI computing.
