This reference architecture focuses on a four-SU configuration with 127 DGX nodes. DGX SuperPOD can scale to much larger configurations, up to and beyond 64 SUs with 2000+ DGX H100 nodes. See Table 3 for more information.
Table 7. Major components of the 4 SU, 127-node DGX SuperPOD

| Count | Component | Recommended Model |
|---|---|---|
| 38 | Racks | Rack (Legrand) NVIDPD13 |
| 127 | GPU nodes | DGX H100 system |
| 4 | UFM appliance | NVIDIA Unified Fabric Manager Appliance 3.1 |
| 5 | Management servers | Intel based x86, 2 × Socket, 24 core or ... |
Figure 9. SN4600C switch

Out-of-Band Management Network

Figure 10 shows the OOB Ethernet fabric. It connects the management ports of all devices, including DGX systems, management servers, storage, networking gear, rack PDUs, and all other devices. These are separated onto their own fabric ...
(GDS) provides a way to read data from the remote filesystem or local NVMe directly into GPU memory, providing higher sustained I/O performance with lower latency. Using the storage fabric on the DGX SuperPOD, a GDS-enabled application should be able to read data at over 40 GBp...
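A minimal sketch of that GDS read path, using the cuFile API from the GPUDirect Storage toolkit (`cufile.h`). This is illustrative only: error handling is abbreviated, `gds_read_example` is our name, and running it requires a CUDA-capable GPU plus a GDS-enabled filesystem mount.

```cuda
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int gds_read_example(const char *path, size_t nbytes) {
    // Open with O_DIRECT so reads bypass the host page cache.
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) return -1;

    cuFileDriverOpen();                        // initialize the GDS driver

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);     // register the file with GDS

    void *dev_buf = nullptr;
    cudaMalloc(&dev_buf, nbytes);              // destination is GPU memory
    cuFileBufRegister(dev_buf, nbytes, 0);     // optional: pin for best throughput

    // DMA the data straight from storage into GPU memory,
    // skipping any CPU bounce buffer.
    ssize_t got = cuFileRead(handle, dev_buf, nbytes,
                             /*file_offset=*/0, /*dev_offset=*/0);

    cuFileBufDeregister(dev_buf);
    cudaFree(dev_buf);
    cuFileHandleDeregister(handle);
    cuFileDriverClose();
    close(fd);
    return got == (ssize_t)nbytes ? 0 : -1;
}
```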
Figure 12. DGX SuperPOD architecture overview

NVIDIA Base Command

NVIDIA Base Command powers every DGX SuperPOD, enabling organizations to leverage the best of NVIDIA software innovation. Enterprises can unleash the full potential of their investment with a proven platform that includes enterprise-grad...
The DGX SuperPOD is optimized for a customer's particular multi-node AI, HPC, and hybrid workloads:

- A modular architecture based on SUs of 32 DGX H100 systems each.
- A fully tested system scales to four SUs, but larger deployments can be built based on customer ...