Over the last decade, workflows on high-performance computing (HPC) systems have diversified greatly, often blending AI/ML processing with traditional HPC. In response, a wide variety of specialized HPC computer systems (cluster nodes) have been designed to optimize performance for specific applications and frameworks. Queues targeting these systems allow each user to instruct the batch scheduler to dispatch jobs to hardware closely matched to their application's computational requirements. High-memory nodes, nodes with one or more accelerators, nodes supporting a high-performance parallel filesystem, interactive nodes, and hosts designed for containerized or virtualized workflows are just a few examples of the specialized node groups developed for HPC. Nodes might also be grouped into queues by the way they are interconnected.
The density and traffic requirements of the interconnected systems in a datacenter hosting an HPC cluster demand topologies like the spine/leaf architecture shown in Figure 1. The picture becomes even more complex if HPC systems grow beyond the capacity of a single location and are distributed among multiple buildings or data centers. Traffic patterns involving inter-process communication, interactive access, shared filesystem I/O, and service traffic such as NTP, DNS, and DHCP, some of which are strongly latency sensitive, would otherwise have to compete for available bandwidth. The spine/leaf architecture addresses this problem by enabling routing algorithms that can provide a unique and unfettered path for any node-to-node communication.
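The path diversity a spine/leaf fabric provides can be put into numbers. The sketch below is illustrative only, with hypothetical port counts and link speeds (not taken from the article): in a two-tier fabric where every leaf uplinks to every spine, traffic between nodes on different leaves has one equal-cost path per spine, and the ratio of host-facing to fabric-facing bandwidth gives the leaf's oversubscription.

```python
# Illustrative sketch: path diversity and oversubscription in a
# two-tier spine/leaf fabric. All port counts and link speeds below
# are hypothetical examples, not figures from the article.

def spine_leaf_stats(num_spines, downlinks_per_leaf,
                     downlink_gbps, uplink_gbps):
    """Return (ecmp_paths, oversubscription) for one leaf switch."""
    # Each leaf has one uplink to every spine, so traffic between
    # nodes on different leaves can take one equal-cost path per spine.
    ecmp_paths = num_spines
    # Oversubscription: host-facing bandwidth vs. fabric-facing bandwidth.
    down_bw = downlinks_per_leaf * downlink_gbps
    up_bw = num_spines * uplink_gbps
    return ecmp_paths, down_bw / up_bw

# Example: 4 spines, 32 hosts per leaf at 100G, 400G uplinks.
paths, ratio = spine_leaf_stats(4, 32, 100, 400)
print(paths, ratio)  # 4 equal-cost paths, 2:1 oversubscription
```

A ratio of 1.0 would be a fully non-blocking leaf; values above 1.0 mean hosts share fabric bandwidth, which is often an acceptable trade-off for less latency-sensitive traffic.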
Figure 1: Fabric topologies
HPC is now evolving further, from almost exclusively purpose-built on-premise infrastructure to hybrid or even fully cloud-resident architectures. The high cost of building, operating, and maintaining dedicated HPC infrastructure has led many government labs, companies, and universities to rethink the purpose-built strategy over the last couple of decades. Instead of purchasing the space, racks, power, cooling, data storage, servers, and networking required to build on-premise HPC clusters, not to mention funding the staff needed to maintain and update those systems, all but the largest HPC practitioners are moving to a usage-based model offered by cloud providers with HPC services. These changes have spurred renewed investment in the internet connectivity and bandwidth needed to enable cloud bursting, data migration, and interactivity on cloud-resident infrastructure. They also create new challenges for developers who must establish custom environments in which to develop and run application frameworks, often entailing complex software version interdependencies. Containerization has helped isolate many of these software and library dependencies, making cloud migration simpler by relaxing host image constraints.
HPC Network Infrastructure Considerations for 400G/800G Ethernet
Internet service providers and carriers responsible for delivering all of this traffic depend on technologies that grow at a steady and reliable pace, and they are, of course, highly cost conscious: their bottom line is tied to the investment in building out, upgrading, and managing the operational cost of network infrastructure. Hyperscalers and cloud service providers also face increasing pressure to aggregate and reduce the number of switch devices, as well as the power and cooling demands, in their datacenters.
Cost is not the only factor to consider in driving Ethernet to these new speeds. PAM-4 signaling, illustrated in Figure 2, was initially introduced at a 25 Gb/s signaling rate as an enabler for 100G Ethernet, but this method requires forward error correction (FEC) due to elevated bit error rates. Incorporating FEC adds both latency overhead and physical-layer design complexity, and at even faster signaling rates FEC becomes mandatory. Aggregating multiple 100 Gb/s ports to achieve higher bandwidth, which is still possible at NRZ signaling rates, may be a temporary fix, but it is not a long-term solution because of the density constraints it entails and the elevated cost of the exponentially larger port counts required. Beyond 400G Ethernet, alternatives to PAM-4 offering even greater signal density and longer reach must be leveraged.
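The trade-off between symbol density and FEC overhead can be sketched with back-of-envelope arithmetic. The numbers below are illustrative assumptions, not figures from the article: PAM-4 carries two bits per symbol versus NRZ's one, and the RS(544,514) Reed-Solomon FEC commonly paired with PAM-4 Ethernet lanes (together with 256b/257b transcoding) adds a few percent of line-rate overhead.

```python
# Illustrative sketch (assumed encoding parameters, not from the
# article): line rate of a PAM-4 Ethernet lane after 256b/257b
# transcoding and RS(544,514) forward error correction.

def line_rate(payload_gbps):
    """Serial line rate in Gb/s for a given payload rate."""
    # 257/256: 256b/257b transcoding; 544/514: RS(544,514) FEC symbols.
    return payload_gbps * (257 / 256) * (544 / 514)

# A 50G PAM-4 lane: ~53.125 Gb/s on the wire.
rate = line_rate(50)
# PAM-4 carries 2 bits per symbol, so the baud rate is half the bit rate;
# NRZ would need the full bit rate in baud for the same payload.
baud_pam4 = rate / 2
print(round(rate, 3), round(baud_pam4, 4))
```

The FEC overhead factor 544/514 (about 5.8%) is the price paid for correcting the elevated bit error rates that multi-level signaling introduces; the latency cost of the FEC encode/decode path comes on top of that.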
Figure 2 – High speed Ethernet signaling
Cabling is yet another challenge for high-speed Ethernet. Copper cables are often too noisy and power hungry at these speeds, even over short distances. Optical interfaces must move closer to the core physical coding sublayer (PCS) to avoid the signal loss and power demands introduced by external electrical-to-photonic connectors. One use case requires breakout cabling options, since multiple computer systems could be served by a single switch port of sufficiently high bandwidth. Another use case focuses on aggregation-layer switch-to-switch or site-to-site connectivity. Dense Wavelength Division Multiplexing (DWDM) for long-haul connections (ca. 80 km per repeated segment) and single-mode fiber (SMF) for shorter-range connections will gradually replace multimode fiber and copper technologies to enable 200 Gb/s signaling rates, but the cost advantages of 100G electrical signaling and multimode fiber will be hard to overcome over the next few years. CWDM and DWDM introduce coherent optical signaling as an alternative to PAM-4, but entail even greater power, cost, and complexity to achieve the longer reaches they enable. Within the datacenter, the pressures of backwards compatibility, switch aggregation, switch count reduction, and potential power savings are powerful inducements for a flexible on-board optics design that can also accommodate existing pluggable modules for down-rate connectivity.
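The media choices above essentially form a ladder of reach versus power and cost. The sketch below is illustrative only: the reach figures are rough, commonly cited ballparks for high lane rates, not specifications from the article, and the selection logic is a simplification of a decision that also weighs power, cost, and installed plant.

```python
# Illustrative media-selection sketch. Reach figures are rough,
# commonly cited ballpark values at high lane rates, not
# specifications quoted in the article.

MEDIA_REACH_M = [
    ("copper DAC", 3),             # in-rack only at 100G+ lane rates
    ("multimode fiber", 100),      # row / small-room scale
    ("single-mode fiber", 10_000), # datacenter / campus scale
    ("DWDM (coherent)", 80_000),   # long haul, per repeated segment
]

def pick_media(distance_m):
    """Return the shortest-reach media type that covers the distance."""
    for media, reach in MEDIA_REACH_M:
        if distance_m <= reach:
            return media
    return "repeated DWDM segments"

print(pick_media(2))    # copper DAC
print(pick_media(500))  # single-mode fiber
```

Each rung up the ladder buys reach at the cost of power, complexity, or both, which is why the article expects copper and multimode fiber to persist in-rack even as SMF and DWDM take over the longer spans.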
Enabling 400G/800G Ethernet with IP
So how do SoC designers develop chips to support 400G Ethernet and beyond? Network switches and computer systems must use components that support these elevated data rates to deliver the application acceleration they promise. Whether the goal is reducing fabric complexity to achieve greater aggregation, expanding a hyperscaler's infrastructure beyond the limits previously imposed by slower network technologies, or speeding the delivery of data to a neural network running on a group of network-connected computers, every element in the data path must support the lower latencies and higher bandwidth required without excessive power or cost penalties. And of course, backwards compatibility with slower components will ensure the seamless adoption and integration of 400G/800G Ethernet and beyond into existing datacenters.
Delivering this performance in 400G/800G networking involves multiple challenges in the physical and electronic realms. Electrical efficiency is difficult to achieve with the faster clock speeds, parallel paths, and complex signaling involved, and the elevated error rates intrinsic to faster communication create the need for a highly efficient FEC that ensures minimal latency with low retransmission rates. As mentioned earlier, cabling media must support the elevated data rates at rack, datacenter, and even metropolitan scales. No single cabling technology is ideal across such a diverse range of distances, so any solution developed must support multiple media types.
SoC designers need silicon IP developed with all of these considerations in mind. Synopsys has been a leading developer of Ethernet silicon IP across many generations of the protocol and remains integral to pushing standardization for 400G/800G Ethernet and beyond. Synopsys offers an integrated 400G/800G Ethernet IP solution that is compliant with industry standards and configurable to meet the varied needs of today's HPC, including AI/ML workloads, while maintaining backwards compatibility with lower speeds and older standards.
About the Author: Jerry Lotto, Sr. Technical Marketing Manager
Jerry Lotto brings more than 35 years of experience in scientific and high-performance computing. Jerry built the first HPC teaching cluster in Harvard's Department of Chemistry and Chemical Biology with an InfiniBand backbone. In 2007, he helped create the Harvard Faculty of Arts and Sciences Research Computing group. In an unprecedented collaborative effort among five universities, industry, and state government, Jerry also helped design the Massachusetts Green High-Performance Computing Center in Holyoke, MA, which was completed in November 2012.