When you're building a high-performance cluster, you've got to pay attention to the interconnects.
High performance clusters (HPCs) are the best way to process problems that involve performing repetitive operations across very large data sets, such as protein folding, computational fluid dynamics of airflow across an airplane wing, analyzing sensor readings from an oil field, or animating a wireframe movie. These sorts of problems involve massive data sets and extremely complex algorithms. The cluster works by breaking the problem down into many manageable chunks, which are then distributed to individual PCs (generally servers) to process. The results of those individual computations are assembled and then processed to provide the complete solution.
Physically, the HPC consists of a collection of individual servers that are connected by a high-speed network; each server is called a node, and there may be dozens or hundreds of them in the cluster. One server in the HPC is a cluster controller; it manages the process by parceling out the individual chunks of work to the individual nodes, and then reassembling the results. That cluster controller is generally the only server in the HPC that's connected to the "real world"; the network that links the nodes together is typically a dedicated, private network. (Some people call that cluster controller the "head node.")
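To make the controller's job concrete, here's a minimal sketch of that parcel-out-and-reassemble pattern. Everything in it is illustrative: the function names are made up, the workload (a sum of squares) is a placeholder, and threads stand in for nodes, where a real cluster would ship each chunk across the interconnect.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_chunks):
    """Split data into num_chunks contiguous slices of nearly equal size."""
    base, extra = divmod(len(data), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks each take one additional element.
        end = start + base + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    """Work done by one node (placeholder workload: sum of squares)."""
    return sum(x * x for x in chunk)


def run_on_cluster(data, num_nodes):
    """Controller role: scatter chunks, gather partial results, combine them."""
    chunks = split_into_chunks(data, num_nodes)
    # Threads are only a stand-in for nodes on the private network.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)


total = run_on_cluster(list(range(1000)), num_nodes=4)
print(total)
```

The combining step here is a plain sum; in a real application, reassembly is usually where the problem-specific logic lives.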
Depending on the specific architectural design of the HPC, there may also be a database server inside the cluster, on that private network; in other designs, the database server is outside the private network, and is accessed only by the cluster controller.
As you can imagine, data on that private network is flying around fast and furious. The traffic on that network is typified by several characteristics. First, the individual messages are fairly small, ranging from a few hundred kilobytes to a few tens of megabytes, depending on the application. Second, when a node is waiting for that traffic, it's waiting, and is not being productive. Third, because an HPC uses a dedicated network, it doesn't need the overhead, either computationally or in network bandwidth, of a full general-purpose TCP/IP stack.
Looking at Latencies, Bandwidth
In small high-performance clusters, the networking technology—referred to as the interconnect—is standard Ethernet, usually either Fast Ethernet or, increasingly, Gigabit Ethernet. However, as HPCs grow, the latency (delay times) with Ethernet-based interconnects becomes a significant factor in limiting the performance of the system. That has led HPC designers to look for alternative interfaces built on specialized hardware and software, such as Myrinet and InfiniBand, and most recently, PathScale's just-announced InfiniPath interconnect, which we'll focus on here.
What are the factors behind an interconnect's latency? There are several involved in the chain:
- The network drivers in the individual nodes' operating systems
- Implementation of MPI, the Message Passing Interface library that applications use to exchange messages between nodes
- Efficiency of the node's network card circuits and transceiver
- Bandwidth connecting the network card to the node's microprocessor
- Efficiency of the network switch to process the packet
To give an idea of the latency times that we're dealing with here, I studied several different tests of the various networking technologies.
According to most sources, with a 1-byte message, the minimum latency of TCP/IP is in the range of 60 to 95 microseconds. Myrinet, one of the most popular HPC interconnects, slashes that by an order of magnitude, to about 8-12 microseconds. InfiniBand, a newer technology that's just starting to gain traction, is about 5-8 microseconds. PathScale says that the latency of its new InfiniPath system is less than 1.5 microseconds when used with an AMD Opteron-based server.
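Why do a few microseconds matter so much? Here's a rough sketch, not a benchmark. It assumes a deliberately simple model in which a node computes for a fixed interval and then blocks for one message latency before it can continue. The 50-microsecond compute interval is a hypothetical value picked for illustration; the latencies are the figures quoted above, taking the low end of each range (and PathScale's stated 1.5-microsecond ceiling for InfiniPath).

```python
def node_efficiency(compute_us, latency_us):
    """Fraction of time spent computing when every compute step is
    followed by one blocking wait of latency_us microseconds."""
    return compute_us / (compute_us + latency_us)


# Latencies in microseconds, from the figures quoted in the article.
LATENCY_US = {
    "TCP/IP": 60.0,      # low end of the 60-95 range
    "Myrinet": 8.0,      # low end of the 8-12 range
    "InfiniBand": 5.0,   # low end of the 5-8 range
    "InfiniPath": 1.5,   # PathScale's stated upper bound
}

COMPUTE_US = 50.0  # hypothetical compute time between messages

for name, latency in LATENCY_US.items():
    efficiency = node_efficiency(COMPUTE_US, latency)
    print(f"{name:10s} {efficiency:.1%} of time spent computing")
```

The finer-grained the work (that is, the shorter the compute interval between messages), the more the interconnect's latency dominates.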
A second key characteristic of an HPC interconnect is its raw bandwidth. According to Myricom, the folks behind Myrinet, their high-performance M3F2 network connector has theoretical peak data rates of 1067MB/sec on a 64-bit, 133MHz PCI-X slot; on a dual-processor 1.6GHz Opteron server, they report actual performance of 936MB/sec reading and 1032MB/sec writing across the system bus.
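That 1067MB/sec peak is simply the ceiling of the PCI-X slot itself, and the arithmetic is easy to check. The sketch below assumes the nominal PCI-X clock of 133.33MHz with one transfer per clock; the function name is mine, not Myricom's.

```python
def peak_bandwidth_mb_per_sec(bus_width_bits, transfers_per_sec):
    """Theoretical peak of a parallel bus: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_sec / 1e6


# 64-bit PCI-X at a nominal 133.33 MHz, one transfer per clock (assumption).
pci_x_peak = peak_bandwidth_mb_per_sec(64, 133.33e6)

# Myricom's reported figures as a fraction of that theoretical peak.
read_fraction = 936 / pci_x_peak
write_fraction = 1032 / pci_x_peak

print(f"PCI-X peak: {pci_x_peak:.0f} MB/sec")
print(f"read: {read_fraction:.1%} of peak, write: {write_fraction:.1%} of peak")
```

In other words, on these numbers the slot, not the network, sets the upper limit.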
By contrast, PathScale's specs for its newly announced HTX InfiniPath adapter describe a sustained 1.8GB/sec of bidirectional traffic when used on an Opteron-based server. We'll have to wait until the boards ship next year to get real-world benchmarks, but you've got to admit, that's an impressive specification.
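Latency and bandwidth interact, and a common first-order way to see how is to model the time to deliver a message as latency plus size divided by bandwidth. The sketch below is that textbook model, not anything published by Myricom or PathScale. The inputs are assumptions drawn from the figures above: Myrinet at 8 microseconds and its 1067MB/sec theoretical peak, and InfiniPath at 1.5 microseconds with the 1.8GB/sec bidirectional figure split evenly into 900MB/sec each way.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mb_per_sec):
    """First-order model: fixed latency plus serialization time.
    1 MB/sec equals 1 byte per microsecond, so bytes / (MB/sec) is microseconds."""
    return latency_us + message_bytes / bandwidth_mb_per_sec


def half_performance_bytes(latency_us, bandwidth_mb_per_sec):
    """Message size at which serialization time equals the fixed latency.
    Below this size, latency dominates; above it, bandwidth does."""
    return latency_us * bandwidth_mb_per_sec


# (latency in microseconds, one-way bandwidth in MB/sec): illustrative assumptions.
INTERCONNECTS = {
    "Myrinet": (8.0, 1067.0),
    "InfiniPath": (1.5, 900.0),
}

for name, (latency, bandwidth) in INTERCONNECTS.items():
    crossover = half_performance_bytes(latency, bandwidth)
    one_mb = transfer_time_us(1_000_000, latency, bandwidth)
    print(f"{name:10s} crossover {crossover:.0f} bytes, 1MB message {one_mb:.0f} us")
```

The takeaway from the model is that low latency pays off most on small messages, while large transfers are governed almost entirely by bandwidth.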
Inside InfiniPath
InfiniPath's performance, both in latency and in bandwidth throughput, can be attributed to a number of factors.
One is that it's based on the InfiniBand networking technology, and uses InfiniBand switches for the network backbone.
InfiniBand is based on a switched fabric networking architecture, and uses its own lightweight transport services, with global addresses that follow the 128-bit IPv6 format. In some ways, it's analogous to Fibre Channel SCSI, rather than being a turbo-charged Ethernet: It's designed as a specialized technology for I/O-based storage and cluster systems, instead of a general-purpose network architecture. Its overhead is minimal, in terms of the header-to-payload ratio. In the closed environment of an HPC, there's a limited address space, and minimal negotiation required between different network devices.
Another benefit is the way that PathScale has integrated InfiniPath with the HyperTransport architecture found in AMD's Opteron processors. In November 2004, the HyperTransport Technology Consortium, a trade association founded by AMD, introduced a dedicated expansion connector, called an HTX slot, that provides a direct connection between cards and the motherboard's HyperTransport bus infrastructure.
You could think of the HTX slot as being somewhat analogous to the way that many desktop PCs have a dedicated AGP bus connector for video cards. The difference is that the HTX slot is designed for higher-performance I/O than you could accommodate with PCI or PCI-X. Performance is huge: The 16-bit interface, running at 800MHz, is initially rated with a bidirectional performance of 1.6 billion transfers per second; that's a theoretical maximum of 3.2Gbytes/sec. (While the HTX spec doesn't cover slots faster than 800MHz or wider than 16 bits, it doesn't take much imagination to see the possibilities.)
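Here's how those HTX numbers fit together, as a quick sanity check. The sketch assumes two transfers per clock cycle (double data rate), which is how an 800MHz clock yields 1.6 billion transfers per second; the function name is hypothetical.

```python
def link_bandwidth_gbytes_per_sec(width_bits, clock_mhz, transfers_per_clock=2):
    """Theoretical peak of a link: bytes per transfer times transfers per second."""
    transfers_per_sec = clock_mhz * 1e6 * transfers_per_clock
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * transfers_per_sec / 1e9


# 16-bit HTX link at 800MHz, two transfers per clock (assumption).
htx_transfers_per_sec = 800 * 1e6 * 2
htx_peak = link_bandwidth_gbytes_per_sec(width_bits=16, clock_mhz=800)

print(f"{htx_transfers_per_sec / 1e9:.1f} billion transfers/sec")
print(f"{htx_peak:.1f} Gbytes/sec theoretical peak")
```

Plug in a wider link or a faster clock and the same formula shows where the headroom would come from.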
As far as I can tell, the first motherboard to have an HTX slot is from IWill and is expected to ship in the second quarter of 2005, the same time that PathScale plans to ship its InfiniPath HTX card.
PathScale has done more than build the HTX cards, though, to make its InfiniPath system work so fast; remember, end-to-end latency encompasses not just the hardware, but also the software stack. The company has developed a set of high-performance libraries for the Message Passing Interface (MPI) for working with Red Hat and SUSE Linux, and has also released new compilers for the AMD64 architecture that are optimized for high performance clusters.
High Speed, Low Latency
New applications and tools for HPC are coming out all the time, taking advantage of the 64-bit instruction set and advanced bus design of the AMD64 architecture. Cluster designers have long understood the need for low-latency interconnects, such as Myrinet, and the advantages that they offer over Ethernet-based transports. InfiniPath, purpose-built to leverage the Opteron processor's architecture, definitely will take this to the next level.
A former mainframe software developer and systems analyst, Alan Zeichick is principal analyst at Camden Associates, an independent technology research firm focusing on networking, storage, and software development.