Understanding GPU Direct

kyle on February 29, 2012

One of the most interesting features in the CUDA 4.0 release was the introduction of GPU Direct. GPU Direct is the marketing name given to two optimizations for data transfer between GPUs. Since the two optimizations are different (and are applicable to different scenarios), it’s a little confusing to refer to them both as GPU Direct, so I’ll distinguish by calling the first “GD-Infiniband” and the second “GD-Peer-to-Peer.”

In the course of explaining these optimizations, I’m going to talk a little about performance, and my results were measured on the Keeneland system. To understand them, you’ll need to know the architecture of Keeneland’s nodes (the HP SL390 with three NVIDIA M2070 GPUs), which is shown in the following block diagram:

GPU Direct – Infiniband

When transferring data from one GPU to any remote node in the cluster, the data must traverse several links in the system. First, the host initiates a copy from GPU RAM to host memory over the PCIe bus. Then, when an MPI call is used on that host memory, the data is copied to a buffer managed by the Infiniband driver. Finally, data in that buffer is sent out over the Infiniband cable. This whole process is shown in the following figure from a Mellanox whitepaper.
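To make those copies concrete, here is a minimal sketch of the traditional path, assuming a device buffer d_buf of n floats headed for MPI rank dest (the function and variable names are illustrative, not from any particular codebase):

    #include <cuda_runtime.h>
    #include <mpi.h>

    void send_gpu_buffer(const float *d_buf, size_t n, int dest, MPI_Comm comm)
    {
        float *h_buf;
        /* pinned host memory keeps the PCIe copy fast */
        cudaMallocHost((void **)&h_buf, n * sizeof(float));

        /* copy 1: GPU RAM -> host memory over the PCIe bus */
        cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

        /* copy 2 happens inside MPI / the Infiniband driver before the
           data leaves the node -- this is the copy GD-Infiniband removes */
        MPI_Send(h_buf, (int)n, MPI_FLOAT, dest, 0, comm);

        cudaFreeHost(h_buf);
    }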

GD-Infiniband is essentially the removal of an unnecessary copy in host memory (shown as copy 2 in the figure). This copy used to be necessary because the Infiniband driver and the GPU driver couldn’t share data. Now that NVIDIA and Mellanox are collaborating during driver development, both drivers can use the same pinned memory buffer, obviating the need for the second copy.

So, in terms of performance, the main benefit of GD-Infiniband is a decrease in overall communication time (the bandwidth bottleneck continues to be the PCIe bus), measured by NVIDIA at up to a thirty percent reduction. I haven’t confirmed this on Keeneland, largely because the initial implementation of GD-Infiniband required a kernel patch which was only compatible with an older version of the Linux kernel. Unfortunately we couldn’t deploy with this version, since it had some serious security problems. Right now, the plan is to add GD-Infiniband capability when the software stack gets upgraded for the final delivery system.

GPU Direct – Peer-To-Peer

The second GPU Direct optimization is similar in that it also removes a memory copy. This time, however, that copy is removed by taking the host completely out of the picture when communicating data between GPUs on the same node. So, for Keeneland, this corresponds to transferring data between GPUs 1 and 2 in the block diagram above.

Prior to GD-Peer-to-Peer, you had to copy data back to the host first (with the familiar cudaMemcpy calls). To use GD-Peer-to-Peer, first you need to check and make sure it’s possible on your system with the following code. The first check is to see if you are using a compatible driver:
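A minimal sketch of that check, assuming the usual major*1000 + minor*10 encoding of CUDA version numbers (so anything below 4000 predates CUDA 4.0):

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Returns 1 if the installed driver is new enough for GD-Peer-to-Peer. */
    int driver_supports_p2p(void)
    {
        int driverVersion = 0;
        cudaDriverGetVersion(&driverVersion);
        if (driverVersion < 4000) {
            printf("Driver version %d predates CUDA 4.0, no peer-to-peer\n",
                   driverVersion);
            return 0;
        }
        return 1;
    }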

After that, you need to do one more check. This time, you need to make sure that the two GPUs are located on the same PCIe root complex. This is because the GD-Peer-to-Peer optimization uses PCIe bus mastering. So, on Keeneland, P2P is possible only between GPUs 1 and 2. GPU 0 is on a different I/O hub, and since a GPU can’t bus master across QPI, the host must be involved for communication between GPU 0 and GPU 1/2. Luckily there is a nice wrapper for this:
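One such wrapper is cudaDeviceCanAccessPeer; a minimal sketch of using it, with device numbers matching the Keeneland layout above:

    #include <stdio.h>
    #include <cuda_runtime.h>

    /* Asks the runtime whether 'dev' can bus-master directly into 'peer'.
       On Keeneland this reports 1 only for the GPU 1 / GPU 2 pair. */
    int gpus_share_root_complex(int dev, int peer)
    {
        int canAccessPeer = 0;
        cudaDeviceCanAccessPeer(&canAccessPeer, dev, peer);
        if (!canAccessPeer)
            printf("GPU %d cannot access GPU %d directly\n", dev, peer);
        return canAccessPeer;
    }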

If both conditions are met (correct driver and GPUs on the same root complex), there is a nice performance increase from 2.8 GB/s to 4.9 GB/s for large data (1MB), which certainly makes it worth the API calls.
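For completeness, here is a minimal sketch of the direct transfer itself, assuming d_src lives on GPU 1, d_dst lives on GPU 2, and nbytes is the amount of data to move:

    #include <cuda_runtime.h>

    /* Enable peer access once, then use cudaMemcpyPeer so the data moves
       GPU-to-GPU over PCIe without being staged in host memory. */
    void p2p_copy(void *d_dst, void *d_src, size_t nbytes)
    {
        cudaSetDevice(1);                    /* run from GPU 1's context   */
        cudaDeviceEnablePeerAccess(2, 0);    /* allow GPU 1 to reach GPU 2 */

        /* destination is on device 2, source is on device 1 */
        cudaMemcpyPeer(d_dst, 2, d_src, 1, nbytes);
    }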