
Configuring AMD xGMI Links on the Lenovo ThinkSystem SR665 V3 Server

Planning / Implementation

Published: 16 Nov 2023
Form Number: LP1852
PDF size: 13 pages, 1.1 MB

Abstract

AMD Infinity Architecture, introduced with the 2nd Gen AMD EPYC Processors, provides the data path and control support to interconnect core complexes, memory, and I/O. Each Core Complex Die (CCD) connects to the I/O die via a dedicated high-speed Global Memory Interface (GMI) link. The socket-to-socket Infinity Fabric increases CPU-to-CPU transactional speeds by allowing multiple sockets to communicate directly through dedicated external GMI (xGMI) lanes.

AMD EPYC 9004 Series Processors support up to four xGMI links with speeds up to 32Gbps. Lenovo ThinkSystem SR665 V3 has a flexible xGMI link design allowing 1 xGMI link to be converted to additional PCIe 5.0 lanes.

This paper demonstrates how to change xGMI hardware configuration, introduces related UEFI settings, compares the performance between four xGMI links vs. three xGMI links, and provides configuration recommendations for different use cases.

This paper is suitable for customers, business partners, and sellers who wish to understand xGMI link settings and their performance impact on the Lenovo ThinkSystem SR665 V3 server with 4th Gen AMD EPYC processors.

Infinity Fabric

Infinity Fabric (IF) is a proprietary AMD-designed architecture that connects all components and facilitates data and control transfer between them. IF is implemented in most of AMD's recent microarchitectures, including the EPYC processors and other products.

In the AMD EPYC processor, the Core Complex Dies (CCDs) connect to memory, I/O, and each other through an updated I/O die (Figure 1). Each CCD connects to the I/O die via a dedicated high-speed Global Memory Interconnect (GMI) link. The I/O die helps maintain cache coherency and also provides the interface to extend the Infinity Fabric to a second processor via its xGMI links. AMD EPYC 9004 Series Processors support up to four xGMI links with speeds up to 32Gbps.

Figure 1. AMD 4th Gen EPYC processor I/O die function logical view (source: AMD)

xGMI Configurations on SR665 V3

The Lenovo ThinkSystem SR665 V3 is a 2U 2-socket server that features 4th Gen AMD EPYC processors. Figure 2 shows the architectural block diagram of the SR665 V3, including the major components and their connections. Note that one of the xGMI links between the processors can be interchanged with two PCIe 5.0 x16 connections. These PCIe connections can be utilized for additional PCIe/NVMe support.

Figure 2. SR665 V3 system architectural block diagram

The cable inside the red box in Figure 3 is a 2xSwiftX8-2*SwiftX8 cable (PN SC17B23935). Installing this cable results in the 4 xGMI link configuration; removing it results in the 3 xGMI link configuration. With the cable removed, the connectors can be used to connect an additional backplane to support 8 more NVMe drives.

Figure 3. xGMI cable location on SR665 V3

Theoretical Value Analysis

AMD EPYC 9004 Series Processors incorporate PCIe Gen 5 capabilities onto the I/O die and use the same physical interfaces for Infinity Fabric connections; xGMI is one such connection, using a different protocol layered on the same PHY. Note that xGMI links are bidirectional, which is why the calculations below multiply by two directions.

Table 1. Theoretical xGMI BW value with different configurations
xGMI Config             Bisection Theoretical BW
3 links x16 @ 32Gbps    3 x 16 x 32Gbps / 8 x 2 directions = 384 GB/s
4 links x16 @ 32Gbps    4 x 16 x 32Gbps / 8 x 2 directions = 512 GB/s
3 links x4 @ 16Gbps     3 x 4 x 16Gbps / 8 x 2 directions = 48 GB/s
4 links x4 @ 16Gbps     4 x 4 x 16Gbps / 8 x 2 directions = 64 GB/s
Table 2. Single Socket Theoretical Memory BW with different numbers of DIMMs installed
Memory Config     Memory Theoretical BW
12 x 4800MHz      12 x 4800 x 64bit / 8 = 460800 MB/s = 460.8 GB/s
10 x 4800MHz      10 x 4800 x 64bit / 8 = 384000 MB/s = 384 GB/s
8 x 4800MHz       8 x 4800 x 64bit / 8 = 307200 MB/s = 307.2 GB/s
6 x 4800MHz       6 x 4800 x 64bit / 8 = 230400 MB/s = 230.4 GB/s
4 x 4800MHz       4 x 4800 x 64bit / 8 = 153600 MB/s = 153.6 GB/s
2 x 4800MHz       2 x 4800 x 64bit / 8 = 76800 MB/s = 76.8 GB/s
1 x 4800MHz       1 x 4800 x 64bit / 8 = 38400 MB/s = 38.4 GB/s

Four xGMI links can support a maximum theoretical bandwidth of 512 GB/s between sockets, which exceeds the maximum single-socket theoretical memory bandwidth of 460.8 GB/s. This means remote memory access can flow at nearly maximum bandwidth from one CPU to another.

The maximum theoretical bandwidth of three xGMI links is 384 GB/s, the same as the maximum theoretical memory bandwidth of 10 channels. This means remote memory access can flow at nearly maximum bandwidth from one CPU to another when 10 or fewer DIMMs are installed per socket.
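
As a quick cross-check, the arithmetic behind Tables 1 and 2 can be reproduced with a few lines of shell. This is only a sketch of the formulas above (links x lanes x Gbps per lane / 8 bits per byte x 2 directions for xGMI, and DIMMs x MT/s x 8 bytes per 64-bit channel for memory); it does not query the hardware.

# Bidirectional xGMI bandwidth: links x lanes x Gbps per lane / 8 bits per byte x 2 directions
echo "4 x16 links @ 32Gbps: $(( 4 * 16 * 32 / 8 * 2 )) GB/s"   # 512
echo "3 x16 links @ 32Gbps: $(( 3 * 16 * 32 / 8 * 2 )) GB/s"   # 384

# Single-socket memory bandwidth: DIMMs x MT/s x 8 bytes per 64-bit channel
echo "12 DIMMs @ 4800MHz: $(( 12 * 4800 * 8 )) MB/s"           # 460800 MB/s = 460.8 GB/s
echo "10 DIMMs @ 4800MHz: $(( 10 * 4800 * 8 )) MB/s"           # 384000 MB/s = 384 GB/s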

Performance Test Benchmark

STREAM Triad is a simple, synthetic benchmark designed to measure sustainable memory bandwidth. The goal is to measure the highest memory bandwidth supported by the system. STREAM Triad is used here to measure the sustained memory bandwidth of various xGMI and NUMA configurations. Unless otherwise stated, all tests were run using 96GB 2R RDIMMs running at 4800 MHz.

For more information about STREAM Triad, see the following web page:
http://www.cs.virginia.edu/stream/

STREAM is a NUMA-aware workload. A NUMA architecture is a hardware design that separates the cores into multiple clusters, each with its own local memory region; working within that region is preferred, although cores can still access all memory in the system. Firmware will attempt to interleave the memory channels on each quadrant of the socket (NPS4), each half of the socket (NPS2), or the whole socket (NPS1), resulting in multiple NUMA nodes within the system.
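
On Linux, the NUMA layout produced by a given NPS setting can be verified before benchmarking with standard tools; for example, a two-socket system set to NPS4 should report eight nodes. This is a generic check, not a step taken from the test procedure below.

# Show NUMA nodes, their CPUs, memory sizes, and node-to-node distances
numactl --hardware

# Quick summary of the NUMA node count and per-node CPU lists
lscpu | grep -i numa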

Each node contains a subset of all CPUs and memory. The access speed to main memory is determined by the location of the data relative to the CPU. Since STREAM is NUMA-aware, the application accesses data that is local to the CPU the thread is running on to get better performance. This means there is minimal cross-socket traffic, so the impact of the xGMI configuration is minimal.
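
As an illustration only (not the exact build or run scripts used for the results below), STREAM can be compiled from the stream.c source at the URL above with OpenMP enabled and run with one thread per core, pinned so that each thread works on locally allocated data. The array size and iteration count are example values and should be sized to exceed the caches of the system under test.

# Build STREAM with OpenMP; array size and iteration count are example values
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=800000000 -DNTIMES=20 stream.c -o stream

# One thread per core, pinned, so each thread touches and reuses local memory
export OMP_NUM_THREADS=$(nproc)
export OMP_PROC_BIND=spread
export OMP_PLACES=cores
./stream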

The following table shows the STREAM test results of the 3 xGMI link and 4 xGMI link configurations in different operating modes with NPS4. The whole-system memory bandwidth of the 3 xGMI link configuration is very close to that of the 4 xGMI link configuration in the same operating mode. We see similar behavior with NPS1 and NPS2.

Table 3. 24 DIMMs Memory BW with NPS4
xGMI HW Config UEFI Settings Stream Triad (GB/s)
3 links Maximum Performance Mode
NUMA Nodes per Socket = NPS4
742
4 links Maximum Performance Mode
NUMA Nodes per Socket = NPS4
744
3 links Maximum Efficiency Mode
NUMA Nodes per Socket = NPS4
716
4 links Maximum Efficiency Mode
NUMA Nodes per Socket = NPS4
715

The following table shows the STREAM memory bandwidth results with NPS0. NPS0 effectively means one NUMA node for the entire system; it is only available on a 2-socket system. Firmware will attempt to interleave all memory channels in the system. Since there are no local nodes for the application to leverage, far more traffic crosses the xGMI links between the sockets to transfer data. The number, speed, and width of the xGMI links all limit the bandwidth.

The STREAM Triad test results at NPS0 show the impact of limiting those variables, as the results are close to the theoretical xGMI bandwidth values in Table 1. Note that xGMI Maximum Link Width = x16 and xGMI Max Speed = 32Gbps in Maximum Performance Mode; the Operating Mode must be changed to Custom Mode to change these values.
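
These options can be changed interactively in System Setup (F1) or scripted with Lenovo XClarity Essentials OneCLI. The setting names in the sketch below are assumptions for illustration only and vary by platform and firmware level; list the actual names and values with the show command before changing anything.

# List processor-related UEFI settings and their current values (replace the credentials and BMC address)
OneCli config show Processors --bmc USERID:PASSWORD@<BMC_IP>

# Hypothetical example of selecting Custom Mode and forcing the xGMI link width;
# substitute the exact setting names and values reported by the show command above
OneCli config set OperatingModes.ChooseOperatingMode "Custom Mode" --bmc USERID:PASSWORD@<BMC_IP>
OneCli config set Processors.xGMIMaximumLinkWidth x16 --bmc USERID:PASSWORD@<BMC_IP>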

Table 4. 24 DIMMs Memory BW with NPS0
xGMI HW Config UEFI Settings Stream Triad (GB/s)
3 links Maximum Performance Mode -> Custom Mode
3-Link xGMI Max Speed = 32Gbps
xGMI Maximum Link Width = x16
NUMA Nodes per Socket = NPS0
373
4 links Maximum Performance Mode -> Custom Mode
4-Link xGMI Max Speed = 32Gbps
xGMI Maximum Link Width = x16
NUMA Nodes per Socket = NPS0
491
3 links Maximum Performance Mode -> Custom Mode
3-Link xGMI Max Speed = Minimal[16Gbps]
xGMI Maximum Link Width = x4
NUMA Nodes per Socket = NPS0
47
4 links Maximum Performance Mode -> Custom Mode
4-Link xGMI Max Speed = Minimal[16Gbps]
xGMI Maximum Link Width = x4
NUMA Nodes per Socket = NPS0
63

Intel Memory Latency Checker (MLC) is a tool used to measure memory latencies and bandwidth. It also provides options for measuring local and cross-socket memory latencies and bandwidth.

For more information about MLC, see the following web page:
https://www.intel.com/content/www/us/en/download/736633/763324/intel-memory-latency-checker-intel-mlc.html

We use the following commands to print the local and cross-socket memory latency and bandwidth matrices:

mlc --latency_matrix
mlc --bandwidth_matrix
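
As a supplementary illustration (not part of the MLC runs reported below), a similar local-versus-remote contrast can be produced by pinning a memory-bound workload, such as the STREAM binary built earlier, to one socket while forcing its allocations onto either the local or the remote node. This assumes NPS1 so that each NUMA node corresponds to one socket.

# Local access: threads and memory both on node 0
numactl --cpunodebind=0 --membind=0 ./stream

# Remote access: threads on node 0, memory forced onto node 1 (all traffic crosses the xGMI links)
numactl --cpunodebind=0 --membind=1 ./stream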

The following table shows the local and cross-socket memory latency and bandwidth on the 4 xGMI link configuration with different xGMI link widths and speeds. Local node latency and bandwidth are not affected by the xGMI link settings, but for the remote node, faster link speed and greater link width result in lower latency and higher bandwidth.

Table 5. 4 xGMI Links System Local and Cross-socket Memory Latencies/Bandwidth with NPS1
xGMI Width/Speed Local Node Latency (ns) Remote Node Latency (ns) Local Node Bandwidth (GB/s) Remote Node Bandwidth (GB/s)
X4 / 16 Gbps 110.5 351.8 369.6 24.8
X4 / 32 Gbps 110.8 250.7 369.8 50.1
X16 / 16 Gbps 110.4 244.6 369.8 90.6
X16 / 32 Gbps 110.4 199.1 370.2 152.2

The following table shows the results for the 3 xGMI link configuration, which lead to the same conclusion.

Table 6. 3 xGMI Links System Local and Cross-socket Memory Latencies/Bandwidth with NPS1
xGMI Width/Speed Local Node Latency (ns) Remote Node Latency (ns) Local Node Bandwidth (GB/s) Remote Node Bandwidth (GB/s)
X4 / 16 Gbps 111.2 348.8 369.8 18.6
X4 / 32 Gbps 110.8 245.6 370.3 37.5
X16 / 16 Gbps 111.2 243.6 370.3 68.9
X16 / 32 Gbps 110.4 196.9 370.2 116.3

Figures 8 and 9 compare the remote node latency and bandwidth of the 3 xGMI link and 4 xGMI link systems. The remote node latencies are very close, but the 4 xGMI link system has much higher remote node bandwidth than the 3 xGMI link system, with a ratio close to 4:3, matching the ratio of the number of links. Besides the number of xGMI links, remote node bandwidth also scales with the xGMI link width and speed.

Figure 8. Remote node latency comparison between 3 xGMI links and 4 xGMI links

Figure 9. Remote node bandwidth comparison between 3 xGMI links and 4 xGMI links

Summary

The ThinkSystem SR665 V3 has flexible xGMI inter-processor links allowing one link to be converted to two x16 PCIe 5.0 connections, which can provide more PCIe connections for greater PCIe/NVMe support.

The maximum theoretical bandwidth of four xGMI links is greater than the bandwidth of 12 channels of 4800MHz DDR5 memory, which means remote memory access can flow at nearly maximum bandwidth from one CPU to another. Three xGMI links may be acceptable for NUMA-aware workloads or reduced memory populations.

xGMI link speed and width are configurable in the UEFI. For NUMA-aware workloads, reduced link speed and width can save uncore power to reduce overall power consumption and divert more power to the cores for increased core frequency.

For NUMA-unaware workloads, when CPU 1 accesses memory attached directly to CPU 0, the request must cross the xGMI links between the two sockets. This access is "non-uniform": CPU 0 accesses this memory faster than CPU 1 because of the socket-to-socket hop. In this case, the number, speed, and width of the xGMI links all affect overall performance.

Authors

Peter Xu is a Systems Performance Verification Engineer in the Lenovo Infrastructure Solutions Group Performance Laboratory in Morrisville, NC, USA. His current role includes CPU, Memory, and PCIe subsystem analysis and performance validation against functional specifications and vendor targets. Peter holds a Bachelor of Electronic and Information Engineering and a Master of Electronic Science and Technology, both from Hangzhou Dianzi University.

Redwan Rahman is a Systems Performance Verification Engineer in the Lenovo Infrastructure Solutions Group Performance Laboratory in Morrisville, NC, USA. His current role includes CPU, Memory, and PCIe subsystem analysis and performance validation against functional specifications and vendor targets. Redwan holds a Bachelor of Science in Computer Engineering from University of Massachusetts Amherst.


Trademarks

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
ThinkSystem®
X4

Other company, product, or service names may be trademarks or service marks of others.