RAS Features of the Lenovo ThinkSystem SR950 and SR850
Abstract
Server downtime is very costly to enterprises, especially for business-critical or mission-critical workloads. "Always on" has become a global requirement that affects almost every aspect of our lives. The Lenovo ThinkSystem SR850 and SR950 provide multiple layers of reliability, availability, and serviceability (RAS) capabilities that help keep the servers running and protect data integrity.
Introduction
Server reliability, availability, and serviceability (RAS) are crucial issues for modern enterprise IT shops that deliver mission-critical applications and services, and application delivery failures can be extremely costly per hour of system downtime. In addition, the likelihood of such failures increases statistically with the size of the servers, data, and memory required for these deployments.
Mission-critical applications such as database, enterprise resource planning (ERP), customer relationship management (CRM), and business intelligence (BI) applications need to be available 24/7 on a wide area or global basis.
While clustering and virtualization can help meet availability requirements, they are not adequate solutions for very large databases, BI, and high-end transactional systems. A failure affecting a single core business application can easily cost hundreds of thousands or even millions of dollars per hour. All this leads to a need for scalable and highly resilient servers that are well suited for critical business applications and large-scale consolidation.
Always On
Time is money. Even a few minutes of downtime can result in significant costs and cause internal business operations to come to a standstill. Downtime can also adversely impact a company’s relationship with its customers, business suppliers and partners. Reliability or lack thereof can potentially damage a company’s reputation and result in lost business.
The growth of new applications has ratcheted database processing and business analytics to the top of the list for server workloads. These workloads demand continuous availability from the enterprise platforms on which they run.
"Always on" has become a global requirement and impacts many aspects of our lives:
- Maximize productivity - Manufacturers need to keep their production line up and running. System downtime should not interrupt it.
- Control access - Facility Security companies prevent external threats to organizations. Security application downtime shouldn't be an internal threat.
- Protect profit - Retailers have sales targets to meet day in, day out. Transaction system downtime shouldn’t get in the way.
- Protect lives - First Responders take care of emergencies 24 x 7 x 365. Application downtime shouldn’t be one of them.
- Ensure quality care and privacy - Healthcare Institutions need to access patient information and be HIPAA compliant all the time. System downtime shouldn’t compromise either one.
- Process transactions - Financial Services organizations manage thousands of transactions a second. Processing system downtime simply can’t happen.
The Cost of Downtime
The ITIC 2016 survey found that 98% of organizations say that a single hour of downtime costs over $100,000; 81% of respondents indicated that 60 minutes of downtime costs their business over $300,000 and a record one-third or 33% of enterprises report that one hour of downtime costs their firms $1 million to over $5 million.
Figure 1. Cost of hourly downtime in enterprises, 2016-2017
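To put the survey figures in context, the following minimal Python sketch converts an availability level into expected downtime per year and multiplies it by an hourly cost. The availability levels are illustrative assumptions, and the hourly costs simply reuse the survey's $100,000, $300,000, and $1 million thresholds; none of these numbers are measurements for a particular server.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year at a given availability level."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def annual_downtime_cost(availability_percent: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime, assuming a flat hourly cost."""
    return annual_downtime_hours(availability_percent) * cost_per_hour


if __name__ == "__main__":
    # Hypothetical availability targets and the survey's hourly cost thresholds.
    for availability in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(availability)
        for hourly_cost in (100_000, 300_000, 1_000_000):
            cost = annual_downtime_cost(availability, hourly_cost)
            print(f"{availability}% available: {hours:.2f} h/year down, "
                  f"${cost:,.0f}/year at ${hourly_cost:,}/hour")
```

Even at 99.9% availability, the expected downtime is several hours per year, which is why each additional "nine" matters at these hourly costs.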
Server RAS Defined
RAS in relation to servers is defined as follows:
Reliability – Increasing the mean time between hardware failures and ensuring data integrity. Data integrity is protected through error detection and correction or, if an error is not correctable, through error containment.
- Error Detection and Self-Healing
- Minimizes outage opportunities
- Correct results continuously
Availability – Refers to uninterrupted system and application operation even in the presence of uncorrectable errors
- Reduce frequency and duration of outages
- Self-diagnosing: work around faulty components or “self-heal”
- Never stops or slows down
Serviceability – Means a system can be maintained without disrupting operation. This capability requires both thoughtful platform design and innovative systems management.
- Avoid repeat failures with accurate diagnostics
- Concurrent repair on higher failure rate items
- Easy to repair and upgrade
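The three RAS terms are often tied together with the standard steady-state availability formula, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures (reliability) and MTTR is the mean time to repair (serviceability). The sketch below is a minimal illustration of that relationship; the MTBF and MTTR values are hypothetical and are not figures for the SR950 or SR850.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, given mean time between failures
    (MTBF) and mean time to repair (MTTR), both in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical values: the same failure rate with slower and faster repair.
    for mtbf, mttr in ((50_000, 4.0), (50_000, 1.0), (100_000, 1.0)):
        availability = steady_state_availability(mtbf, mttr)
        print(f"MTBF={mtbf} h, MTTR={mttr} h -> {availability:.6%} available")
```

The formula shows why serviceability counts as much as reliability: halving the repair time raises availability by about as much as doubling the time between failures.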
Key RAS Features of the SR950 and SR850
The ThinkSystem SR950 and ThinkSystem SR850 support self-monitoring and self-healing capabilities. This technology enables the server to monitor key sub-systems for errors, and automatically repair known issues.
Figure 2. SR850 (top) and SR950 (bottom)
Detecting and correcting problems (or isolating problems that cannot be immediately rectified) is important to maintain system integrity and protect mission-critical data. Support for multiple layers of system component redundancy and subsequent automated failover functionality ensures a higher level of availability. The SR950 and SR850 take advantage of predictive failure analysis to identify problematic components before they fail, allowing them to be replaced during regular maintenance cycles, and ultimately minimizing service costs.
Lenovo platform RAS innovation features include:
- Automated processor failover
- Automated firmware backup
- Automated memory page sorting and page retire
- Advanced transaction recovery
The servers also offer solution-level RAS with software stack integration:
- VMware virtualization
- Microsoft virtualization
The servers have the following RAS features:
- Provides Single Device Data Correction (SDDC, also known as Chipkill), Adaptive Double-Device Data Correction (ADDDC, also known as Redundant Bit Steering or RBS), memory mirroring, and memory rank sparing for redundancy in the event of a non-correctable memory failure.
- The Dual M.2 Boot Adapter supports RAID-1 which enables two installed M.2 drives to be configured as a redundant pair.
- Hot-swap redundant power supplies and hot-swap redundant fans to provide availability for mission-critical applications
- The power-source-independent light path diagnostics uses LEDs to lead the technician to failed (or failing) components, which simplifies servicing, speeds up problem resolution, and helps improve system availability
- LCD system information display panel provides more detailed diagnostics by displaying all error messages and VPD data needed for a service call, thereby aiding with problem resolution and system uptime
- Hot-swap drives, supporting RAID redundancy for data protection and greater system uptime
- Solid-state drives (SSDs) offer more reliability than traditional mechanical HDDs for greater uptime
- Proactive Platform Alerts (including PFA and SMART alerts): Processors, voltage regulators, memory, internal storage (SAS/SATA HDDs and SSDs, NVMe SSDs, M.2 storage, flash storage adapters), fans, power supplies, RAID controllers, server ambient and subcomponent temperatures.
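Predictive failure analysis is generally based on watching correctable errors: a component that logs too many of them in a short period is flagged for replacement before it fails outright. The following Python sketch illustrates that general idea with a sliding-window counter. The class name, threshold, and window length are hypothetical; this is not the algorithm used by Lenovo firmware or Intel processors.

```python
from collections import defaultdict, deque


class CorrectableErrorMonitor:
    """Flags a component once it logs too many correctable errors in a time window."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        # Maps a component name to the timestamps of its recent errors.
        self._events = defaultdict(deque)

    def record_error(self, component: str, timestamp: float) -> bool:
        """Record one correctable error; return True if an alert should be raised."""
        events = self._events[component]
        events.append(timestamp)
        # Discard errors that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


# Hypothetical example: alert after 3 correctable errors within one hour.
monitor = CorrectableErrorMonitor(threshold=3, window_seconds=3600)
alerts = [monitor.record_error("DIMM 7", t) for t in (0, 1000, 5000, 5100, 5200)]
print(alerts)  # [False, False, False, False, True]
```

The first two errors age out before the third arrives, so only the later burst of three errors within the window triggers the alert.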
RAS Features with Lenovo XClarity
In addition to the key RAS features of the SR950 and SR850 described above, Lenovo XClarity, a centralized systems management solution, continuously monitors system parameters, triggers alerts, and performs recovery actions in case of failures to minimize downtime.
XClarity has the following RAS features:
- Provides the tools that administrators need to deploy platforms more quickly and manage them more easily.
- Allows servers to ‘call home’ if they detect an issue, so that a potential problem can be addressed before it causes an outage.
- XClarity Provisioning Manager collects and saves service data to USB key drive or remote CIFS share folder, for troubleshooting and to reduce service time.
- The XClarity Administrator Mobile app, running on a supported smartphone and connected to the server through the service-enabled USB port, enables additional local systems management functions.
- Auto restart in the event of a momentary loss of AC power (based on the power policy setting in the XClarity Controller service processor)
- Collects and downloads diagnostic data, including logs, service data, and inventory to help identify the cause of the issue.
Unique Hardware Serviceability
The design of the ThinkSystem SR950 is based on a modular service model where access is from the front and rear only. This means that nearly all parts can be removed from the front or rear of the system, even parts that are located in the center of the server (e.g., fans, memory DIMMs, and processors). This design helps reduce time and cost associated with installing and maintaining systems, and can reduce the chance of errors occurring while working with the system.
Figure 3. Fans are accessible from the front of the SR950 server
To learn more about the design and usability of SR950, read the article Usability in the Design of the ThinkSystem SR950.
Intel RAS Features
The Intel Xeon Scalable Family processors offer Advanced and Standard RAS features.
- Bronze and Silver processors support Standard RAS features
- Gold and Platinum processors support Standard and Advanced RAS features.
The SR950 and SR850 use Gold and Platinum processors exclusively so they offer both Standard and Advanced RAS features.
The following table lists the Intel Advanced RAS features.
Advanced RAS features | Category | Benefit |
---|---|---|
Viral Mode of error containment | Reliability | Enhanced error containment to improve data integrity, complementary to corrupt data containment mode |
MCA Recovery-Execution path | Reliability | OS layer assisted recovery from uncorrectable data errors to prevent system reset |
MCA Recovery-Non execution path | Reliability | OS layer assisted recovery from uncorrectable data errors detected by Patrol scrubber or LLC Explicit Write Back |
Local Machine Check (LMCE) based Recovery | Reliability | Enhances MCA recovery-Execution path event, and increases the possibility of recovery |
SDDC +1, Adaptive DDDC (MR) +1 | Reliability | Adaptive virtual lockstep delivers up to two DRAM Device corrections. Also supports Single DRAM correction, as well as single bit correction post final DRAM device map out. |
PCI Express Live Error Recovery | Reliability | PCI-e root port error containment, and the opportunity to dynamically recover from the error |
Intel® UPI Dynamic Link width reduction | Availability | Enables interconnect to continue operation in presence of Interconnect link persistent failure |
Address range/Partial Memory Mirroring | Reliability | OS managed memory mirroring of selective ranges, increases data integrity at efficient cost |
MCA 2.0 Recovery (as per eMCA gen2 architecture) | Reliability | Firmware first model enables a reliable error sourcing capability with the ability to write to the MSR |
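
The address range (partial) memory mirroring entry in the table can be pictured as follows: writes to a selected address range are stored twice, and a read that hits an uncorrectable error in the primary copy is served from the mirror instead. The Python sketch below is only a conceptual model of that behavior. All class and method names are hypothetical, and real mirroring is performed by the memory controller and configured through firmware and the operating system, not by application code.

```python
class UncorrectableError(Exception):
    """Raised when a stored value cannot be read back reliably."""


class MemoryBank:
    """Toy model of a block of memory with injectable faults."""

    def __init__(self) -> None:
        self._cells = {}
        self._bad = set()

    def write(self, address: int, value: int) -> None:
        self._cells[address] = value

    def read(self, address: int) -> int:
        if address in self._bad:
            raise UncorrectableError(f"uncorrectable error at {address:#x}")
        return self._cells[address]

    def inject_fault(self, address: int) -> None:
        """Simulate an uncorrectable error at the given address."""
        self._bad.add(address)


class PartiallyMirroredMemory:
    """Keeps a second copy of data only for addresses in the mirrored range."""

    def __init__(self, mirrored_range: range) -> None:
        self.primary = MemoryBank()
        self.mirror = MemoryBank()
        self.mirrored_range = mirrored_range

    def write(self, address: int, value: int) -> None:
        self.primary.write(address, value)
        if address in self.mirrored_range:
            self.mirror.write(address, value)

    def read(self, address: int) -> int:
        try:
            return self.primary.read(address)
        except UncorrectableError:
            if address in self.mirrored_range:
                return self.mirror.read(address)  # recover from the mirrored copy
            raise  # unmirrored data cannot be recovered this way


memory = PartiallyMirroredMemory(mirrored_range=range(0, 1024))
memory.write(16, 42)              # inside the mirrored range
memory.primary.inject_fault(16)   # primary copy goes bad
print(memory.read(16))            # still readable from the mirror
```

Mirroring only a selected range trades capacity for protection: the protected range consumes twice the memory, while the rest of memory keeps its full capacity but relies on the other error-correction features.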
The following table lists the Intel Standard RAS features.
Standard RAS features | Category | Benefit |
---|---|---|
Advanced Error Detection and Correction (AEDC) | Reliability | Enhanced fault coverage within processor cores, and attempt to recover via instruction retry |
Error Detection and Correction | Reliability | Extensive Error detection and correction capability across the silicon, and the interconnects. |
Corrupt Data containment-Core | Reliability | Uncorrectable data explicitly marked and delivered synchronously to the consuming core to assist error containment and increase system reliability |
Corrupt Data containment-UnCore | Reliability | Uncorrectable data explicitly marked and delivered synchronously to the requestor, to assist error containment and increase system reliability |
SDDC, Adaptive Data Correction (SR) | Reliability | Adaptive virtual lockstep delivers single DRAM Device corrections, at bank granularity. Also supports Single DRAM correction. |
PCIe “Stop and Scream” | Reliability | PCI-e root port corrupt data containment feature, increases data integrity |
Memory Mirroring- Intra iMC | Reliability | Increase data integrity by creating a redundant/mirrored copy of data in system DRAM |
DDR4 memory RANK Sparing | Reliability | Reserved/spare DRAM RANKs are utilized to dynamically map out the failing DRAM RANK into the spare Ranks. |
Predictive Failure Analysis | Serviceability | Extensive error logs to assist software in predicting failures |
Failed DIMM Isolation | Serviceability | Extensive error logs to help software identify the failing DIMM |
Virtual (soft) Partitioning | Reliability | Virtual Machine Monitor ability to make use of hardware recovery, signaling and error logs |
Error reporting via IOMCA | Serviceability | Unified error reporting of the IIO logic to the OS |
Error reporting through MCA 2.0 (eMCA gen2) | Serviceability | Firmware first model enables a reliable error sourcing capability |
Error reporting through eMCA gen1 | Serviceability | Firmware first model enables reliable error sourcing capability |
PCIe Card Hot Plug NVMe (Add, Remove, Swap) | Serviceability | Hot add and replace of NVMe drives |
PCI Express ECRC | Reliability | PCI Express End to end CRC checking, increasing system reliability |
PCIe Corrupt Data Containment (Data Poisoning) | Reliability | PCIe corrupt data mode of operation, synchronous signaling of the corrupted data along with data, increases system reliability |
PCIe Link CRC Error Check and Retry | Reliability | PCIe link CRC error check and retry, system reliability and recovery from transient errors |
PCIe Link Retraining and Recovery | Reliability | PCIe link retraining and attempted recovery from persistent link transient errors |
Mem SMBus hang recovery | Reliability | Software ability to reset memory SMBus interface to recover from hang condition |
DDR4 Command/ Address Parity Check and Retry | Reliability | DDR4 Address and command parity check and retry in the event of errors |
Time-out timer Schemes | Serviceability | Hierarchy of transaction time outs to assist system debug and reliable error sourcing. |
Intel® UPI Link Level Retry | Reliability | Intel UPI link’s ability to perform CRC check and retry on errors for higher degree of system reliability |
Intel® UPI Protocol Protection via 16 bit Rolling CRC | Reliability | Detection of transient data errors over Intel UPI interconnects, via 16bit CRC error checking |
Processor BIST | Serviceability | At power up, the processor’s built-in self-test engine tests the internal cache structures and provides the results to the system BIOS |
Socket disable for FRB | Availability | The capability to selectively disable a socket at boot time, thereby allowing the system to power on in a failover configuration |
Core disable for FRB | Availability | The capability to disable failing cores at boot time and map them out |
PIROM for System Information Storage | Serviceability | On package Processor Information ROM |
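
Several entries in the table (PCIe Link CRC Error Check and Retry, Intel UPI Link Level Retry) rely on the same pattern: the sender attaches a CRC to each transfer, the receiver recomputes it, and a mismatch triggers a retransmission. The Python sketch below illustrates that pattern in software. It uses the 16-bit CRC-CCITT from the Python standard library as a stand-in; the actual CRC polynomials, frame formats, and hardware replay mechanisms of PCIe and Intel UPI are different. The function names and the simulated channel are hypothetical.

```python
import binascii


def frame(payload: bytes) -> bytes:
    """Append a 16-bit CRC to the payload."""
    crc = binascii.crc_hqx(payload, 0)
    return payload + crc.to_bytes(2, "big")


def check(framed: bytes) -> bool:
    """Recompute the CRC on the receiving side and compare it with the one sent."""
    payload, received_crc = framed[:-2], int.from_bytes(framed[-2:], "big")
    return binascii.crc_hqx(payload, 0) == received_crc


def send_with_retry(payload: bytes, channel, max_retries: int = 3):
    """Resend until the CRC check passes; return (payload, attempts used)."""
    for attempt in range(1, max_retries + 2):  # first try plus max_retries retries
        received = channel(frame(payload))
        if check(received):
            return received[:-2], attempt
    raise RuntimeError("link error persisted after retries")


def make_flaky_channel(corrupt_first_n: int):
    """Simulated link that flips one bit in the first N transfers."""
    remaining = corrupt_first_n

    def channel(data: bytes) -> bytes:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            return bytes([data[0] ^ 0x01]) + data[1:]  # single-bit transient error
        return data

    return channel


# Two transient errors are absorbed by retries; the third attempt succeeds.
print(send_with_retry(b"cache line", make_flaky_channel(2)))  # (b'cache line', 3)
```

Transient errors are hidden from the software above the link, while an error that survives every retry is reported so that other mechanisms, such as link retraining or error containment, can take over.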
Conclusion
The Lenovo ThinkSystem SR950 and SR850 RAS technologies drive the outstanding system availability and uninterrupted application performance needed to host business or mission-critical applications.
Enterprises whose productivity and success depend on large-scale, mission-critical applications require a scale-up high availability server. The SR950 and SR850 must be on the shortlist for any enterprise that is looking at refreshing its high availability or mission critical systems.
Further reading
This article is one in a series on the ThinkSystem SR950 and SR850 servers:
- Five Highlights of the ThinkSystem SR950
- Five Highlights of the ThinkSystem SR850
- Choosing between Lenovo ThinkSystem SR850 and SR950
- Workloads for 4-Socket and 8-Socket Servers
- Usability in the Design of the ThinkSystem SR950
- The Value of Refreshing Your 4-Socket Servers with the ThinkSystem SR950
- ThinkSystem SR950 Memory Decisions
- ThinkSystem SR950 Server Configurations
- The Value of Refreshing Your 8-Socket Servers with the ThinkSystem SR950
- RAS Features of the Lenovo ThinkSystem SR950 and SR850
- Lenovo ThinkSystem SR950 New Options and Features - December 2017
- ThinkSystem SR950 Performance Leadership
- Lenovo Servers for Mission Critical Workloads
- Microsoft and Lenovo ThinkSystem SR950 – A Perfect Match
- Accelerate Your 4- and 8-Socket Server Refresh Cycle
- SAP Business Process Applications and Lenovo ThinkSystem SR950 – A Perfect Match
- ThinkSystem SR950 New Options - March 2018
- SAP HANA and Lenovo ThinkSystem SR950 – A Perfect Match
- ThinkSystem SR950 Performance Leadership Continues
- New Solution for SAP HANA - Lenovo ThinkAgile HX
- The Advantages of Keeping Mission Critical Workloads On-Premises vs Going to the Cloud
- SQL Server Migration and Lenovo ThinkSystem SR950
About the author
Randall Lundin is the Mission Critical Product Manager in the Lenovo Data Center Group. He is responsible for managing and planning Lenovo’s 4-socket and 8-socket servers. Randall has also authored and contributed to numerous Lenovo Press publications in the Mission Critical space.
Trademarks
Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.
The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
ThinkAgile
ThinkSystem
XClarity®
The following terms are trademarks of other companies:
Intel® and Xeon® are trademarks of Intel Corporation or its subsidiaries.
Microsoft® and SQL Server® are trademarks of Microsoft Corporation in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.