Technical debt. This is a term used more and more to describe established IT environments that are overdue for updates of many kinds. The demand to deploy new systems and applications along with the lack of budget for regular hardware replacement means regular maintenance tasks can be ignored or deferred too many times, leaving the IT environment at risk.
For example, running obsolete vSphere ESXi v5.1 on old processor hardware (e.g., Sandy Bridge, Ivy Bridge, or Opteron servers) means that any failure will trigger a mad scramble to find replacements or to get workarounds at the application level. Best practices for networking, virtualization, and backups also evolve over time, but we rarely see deployed systems updated to current standards.
A few years ago, we got a call from a customer that one of their storage arrays (in a remote location) had gone offline and they wanted our help to address the problem. When our engineer went to the site, he found that the array shut down when the fourth(!) cache backup battery failed, so both controllers were locked in read-only mode. When he looked deeper, he found the equipment was still running the original deployed code, and that no firmware maintenance had EVER been performed. The systems were so far behind the current support matrix we had to have HPE services come in to perform the required maintenance. Once all four cache batteries were replaced, the storage array came back up so it could be updated to supported code.
A Recent Wake Up Call
When the rootkit attacking Hewlett Packard Enterprise's iLO server manager hit the news in December 2021, we all had to scramble for information and determine whether any of our sites (or our customers') were at risk. Attacks against hardware management devices are especially scary because they can affect servers long before the operating system or hypervisor boots. iLO can also manage storage and even trigger secure drive wipes remotely.
For all the publicity around the new rootkit in the field, the underlying vulnerability was detected back in August 2017. HPE addressed CVE-2017-12542 at the time in this security bulletin.
Per this security bulletin, only iLO4 servers running firmware before v2.53 (released May 5, 2017) are affected. Today, this may sound dismissive, but as recently as 2019 we saw servers in the field with iLO firmware released in 2015 and not updated since!
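As a rough illustration of that cutoff, the following Python sketch flags iLO4 systems whose firmware is older than v2.53. The inventory list, the server names, and the helper functions are hypothetical and exist only for this example; in practice the firmware versions would come from whatever inventory or management tool you already use.

```python
# Minimal sketch: flag iLO4 systems running firmware older than v2.53.
# The inventory below is made-up example data, not a real environment.

def parse_version(text):
    """Turn a version string such as 'v2.53' or '2.53' into a tuple like (2, 53)."""
    return tuple(int(part) for part in text.strip().lstrip("vV").split("."))

# First firmware version that is not affected, per the bulletin cited above.
FIXED_ILO4_VERSION = parse_version("2.53")

def is_affected(ilo_generation, firmware_version):
    """Return True for iLO4 firmware older than v2.53, False otherwise."""
    if ilo_generation != 4:
        return False  # the cutoff discussed here applies to iLO4 only
    return parse_version(firmware_version) < FIXED_ILO4_VERSION

# Hypothetical inventory: (server name, iLO generation, firmware version).
inventory = [
    ("server-a", 4, "2.30"),
    ("server-b", 4, "2.53"),
    ("server-c", 4, "v2.70"),
    ("server-d", 5, "1.40"),
]

affected = [name for name, gen, fw in inventory if is_affected(gen, fw)]
print("Needs an iLO firmware update:", affected)
```

Comparing versions as tuples of integers avoids the trap of comparing them as plain strings or floating-point numbers. The sketch assumes two-part version strings of the form major.minor.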
What does this mean for you? Systems that aren’t patched because "they are working fine", servers that are running obsolete and unsupported code, and networks that are configured for convenience but not security are all forms of technical debt (shortcuts for the sake of speed or budget) that eventually must be repaid.
Patch, Patch, Patch
First and foremost, you should regularly update the applications, the Service Pack for ProLiant (SPP), and the iLO firmware deployed on your servers. System maintenance requires regularly scheduled downtime, but modern virtualized or clustered environments should allow server reboots without service outages. Keeping applications patched also helps with addressing issues like Log4j.
Keep it Current
Secondly, keep your deployed applications current. Yes, this can be time-consuming, but recovering from an application issue when the app was last supported 10 years ago will be a nightmare. Legacy systems like building operations (e.g., HVAC controls, access controls) still need to be updated over time. Running operating systems or hypervisors that are beyond support is also a bad idea. Yes, this will drive hardware replacements, but that is a normal cost of operations. A server that is running applications that are 10 years old is 5 years overdue for replacement, both hardware and software.
Network Design for Security First, not Convenience
Finally, device management should be done on a subnet with very restricted access and no unmanaged bidirectional Internet access at all!
One report on iLOBleed showed the results of a Shodan scan: hundreds of iLO ports directly exposed to the Internet. This type of scan only finds iLO ports that face the Internet, not iLO ports that sit on the main production network in your environment. Either way, unrestricted access to iLO ports is a security risk.
Device management ports should be on a management subnet with access controls blocking traffic from unauthorized sources. The new iLO Amplifier virtual appliance for reporting to the Insight Manager cloud acts as an outbound relay, masking the iLO network as a result. Your network should also have access control lists configured to allow specific workstations to access the management network to manage switches, storage, and other devices. The general run of user workstations should not have access to management features. A PC that is infected with malware can’t infect devices it can’t see.
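The access-control idea above can be sketched in a few lines of Python. This is only an illustration of the allow-list logic, not a switch or firewall configuration. The subnet, the workstation addresses, and the function name are all assumptions made up for the example.

```python
import ipaddress

# Hypothetical addressing, chosen only for this sketch.
MANAGEMENT_SUBNET = ipaddress.ip_network("10.10.99.0/24")
ALLOWED_ADMIN_SOURCES = [
    ipaddress.ip_network("10.10.5.10/32"),  # example admin workstation
    ipaddress.ip_network("10.10.5.11/32"),  # example admin workstation
]

def is_permitted(source, destination):
    """Allow traffic into the management subnet only from listed admin sources."""
    src = ipaddress.ip_address(source)
    dst = ipaddress.ip_address(destination)
    if dst not in MANAGEMENT_SUBNET:
        return True  # this sketch only governs traffic to the management subnet
    # Default deny: anything not explicitly listed is blocked.
    return any(src in network for network in ALLOWED_ADMIN_SOURCES)

print(is_permitted("10.10.5.10", "10.10.99.20"))   # admin workstation to an iLO port
print(is_permitted("10.10.7.55", "10.10.99.20"))   # ordinary user workstation
print(is_permitted("203.0.113.9", "10.10.99.20"))  # address outside the site
```

The design choice that matters is the default deny: a source reaches the management subnet only if it appears on the list, so a newly added or infected workstation has no access unless someone grants it.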
New Server Hardware Too
This should also be a wake-up call to replace your older iLO3 and iLO4 (Gen9 and earlier) servers with Gen10 systems running iLO5 management. One of the most publicized features of Gen10/iLO5 is the Silicon Root of Trust: the signatures used to validate the integrity of the iLO and UEFI/BIOS firmware are built into the iLO ASIC, which is intended to prevent tampering with those signatures anywhere in the supply chain.
Where Do I Start?
An Nth Generation Business Impact Analysis (BIA) is a good place to start documenting your environment; knowing what you have is the foundation for moving forward. A BIA creates a full inventory of the hardware on site and also documents the application software deployed, which group owns each product, and how the systems are interrelated. The BIA also documents any outdated, obsolete, or unsupported assets in the environment.
Another piece of the puzzle is to implement (and execute) regularly planned and scheduled maintenance outages. A scheduled service outage twice a year is far better than an unplanned failure outage in the middle of the busy season or end-of-month business processing.
Nth can provide technical assistance with implementing networking updates to improve security and with the time and effort required for patching, especially in blade environments, which can have complex dependencies during firmware updates. We can also help with deploying new tools like the iLO Amplifier Pack as part of securing the management network.
When it is time to update and replace servers, Nth Solution Architects can help properly size new hardware. The newest generations of Intel and AMD processors deliver much higher performance per core and per GHz of clock speed, and support larger memory capacities, than Intel E5-2600v3 and older or AMD Opteron processors. More computing horsepower and more RAM in a server allow for true server consolidation, saving you not just space and power in your datacenter but also software licensing and support costs when you retire excess socket-based licenses.