vSphere 6.5: DRS what’s new – Part 3 – Proactive HA
Adding to the many other new features and functionality that I’ve written about in Part 1 and Part 2, I want to touch on Proactive HA. Although Proactive HA sounds like an HA feature (and you will see that it lives under an availability tab), it is actually a feature of DRS that proactively avoids needing to use vSphere HA (I’ll get into that below). The idea is that host hardware can often begin to fail without the VI Admin’s knowledge. The degradation can go on for minutes, hours, or even days before the host eventually fails and its workloads need to be restarted by HA, when in reality, if only the VI Admin (or vCenter) had known, the workloads could have been kept from failing.
In the words of Dilbert:
Proactive HA – Give me the details!
Ok, let me give you the scoop. Proactive HA is a new feature that we have been developing that adds an additional layer of availability to your environment. Proactive HA integrates with the Server vendor’s monitoring software (more on this later), via a Web Client plugin, which will pass detailed server health status/alerts to DRS, and DRS will react based on the health state of the host’s hardware.
New Server Health State – Quarantine Mode
In introducing this new feature, we’ve also introduced a new host health state called Quarantine Mode. Quarantine Mode can only be induced by Proactive HA (it cannot be manually enabled/disabled like maintenance mode). A host in Quarantine Mode will attempt to evacuate its running virtual machines IF:
- Doing so has no performance impact on any virtual machine in the cluster
- None of the DRS Affinity/Anti-Affinity rules are violated
If the above are satisfied, then the VMs will evacuate and DRS will avoid placing virtual machines on said quarantined host.
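To make the two conditions concrete, here is a minimal Python sketch. It is a toy model and not how DRS is implemented: the `MigrationCheck` class, its field names, and the idea of evaluating the conditions one VM at a time are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MigrationCheck:
    """Hypothetical result of evaluating one VM move off a quarantined host."""
    vm_name: str
    causes_performance_impact: bool  # would any VM in the cluster be slowed down?
    violates_drs_rule: bool          # would an affinity/anti-affinity rule break?


def can_evacuate(check: MigrationCheck) -> bool:
    """A VM leaves the quarantined host only if both conditions hold."""
    return not check.causes_performance_impact and not check.violates_drs_rule


def plan_evacuation(checks):
    """Split VM names into those that can move and those that stay put."""
    move = [c.vm_name for c in checks if can_evacuate(c)]
    stay = [c.vm_name for c in checks if not can_evacuate(c)]
    return move, stay


if __name__ == "__main__":
    checks = [
        MigrationCheck("web01", False, False),  # free to move
        MigrationCheck("db01", False, True),    # blocked by a DRS rule
        MigrationCheck("app01", True, False),   # blocked by performance impact
    ]
    print(plan_evacuation(checks))
```

In this sketch only `web01` would be moved; the other two VMs stay on the quarantined host until the blocking condition clears.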
How is this different than Maintenance Mode?
Great question! This is different because unlike Maintenance Mode, a host in Quarantine Mode can and will be utilized EVEN IN QUARANTINE MODE if there are not enough resources in the cluster to satisfy the performance of all the virtual machines that are running. So in the case of maintenance mode, that host is completely unusable, regardless of the situation. In Quarantine Mode, DRS will evacuate the host and avoid placing other workloads on it, unless performance degradation is imminent, at which time, the QM host will begin accepting workloads.
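The difference between the two modes can be sketched as a placement filter. Again, this is only an illustration under stated assumptions: the `HostState` enum, the `placement_candidates` function, and the single "capacity" number standing in for cluster resources are hypothetical, and real DRS placement weighs far more than this.

```python
from enum import Enum


class HostState(Enum):
    NORMAL = "normal"
    QUARANTINE = "quarantine"
    MAINTENANCE = "maintenance"


def placement_candidates(hosts, demand):
    """Return the names of hosts that workloads could be placed on.

    hosts:  list of (name, HostState, free_capacity) tuples
    demand: capacity the workloads need, in the same (arbitrary) unit
    """
    healthy = [(n, c) for n, s, c in hosts if s is HostState.NORMAL]
    quarantined = [(n, c) for n, s, c in hosts if s is HostState.QUARANTINE]
    # Hosts in maintenance mode are never candidates, whatever the demand.

    if sum(c for _, c in healthy) >= demand:
        # Enough healthy capacity: quarantined hosts are avoided.
        return [n for n, _ in healthy]
    # Not enough healthy capacity: quarantined hosts accept workloads again.
    return [n for n, _ in healthy] + [n for n, _ in quarantined]


if __name__ == "__main__":
    hosts = [
        ("esx01", HostState.NORMAL, 10),
        ("esx02", HostState.QUARANTINE, 10),
        ("esx03", HostState.MAINTENANCE, 10),
    ]
    print(placement_candidates(hosts, demand=8))   # healthy hosts only
    print(placement_candidates(hosts, demand=15))  # quarantined host is used too
```

Note that `esx03` never appears in the result, no matter how high the demand goes. That is the whole point of the comparison: maintenance mode is absolute, quarantine mode is a preference.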
You mentioned this works with server vendor’s monitoring software. Who are you working with?
Currently we’ve been working with a select few of our partners with the biggest footprint:
As Proactive HA will be GA in 6.5, our partners are working with us to get their plugins and Proactive HA solutions certified and out the door. Expect more from me (and them) as that time nears.
What does this look like? Where do I configure it?
Proactive HA will most likely be configured in two locations. The first is in the server management software (Cisco UCS Manager, Dell OpenManage, etc). The second is in the cluster settings for a given group of hosts.
For the cluster settings, select a cluster > Edit Settings > vSphere Availability
(Ok, Pause!) Now before everyone freaks out, let me explain why this doesn’t say vSphere HA. When we added this feature, although it leverages DRS in the background, it’s still an availability feature, more so than a performance feature. It didn’t make sense to put it under vSphere HA, so the decision was made to rename vSphere HA to vSphere Availability. Once inside that tab, you will still see all of the vSphere HA settings. (BREATHE! I know change can be hard, but bear with me)
Here you will see the newly designed Availability tab. We can begin by turning on Proactive HA here.
You can see the information bubble will give you a little more detail about this feature.
Once Proactive HA has been turned on, click on Proactive HA Failures and Responses. You’ll see Automation Level, Remediation, and Provider sections.
If we expand the first two sections, the Automation Level should look very familiar: these are DRS-style automation level settings, but they are independent of the DRS settings for the cluster. These settings apply solely to Proactive HA. If placed in Manual mode, you WILL have to watch for the DRS recommendations and apply them within the web client.
Remediation gives you 3 options for how Proactive HA will handle host-degradation alerts. The first is to place a host into Quarantine Mode for any alert or degradation regardless of the severity.
The second is to place hosts into Quarantine Mode for moderate degradation, but Maintenance Mode for severe degradation.
The third option is to place hosts into Maintenance Mode for any alert or degradation regardless of severity. I recommend Quarantine Mode for all failures, but that’s just me.
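The three remediation options boil down to a small lookup from alert severity to host mode. The sketch below restates them in Python; the policy names (`quarantine_for_all`, `mixed`, `maintenance_for_all`) and the function are made up for illustration and are not the labels or API used by vCenter.

```python
# Hypothetical names for the three remediation options described above.
REMEDIATION_POLICIES = {
    # Option 1: Quarantine Mode for any degradation.
    "quarantine_for_all": {"moderate": "quarantine", "severe": "quarantine"},
    # Option 2: Quarantine Mode for moderate, Maintenance Mode for severe.
    "mixed": {"moderate": "quarantine", "severe": "maintenance"},
    # Option 3: Maintenance Mode for any degradation.
    "maintenance_for_all": {"moderate": "maintenance", "severe": "maintenance"},
}


def host_mode_for(policy: str, severity: str) -> str:
    """Return the mode a degraded host would be placed in."""
    try:
        return REMEDIATION_POLICIES[policy][severity]
    except KeyError:
        raise ValueError(f"unknown policy/severity: {policy!r}/{severity!r}")


if __name__ == "__main__":
    for policy in REMEDIATION_POLICIES:
        for severity in ("moderate", "severe"):
            print(f"{policy:20} {severity:9} -> {host_mode_for(policy, severity)}")
```

Laid out this way, it is easy to see that the options only differ in how they treat each severity level, which is why the mixed option is the only one where the severity reported by the provider changes the outcome.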
Below the Remediation section you’ll see the provider section. A provider will be shown when the corresponding web client plugin is installed in vCenter. In this case, I’ve posted a beta provider that I used during our VMware Beta Program. Normally this would be the name of the hardware provider. To enable the provider, click the checkbox and click ‘OK’. And that’s it!
Proactive HA DOES allow you to choose what is being monitored and alerted on, based on the provider. Each provider will be different in that it comes from the server vendor and may have additional features/functionality that their competitors don’t. But let’s say that you did not want Proactive HA to react to fan failures in the server chassis, for whatever reason. Instead of clicking ‘OK’ like we did above, click ‘edit’ next to the provider. It will pop up a window that will allow you to select any ‘Failure condition’ that the provider reports on. Since this is a demo provider, I’ve only included two failure conditions, but expect to see several dozen from our partners. In this case, I would check the ‘Fan’ box and click ‘OK’. From here on, Proactive HA will keep any Fan alerts from triggering remediation.
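Conceptually, the failure condition filter is just an exclusion list applied before remediation. Here is a small Python sketch of that idea. The alert dictionaries, the `condition` key, and the condition names other than ‘Fan’ are assumptions for illustration, since the real list comes from each vendor’s provider.

```python
def alerts_to_remediate(alerts, filtered_conditions):
    """Drop alerts whose failure condition has been filtered out.

    alerts:              list of dicts with a 'condition' key (hypothetical shape)
    filtered_conditions: condition names checked in the provider's edit dialog
    """
    filtered = {c.lower() for c in filtered_conditions}
    return [a for a in alerts if a["condition"].lower() not in filtered]


if __name__ == "__main__":
    alerts = [
        {"host": "esx01", "condition": "Fan"},
        {"host": "esx02", "condition": "PSU"},
    ]
    # With 'Fan' checked, only the PSU alert is left to cause remediation.
    print(alerts_to_remediate(alerts, ["Fan"]))
```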
As you can see in this picture, my host 10.161.241.85 has been placed into Quarantine Mode. If I look at the events for this host I can see a few entries. This one in particular stands out. It says “Beta Demo Provider reported a moderately degraded status on host 10.161.241.85 for ID 101 (PSU general health monitoring) in update 1472224377494. Please contact your hardware vendor support.”
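If you wanted to pull the fields out of an event like this one for reporting, a regular expression is enough. The pattern below is a sketch built from this single example message only. How the wording varies for other severities or other providers is an assumption on my part, so treat it as a starting point.

```python
import re

# Pattern derived from the one example message quoted above (an assumption
# that other Proactive HA events follow the same wording).
EVENT_PATTERN = re.compile(
    r"^(?P<provider>.+?) reported a (?P<severity>\w+) degraded status "
    r"on host (?P<host>\S+) for ID (?P<id>\d+) \((?P<description>[^)]*)\) "
    r"in update (?P<update>\d+)\."
)


def parse_event(message: str):
    """Return the event fields as a dict, or None if the text does not match."""
    match = EVENT_PATTERN.match(message)
    return match.groupdict() if match else None


if __name__ == "__main__":
    message = (
        "Beta Demo Provider reported a moderately degraded status on host "
        "10.161.241.85 for ID 101 (PSU general health monitoring) in update "
        "1472224377494. Please contact your hardware vendor support."
    )
    for key, value in parse_event(message).items():
        print(f"{key}: {value}")
```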
There is also a spot for remediation instructions that partners can use for potentially reporting which part number you need to replace, instructions on what to do, or KB articles to look at.
I will have more details and more in-depth articles as our partners finish up their integrations with Proactive HA. We are very excited to be able to add this additional layer of availability for your virtual machines, in hopes that Proactive HA can evacuate a host before it fails. In our betas, on-site discussions, and customer technical advisory boards, the customers we showed this to were ecstatic to be able to increase VM uptime and save workloads from requiring vSphere HA.