Just before Christmas I was asked how one should patch a cluster that has not been patched in over a year. The answer is a bit tricky, but I will guide you through it and go over the pros and cons of the two approaches that were tried on two different clusters.
The two ways of doing it are:
Online
With this approach the cluster stays up and running and you patch one node at a time.
Pros
- No downtime, users can still work and you will be servicing your clients
Cons
- Takes longer, way longer
- Storage rebuilds after each server reboot
- If there is an issue, an outage might happen
- Could result in data loss if you hit one of the major bugs that have been fixed in newer CUs
Offline
Here you stop all cluster resources, stop the cluster and disable the cluster service on each node.
Pros
- Fast, very fast actually from start to finish
- No risk of data loss
- No rebuilds, as the cluster is offline
Cons
- Need to schedule downtime for the cluster
Case
The client was not able to schedule a maintenance window where they could take the cluster offline. They were on the November 2017 patches, which put them 13 months behind, and they asked how they should proceed. Online patching of an S2D cluster that has not been patched in that long has to be done in steps, not by going all the way to the latest CU in one go on each node. So the question was which patches need to go in.
In May 2018 Microsoft released a Servicing Stack Update (SSU) to address some vulnerabilities in the Intel CPU architecture and some BitLocker issues, so this one needs to go in. Microsoft also released a new Servicing Stack Update in November that requires the May SSU to be installed first.
So to get to the Nov/Dec Cumulative Update you first need to apply the following KBs, in this order:
KB4132216
KB4465659
Once these two small patches are installed you can apply the latest Cumulative Update for Windows Server 2016. The list of SSUs that must be applied before a Cumulative Update might change in the future, so always read the notes for every newly released SSU to see whether it has any prerequisites. If you are far behind, you might need to apply two, three or four SSUs before a CU.
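To make the ordering concrete, here is a minimal sketch that reports which prerequisite SSUs are still missing, in the order they should go in. The function name `Get-MissingPrerequisite` and the sample list of installed updates are made up for illustration; only the two KB numbers come from the list above.

```powershell
# Hypothetical helper: returns the required KBs that are not installed yet,
# keeping the order in which they must be applied.
function Get-MissingPrerequisite {
    param(
        [string[]]$RequiredInOrder,  # KBs in the order they must be installed
        [string[]]$InstalledIds      # KB IDs already present on the node
    )
    $RequiredInOrder | Where-Object { $InstalledIds -notcontains $_ }
}

# Order from the list above: May 2018 SSU first, then the November 2018 SSU.
$RequiredSsu = "KB4132216", "KB4465659"

# Sample data only: pretend the node already has the May SSU.
$SampleInstalled = "KB4132216", "KB4049065"

Get-MissingPrerequisite -RequiredInOrder $RequiredSsu -InstalledIds $SampleInstalled
```

On a real node you would pass `(Get-HotFix).HotFixID` as `-InstalledIds` instead of the sample list, assuming the SSUs show up in the `Get-HotFix` output on your build.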
The client decided to run the patches in steps: applying KB4132216, rebooting, and then applying KB4103720 (the May patch), as they did not want to go all the way to the December CU. They did this on the first node, set the node in maintenance mode and disk maintenance mode, then rebooted, and everything came back up fine. Once the repairs were done, they proceeded with node 2, following the same steps. When node 2 came back up, things started to go bad. Disks went into Lost Communication, virtual disks went offline, and suddenly all virtual disks were offline. Once they got the disks back, they managed to bring the virtual disks back online and repaired, and everything was up and running again. But it took them about 12 hours to get all virtual machines back up.
The question came back again: what should we do? The answer was the same as before: offline patching. Stop the entire cluster. So they scheduled to take the entire cluster down at 01:00 the next night and proceeded to patch all nodes.
To do the offline patching, follow these steps.
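The exact commands the client ran are not recorded here. As a rough sketch, putting one node into maintenance mode and disk maintenance mode for online patching usually looks something like this. The function names `Enter-NodeMaintenance` and `Exit-NodeMaintenance` and the node name `NODE01` are made up for illustration, and the sketch assumes it runs on a cluster node with the FailoverClusters and Storage modules available.

```powershell
# Sketch only. Assumes an S2D cluster node with the FailoverClusters and
# Storage modules. The function names are hypothetical.

function Enter-NodeMaintenance {
    param([Parameter(Mandatory)][string]$NodeName)
    # Pause the node and drain its roles to the other nodes.
    Suspend-ClusterNode -Name $NodeName -Drain -Wait
    # Put the disks in this node into storage maintenance mode.
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -EQ $NodeName |
        Enable-StorageMaintenanceMode
}

function Exit-NodeMaintenance {
    param([Parameter(Mandatory)][string]$NodeName)
    # Take the disks out of storage maintenance mode first.
    Get-StorageFaultDomain -Type StorageScaleUnit |
        Where-Object FriendlyName -EQ $NodeName |
        Disable-StorageMaintenanceMode
    # Resume the node and move its roles back.
    Resume-ClusterNode -Name $NodeName -Failback Immediate
    # List the repair jobs that start once the node has rejoined.
    Get-StorageJob
}

# Usage for one node ("NODE01" is a placeholder):
#   Enter-NodeMaintenance -NodeName "NODE01"
#   ...install the updates and reboot NODE01...
#   Exit-NodeMaintenance -NodeName "NODE01"
```

Wait for the repair jobs to finish before touching the next node. Note that even with these steps in place, the client still ran into trouble on node 2.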
# Stop the cluster
$ClusterName = "JTHVS2DCL"
$Cluster = Get-Cluster -Name $ClusterName
# Build the fully qualified cluster name
$ClusterName = $Cluster.Name + "." + $Cluster.Domain
$ClusterNodes = Get-ClusterNode -Cluster $ClusterName
# Stop all virtual machine resources first
Invoke-Command -ComputerName $ClusterNodes[0].Name -ScriptBlock {
    Get-ClusterResource | Where-Object ResourceType -EQ "Virtual Machine" | Stop-ClusterResource
}
Stop-Cluster -Cluster $ClusterName
# Disable the cluster service so it does not start when the nodes reboot during patching
Foreach ($ClusterNode in $ClusterNodes) {
    Set-Service -ComputerName $ClusterNode.Name -Name ClusSvc -StartupType Disabled
}
# Now patch and reboot all nodes
# Start the cluster once patching is done and all nodes are back up
Foreach ($ClusterNode in $ClusterNodes) {
    Set-Service -ComputerName $ClusterNode.Name -Name ClusSvc -StartupType Automatic
}
Invoke-Command -ComputerName $ClusterNodes[0].Name -ScriptBlock {
    Start-Cluster
}
# Verify that cluster resources and disks are back up, and check whether any storage jobs are running
Invoke-Command -ComputerName $ClusterNodes[0].Name -ScriptBlock {
    Get-ClusterResource | Where-Object ResourceType -NotLike "*Virtual Machine*" | Where-Object State -EQ "Offline" | Start-ClusterResource
    Get-VirtualDisk
    Get-StorageJob
}
# Verify that all physical disks are back up
Invoke-Command -ComputerName $ClusterNodes[0].Name -ScriptBlock {
    Get-PhysicalDisk
}
# Now you can start the virtual machines, either all at once with PowerShell or manually in Failover Cluster Manager (FCM) or any other tool you use
Invoke-Command -ComputerName $ClusterNodes[0].Name -ScriptBlock {
    Get-ClusterResource | Where-Object ResourceType -Like "*Virtual Machine*" | Where-Object State -EQ "Offline" | Start-ClusterResource
}
This will safely patch your cluster with limited downtime and close to zero risk of data corruption.
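If you would rather not eyeball the `Get-VirtualDisk` and `Get-StorageJob` output before starting the virtual machines, the check can be reduced to a single yes/no answer. This is a minimal sketch: the function name `Test-StorageSettled` and the sample objects are made up for illustration, and it only looks at the `HealthStatus` and `JobState` properties.

```powershell
# Hypothetical helper: returns $true only when every virtual disk is Healthy
# and no storage job is still running. An empty disk list counts as settled,
# so make sure Get-VirtualDisk actually returned your volumes.
function Test-StorageSettled {
    param(
        [object[]]$VirtualDisk,  # output of Get-VirtualDisk
        [object[]]$StorageJob    # output of Get-StorageJob
    )
    $unhealthy = @($VirtualDisk | Where-Object { $_.HealthStatus -ne "Healthy" })
    $running   = @($StorageJob  | Where-Object { $_.JobState -eq "Running" })
    ($unhealthy.Count -eq 0) -and ($running.Count -eq 0)
}

# Sample data only: one volume still in Warning and a repair job running.
$SampleDisks = @(
    [pscustomobject]@{ FriendlyName = "Volume01"; HealthStatus = "Healthy" },
    [pscustomobject]@{ FriendlyName = "Volume02"; HealthStatus = "Warning" }
)
$SampleJobs = @(
    [pscustomobject]@{ Name = "Repair"; JobState = "Running" }
)

Test-StorageSettled -VirtualDisk $SampleDisks -StorageJob $SampleJobs  # False
```

On a cluster node you would call it as `Test-StorageSettled -VirtualDisk (Get-VirtualDisk) -StorageJob (Get-StorageJob)` and hold off on starting the virtual machines until it returns True.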
Conclusion
Depending on how brave you want to be, you can choose between the online version (cluster up) or the offline version (cluster stopped). If I had to choose, I'd schedule downtime for the cluster, because what we have seen is that jumping 10, 11, 12, 13 months or more in Cumulative Updates can cause major issues, as there are so many changes and bug fixes in a one-year time frame. It will also be the fastest upgrade, as you can patch all servers at once.
I always recommend that you have a fresh backup of the data from the cluster, as one can never be 100% sure what might happen.
For any questions, join me in our Storage Spaces Direct Slack channel at https://storagespacesdirect.slack.com. We are a big community of experts and users sharing our experience and helping out as best we can.
reference: https://jtpedersen.com/2019/01/so-you-have-not-patched-your-storage-spaces-direct-cluster-in-a-year/