In this guide I will explain what you can do to fix a failed virtual disk in a Failover Cluster. In S2D, ReFS writes some metadata to the volume when it mounts it. If it can't do this for some reason, the cluster moves the virtual disk from node to node until it has tried to mount it on the last host. Then it fails, the virtual disk ends up in a Failed state, and you will see this in the event log.

Updated April 18th 2018

[Screenshot: clusterlog]

If you also look in your ReFS event log, you will see things like this:

[Screenshot: refslog]

Now let's run a PowerShell command on one of the nodes to look at the virtual disk.

[Screenshot: get-virtualdisk1]
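Since the screenshot does not show the exact command, here is a minimal sketch of one way to list the virtual disks that are not healthy. The function name Get-UnhealthyVirtualDisk is made up for this example; Get-VirtualDisk and the properties selected are from the standard Storage module.

```powershell
# Hypothetical helper (name is an assumption): list virtual disks whose
# HealthStatus is anything other than Healthy.
function Get-UnhealthyVirtualDisk {
    Get-VirtualDisk |
        Where-Object { "$($_.HealthStatus)" -ne 'Healthy' } |
        Select-Object FriendlyName, OperationalStatus, HealthStatus, DetachedReason
}

# Usage, on any cluster node:
#   Get-UnhealthyVirtualDisk | Format-Table -AutoSize
```

A failed disk would be expected to show up in this list; a plain Get-VirtualDisk shows all disks, healthy or not.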

Updated section

Microsoft has recently changed what happens when a ReFS volume on a CSV goes offline, and has given us another parameter to use. Start by running these commands.

Remove-ClusterSharedVolume -Name "Cluster Virtual Disk (Test)"

Get-ClusterResource -Name "Cluster Virtual Disk (Test)" | Set-ClusterParameter -Name DiskRunChkDsk -Value 7
Get-ClusterResource -Name "Cluster Virtual Disk (Test)" | Set-ClusterParameter -Name DiskRecoveryAction -Value 1
Start-ClusterResource -Name "Cluster Virtual Disk (Test)"

Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask

FYI, the Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" | Start-ScheduledTask command needs to be run on the node that owns the failed disk. You can also run just Get-ScheduledTask -TaskName "Data Integrity Scan for Crash Recovery" to see when it finishes. It will show as Running until it's done.
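Instead of re-running Get-ScheduledTask by hand, you could poll it in a loop. This is only a sketch: the function name Wait-IntegrityScan and the 60-second default interval are my own choices, and it has to run on the node that owns the failed disk.

```powershell
# Hypothetical helper (name and poll interval are assumptions): poll the
# scheduled task until its State is no longer Running, then return that state.
function Wait-IntegrityScan {
    param(
        [string]$TaskName = 'Data Integrity Scan for Crash Recovery',
        [int]$PollSeconds = 60
    )
    while ($true) {
        # State is an enum on the real task object; compare it as a string.
        $state = "$((Get-ScheduledTask -TaskName $TaskName).State)"
        if ($state -ne 'Running') { return $state }
        Write-Host "Task '$TaskName' is still running; checking again in $PollSeconds s."
        Start-Sleep -Seconds $PollSeconds
    }
}

# Usage:
#   Wait-IntegrityScan
```

If the task was never started, the function returns right away with whatever state the task is in, so start the scan first.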

You will need to wait for the Data Integrity Scan to finish before continuing. The scan gets through about 1 TB of used space per hour, so if you have 8 TB in use it will take about 8 hours.
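To get a rough idea of the wait, the rule of thumb above can be put into a small function. The function name and parameters are made up for this example, and the 1 TB per hour rate is just the approximate figure from this post, not a guaranteed number.

```powershell
# Hypothetical helper: estimate scan time from used capacity, assuming the
# rough 1 TB/hour rate mentioned above. Rounds up to whole hours.
function Get-EstimatedScanHours {
    param(
        [Parameter(Mandatory)][double]$UsedTB,
        [double]$TBPerHour = 1.0
    )
    if ($TBPerHour -le 0) { throw 'TBPerHour must be greater than zero.' }
    [math]::Ceiling($UsedTB / $TBPerHour)
}

Get-EstimatedScanHours -UsedTB 8   # 8, as in the example above
```

Treat the result as a ballpark only; the real duration depends on your hardware and load.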

 

Now the virtual disk should look like this in Failover Cluster Manager.

[Screenshot: fixvirtualdisk1]

Wait for any storage jobs that are running to finish; there may be some after the scan. Run Get-StorageJob and check that it returns nothing. Once it's empty, we can add the virtual disk back as a Cluster Shared Volume.
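The waiting can be scripted as well. A minimal sketch, assuming you want to block until Get-StorageJob returns nothing; the function name Wait-StorageJobsEmpty and the 30-second default interval are my own choices.

```powershell
# Hypothetical helper (name and poll interval are assumptions): loop until
# Get-StorageJob returns no jobs at all.
function Wait-StorageJobsEmpty {
    param([int]$PollSeconds = 30)
    while ($true) {
        $jobs = @(Get-StorageJob)   # @() so a single job still has a Count
        if ($jobs.Count -eq 0) { return }
        Write-Host ("{0} storage job(s) still listed; checking again in {1} s." -f $jobs.Count, $PollSeconds)
        Start-Sleep -Seconds $PollSeconds
    }
}

# Usage:
#   Wait-StorageJobsEmpty
```

Finished jobs can stay in the list for a little while, so this may wait slightly longer than strictly needed.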

Stop-ClusterResource -Name "Cluster Virtual Disk (Test)"

Get-ClusterResource -Name "Cluster Virtual Disk (Test)" | Set-ClusterParameter -Name DiskRecoveryAction -Value 0
Get-ClusterResource -Name "Cluster Virtual Disk (Test)" | Set-ClusterParameter -Name DiskRunChkDsk -Value 0

Add-ClusterSharedVolume -Name "Cluster Virtual Disk (Test)"
Start-ClusterResource -Name "Cluster Virtual Disk (Test)"

Now it should be ok. You can run

Get-ClusterSharedVolume

And it should show as online.

If the volume does not come online after starting the cluster resource, run the first part again, including the data integrity scan, and let it sit for a while. Then run the second set of commands. One time I had to repeat this process 10 times before it came online.