Over the last few weeks we have been having some issues with our Storage Spaces Direct (S2D) test/dev cluster. To start off, I will explain what happened and what went wrong.
- First, I replaced a PCIe NVMe card that is used as the caching device for S2D. If you are not familiar with S2D, go read this nice little blog from Cosmos Darwin, one of the Principal Managers at Microsoft for S2D and failover clustering. After the replacement, the drives became “lost”: they lost communication with the NVMe card. Every HDD and SSD needs to be bound to the caching device, and this binding happens during boot. I got this fixed by running a patch, as explained in this post. https://spirhed.com/how-replacing-a-nvme-card-on-a-s2d-cluster-caused-me-alot-of-hedache/
- Now to the 2nd problem. After the first issue was solved and all drives were rebuilt and happy, we noticed some severe performance drops. SQL restores that used to take 5 minutes suddenly took 3-4 hours, and response times were extremely high: 8000 ms on normal OS operations. This turned out to be because our SSDs did not have Power Loss Protection, which in practice means a battery for the drive's cache in the form of tantalum capacitors. Without it, writes bypass the cache and go directly to the NAND, which is not very fast. It is all explained here by Dan Lovinger at Microsoft. We fixed this by replacing all the SSDs with enterprise-grade SSDs like the Samsung SM863. You can read my post here. https://spirhed.com/troubleshooting-performance-issues-on-your-windows-storage-storage-spaces-direct/
- Now for the 3rd problem. I rebooted one of our Hyper-V nodes because a VM was stuck shutting down, and killing the Hyper-V process for that VM did not work, even with pskill. When I rebooted the host, the virtual disk went offline. You can read how, why, and when in my post here. https://spirhed.com/troubleshooting-failed-virtualdisk-on-a-storage-spaces-direct-cluster/
This page might get updated with new troubleshooting items in the future.
And thanks to the S2D, Failover Cluster, Storage, and ReFS teams at Microsoft for all the exceptional help. You know who you are 🙂