
Project Home Lab: Planning for Recovery

In my last post, Server Surgery Aftermath, I talked about the issues I was having with my home server. Whilst continuing to try and identify the issues after that post, I ran into some more BSODs and managed to collect useful crash dumps for a number of them. Reviewing the crash dumps with WinDbg from the Debugging Tools for Windows, I could see that in every instance of the BSOD the faulting module was network related, with the blame shared equally between Ndis.sys and NdisImPlatform.sys. That meant my earlier suspicions about the LSI MegaRAID controller went out of the window.
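If you end up with a pile of dumps, one low-effort way to confirm a pattern like this is to save the `!analyze -v` output for each dump as text and tally the `IMAGE_NAME` field. The Python sketch below assumes those transcripts have already been saved; the function name and the sample transcripts are hypothetical, and real `!analyze -v` output contains far more than the single line matched here.

```python
import re
from collections import Counter

# Matches lines such as "IMAGE_NAME:  ndis.sys" in saved "!analyze -v" output.
IMAGE_NAME_RE = re.compile(r"^IMAGE_NAME:[ \t]+(\S+)", re.MULTILINE)


def tally_faulting_images(transcripts: list[str]) -> Counter:
    """Count the faulting image named in each saved "!analyze -v" transcript.

    Names are lower-cased so "Ndis.sys" and "ndis.sys" land in one bucket;
    a transcript with no IMAGE_NAME line is counted as "<unknown>".
    """
    counts: Counter = Counter()
    for text in transcripts:
        match = IMAGE_NAME_RE.search(text)
        counts[match.group(1).lower() if match else "<unknown>"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical, heavily trimmed transcripts standing in for real dumps.
    samples = [
        "IMAGE_NAME:  ndis.sys\n",
        "IMAGE_NAME:  NdisImPlatform.sys\n",
        "IMAGE_NAME:  ndis.sys\n",
        "IMAGE_NAME:  NdisImPlatform.sys\n",
    ]
    for image, count in tally_faulting_images(samples).most_common():
        print(f"{image}: {count}")
```

Lower-casing the names matters because the same driver can show up with different capitalisation across dumps.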

Included in the trace was the name of another application which runs on the server. I’m not going to name the application in this instance, but let’s just say that it is able to burst ingress traffic as fast as my internet connection can handle. I decided to intentionally try to make the server crash by starting up the application and generating traffic with it, and sure enough, within a couple of minutes the server hit a BSOD and restarted. This now started to make sense: the Windows service for this application is configured for Automatic (Delayed Start), which explains why, in one instance after a BSOD, the server had another BSOD about 45 seconds later.

In the interim, I have disabled the services for this application and, with this information in hand, started looking more closely at the networking arrangements. As part of the server relocation, I had switched from my dual-port PCIe Intel PRO/1000 PT adapter to the on-board Intel 82576 adapters. Both of these adapter ports are configured in a single Windows Server native LBFO team using the Static Teaming mode, which is connected to a Static LAG on my switch.
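For anyone unfamiliar with how a static team and a static LAG share traffic: each end picks an outbound link per flow, typically by hashing address and port fields, so packets within one flow stay in order on one link. The toy Python model below illustrates only that idea; the hash, the `Flow` type, the member names and the addresses are all my own assumptions, not what Windows LBFO or my switch actually computes.

```python
import hashlib
from typing import NamedTuple


class Flow(NamedTuple):
    """A simplified flow identifier (hypothetical, for illustration only)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int


def pick_member(flow: Flow, members: list[str]) -> str:
    """Pin a flow to one team member via a stable hash of its 4-tuple.

    Not the hash used by Windows LBFO or any switch; it only shows the idea.
    hashlib is used instead of hash() because Python randomises str hashes
    per process, which would break the "same flow, same link" property.
    """
    if not members:
        raise ValueError("a team needs at least one member")
    key = f"{flow.src_ip}|{flow.dst_ip}|{flow.src_port}|{flow.dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return members[int.from_bytes(digest[:4], "big") % len(members)]


if __name__ == "__main__":
    team = ["NIC1", "NIC2"]  # hypothetical member names
    flows = [Flow("192.0.2.10", "198.51.100.20", 50000 + i, 443) for i in range(6)]
    for f in flows:
        print(f.src_port, "->", pick_member(f, team))
```

The trade-off of pinning flows like this is that a single flow can never go faster than one member link, even though many flows together can use both.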

To keep this story reasonably short, it turns out that the Windows Update-provided network driver for my Intel adapters is quite old, yet driver set 19.5, which Intel advertise as the latest available for my adapters, doesn’t support Windows Server 2012 R2 and will only install on Windows Server 2012. Even booting the server into Disable Driver Signature Enforcement mode didn’t allow the drivers to install. I quickly found that many other people have had similar issues with Intel drivers, because Intel block their drivers on selected operating systems for no apparent reason.

I found a post at http://foxdeploy.com/2013/09/12/hacking-an-intel-network-card-to-work-on-server-2012-r2/ which really helped me understand how the Intel driver package is put together and how to hack it to remove the Windows Server 2012 R2 restriction so that it can be installed. The changes I had to make differed slightly because I have a different adapter model, but the process was the same.
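For reference, this kind of restriction can be expressed in the INF file itself. INF models sections can carry an OS version decoration (`NTamd64.6.2` for Windows Server 2012, `NTamd64.6.3` for Windows Server 2012 R2), and as I understand it Windows uses the most specific decoration that applies, so a device listed only under the 6.2 section is not matched on 2012 R2 if a 6.3 section exists without it. The Python sketch below shows the "copy the entries across" step on a made-up INF fragment; I’m assuming the Intel INF is laid out along these lines, and every section name, install section and device ID in it is hypothetical. Bear in mind that editing an INF invalidates the package’s signed catalog, so expect to need Disable Driver Signature Enforcement (or test signing) to install the result.

```python
def _section_bounds(lines: list[str], name: str) -> tuple[int, int]:
    """Return (header_index, end_index) for an INF section; end is exclusive."""
    header = f"[{name}]".lower()
    for start, line in enumerate(lines):
        if line.strip().lower() == header:
            break
    else:
        raise ValueError(f"section {name!r} not found")
    end = start + 1
    while end < len(lines) and not lines[end].lstrip().startswith("["):
        end += 1
    return start, end


def copy_models_entries(inf_text: str, src_section: str, dst_section: str) -> str:
    """Append device entries from src_section to dst_section.

    Section names match case-insensitively (as INF section names do). Blank
    lines and ';' comments are not copied, and entries already present in
    dst_section are skipped, so running this twice changes nothing.
    """
    lines = inf_text.splitlines()
    s_start, s_end = _section_bounds(lines, src_section)
    d_start, d_end = _section_bounds(lines, dst_section)
    existing = {line.strip() for line in lines[d_start + 1 : d_end]}
    to_copy = []
    for line in lines[s_start + 1 : s_end]:
        entry = line.strip()
        if entry and not entry.startswith(";") and entry not in existing:
            to_copy.append(entry)
            existing.add(entry)
    # Insert after the last non-blank line of the destination section.
    insert_at = d_end
    while insert_at > d_start + 1 and not lines[insert_at - 1].strip():
        insert_at -= 1
    return "\n".join(lines[:insert_at] + to_copy + lines[insert_at:]) + "\n"


# Made-up INF fragment: names and the device ID are placeholders, not Intel's.
SAMPLE_INF = r"""[Manufacturer]
%Intel% = Intel, NTamd64.6.2, NTamd64.6.3

[Intel.NTamd64.6.2]
; hypothetical entry, for illustration only
%DEV_A.DeviceDesc% = DEV_A.ndi, PCI\VEN_8086&DEV_XXXX

[Intel.NTamd64.6.3]

[Strings]
Intel = "Intel"
"""

if __name__ == "__main__":
    print(copy_models_entries(SAMPLE_INF, "Intel.NTamd64.6.2", "Intel.NTamd64.6.3"))
```

Doing this by hand in a text editor works just as well; the script only makes the before-and-after of the edit explicit.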

Because my home server is considered production in my house, I can’t just go ahead and test things like hacked drivers on it. Luckily, this is where my single hardware architecture vision came out on top: I’ve installed the hacked and updated Intel driver on the Lab Storage Server and the Hyper-V server with no ill effects. I’ve even tested putting load between the two of them over the network and there have been no issues there either, so this weekend I will be taking the home server’s life in my hands and replacing the drivers, and hopefully that will be the fix.

If you want to read the full story behind my Intel troubleshooting, there is a thread I started on the Intel Communities (with no replies, I might add), and all the background detail is there at https://communities.intel.com/thread/58921?sr=stream.