Posts from January 2015

Support My Wife Without a Penny from Your Pocket

Apologies to those who come here for a techno-fix, but I’ve got to drop a personal post in here from time to time 🙂

My wife, who is currently a 3rd year university student studying for a degree in Midwifery, has the option to do an elective period of voluntary placement work overseas during this, the last year of her course. She is looking to work in Gibraltar for two weeks to complete this elective module because it gives her a taste of how they do things overseas without the complexities of a language barrier.

These elective placements are unpaid: the students are essentially working for free in exchange for hands-on experience and learning.

Because there are already enough charities asking for your money each month, she’s decided to take a different tack to fundraising using a site called Easy Fundraising. The way this site works is that when you go about your travels on the Internet, buying your wares from eBay, Amazon or other retailers, Easy Fundraising collect a small percentage of the value of the sale from the online stores in referral fees.

Simply visit http://www.easyfundraising.org.uk/causes/nicolagreen/ or use the bit.ly short-link I have created (http://bit.ly/nickygib) and make your online purchases via the Easy Fundraising site and we collect the referral bonuses. Sadly, I must confess that you do need to register on the Easy Fundraising site before you can support a cause.

International Technology Frustration

We live in a world where our communications are sent around the world in sub-second times thanks to services like Twitter, Facebook and WhatsApp. Thanks to Facebook, LinkedIn and other people hubs, we are more closely connected to those around us without geographic discrimination, and thanks to all of this high-speed communication and information transfer, we discover news and new information faster than ever before.

Taking all of this into consideration, why is it that we still live in a world where one country takes the glut of new technology releases while they never officially see the streets of foreign lands? All this does is line the pockets of the lucky few who are able to import and export these technologies and sell them abroad via channels like eBay at exorbitant prices.

In the technology arena, Microsoft are one of the worst offenders for doing this. There have been a number of releases over the years, including but not limited to the Zune, Surface Pro, the Microsoft Band and the Wireless Display Adapter for Miracast, and none of them has been released outside of the borders of the US and Canada. Why is it that these highly sought after devices are only being sold in the US and not sold worldwide via Microsoft’s normal retail channels?

I remember when the Surface Pro first launched and I waited months to get one officially in the UK but it never came, so I ended up importing one from the US with help from a former co-worker. Back when Zune was a thing, I happened to be in the US on a long bout of work with my family in tow so I decided to buy one whilst I was there. I, for one, would snap up a pair of Wireless Display Adapters and a Microsoft Band the day that they went on sale if they ever did appear here in the UK, but I’m not holding out much hope, which leaves me with the remaining option of buying them via eBay sellers.

The Microsoft Band is in high demand right now and whilst there are a few of them on eBay UK for sale, the price is riding higher than retail and, given that the device isn’t officially available here in the UK, you don’t know how your warranty will be affected.

The Wireless Display Adapter isn’t quite so hot, largely because other competing products are available in the UK such as the Netgear PTV3000 and as a result of this, if I wanted one, I’d have to buy one from a seller on eBay US and pay whatever import and duty taxes the British government deemed appropriate and then pay whatever handling tax DHL or UPS levy on the shipment for the privilege of advancing my customs payment for me.

All this behaviour results in is a reduced consumer experience, because there are devices out there that we want and the companies making them aren’t making them available to us, so middle-men fill the void, lining their own pockets with profit and driving the retail price up for consumers like you and me. I know that beaming a packet of data down an undersea fibre is obviously easier than arranging shipping and stocking of physical goods, but my point here is that with all of this technology to tell us what is happening around the world, to let us see what we could have, it’s akin to teasing a kid with a lollipop: waving it in front of their face, showing it to them, videoing yourself licking it and playing it over and over again in their face. The kid will end up crying and wanting the lolly and you’d likely give in and let them have it after enough of a tantrum, so why can’t companies see the same logic?

If the trend of devices only being released into the US and not being made available in Europe and the UK (and let us not forget our friends in Australia and New Zealand) continues then I think anything relating to the devices should be applied with IP filters to block people from outside of the availability regions from seeing, hearing or reading anything about it. At least that way, we wouldn’t have the lolly being waved under our noses to tempt us without the opportunity to ever have the lolly.

Free Fitbit Flex with Windows Phone Purchases

If you’re in the market for both a new smartphone and a fitness aid this year, Windows Phone could definitely be your friend.

Microsoft UK are currently running a promotion that started on January 12th 2015 and runs until March 31st 2015. If you purchase a Microsoft Lumia 735, 830 or 930 between these dates from one of the eligible retailers (almost all UK high street and network outlets are listed) then you can claim a free Fitbit Flex fitness activity and sleep tracking device.

To find out more about the details, visit http://www.microsoft.com/en-gb/mobile/campaign-fitbit/. If you want to skip straight to claiming your Fitbit device or want to know if your device is eligible then download the Fitbit Gift app from the Windows Phone Store at http://www.windowsphone.com/en-gb/store/app/fitbit-gift/ee34cfd1-e302-4820-a3cc-0d4e349ccf6a.

I’m a Fitbit user so I like the idea of this promotion but I equally struggle to see the logic in it: Microsoft are now in the fitness, activity and sleep tracking business with the Microsoft Band but, as we know, this isn’t available in the UK right now. I have to question whether this promotion would instead be for the Microsoft Band if it was available here. Given that the Flex retails for £60 and the Microsoft Band is $200 in the US, I can’t imagine it would be a free promotion like they have on the Flex, but I think it would likely be a discount code for £50 off the price of a Microsoft Band.

Fingers crossed the Microsoft Band makes its way UK-side via official channels one day soon and the promotion will flip on its head. Don’t forget that all Windows Phone 8.1 devices are going to be eligible for Windows 10 upgrades once the new OS ships too.

Invalid License Key Error When Performing Windows Edition Upgrade

Last week, I decided to perform the in-place edition upgrade from Windows Server 2012 R2 Essentials to Windows Server 2012 R2 Standard on my home server as part of a multitude of things I’m working on at home right now. Following the TechNet article for the command to run and the impact and implications of doing the edition upgrade at http://technet.microsoft.com/en-us/library/jj247582 I ran the command as instructed in the article but I kept getting a license key error stating that my license key was not valid.
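For anyone following along, the edition upgrade boils down to a single DISM command. The sketch below is an illustration under assumptions rather than a quote from the article: it assumes the command takes the form dism /online /set-edition:ServerStandard /accepteula /productkey:KEY, and it checks that the key at least has the usual five-by-five shape before building the argument list. The function name and the dummy key are hypothetical.

```python
import re

# Five groups of five letters/digits separated by hyphens (the usual key shape).
KEY_PATTERN = re.compile(r"^[A-Z0-9]{5}(-[A-Z0-9]{5}){4}$")


def build_set_edition_command(product_key, edition="ServerStandard"):
    """Return the DISM argument list for an in-place edition upgrade.

    Assumption: the flags follow the DISM /Set-Edition syntax; verify them
    against the TechNet article before running anything on a real server.
    """
    key = product_key.strip().upper()
    if not KEY_PATTERN.match(key):
        raise ValueError("product key is not in XXXXX-XXXXX-XXXXX-XXXXX-XXXXX form")
    return [
        "dism", "/online",
        f"/set-edition:{edition}",
        "/accepteula",
        f"/productkey:{key}",
    ]


if __name__ == "__main__":
    # Dummy key for illustration only; it is not a real licence key.
    print(" ".join(build_set_edition_command("ABCDE-12345-FGHIJ-67890-KLMNO")))
```

A shape check like this only catches typos and stray whitespace; it says nothing about whether the key is genuine or which licensing channel it belongs to.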

As my server was originally licensed under a TechNet key, I wondered if the problem could be down to different licensing channels preventing me from installing the key. On the server, I ran the command cscript slmgr.vbs /dlv to display the detailed license information and the channel was reported as Retail as I expected for a TechNet key. The key I am trying to use is an MSDN key which also should be reported as part of the Retail channel but to verify that, I downloaded the Ultimate PID Checker from http://janek2012.eu/ultimate-pid-checker/ and my Windows Server 2012 R2 Standard license key, sure enough is good, valid and just as importantly, from the Retail channel.

So my existing and new keys are from the same licensing channel and the new key checks out as being valid so what is the problem? Well it turns out, PowerShell was the problem.

Typically I launch a PowerShell prompt and then I enter cmd.exe if I need to run something which explicitly requires a Command Prompt. This makes it easy for me to jump back and forth between PowerShell and Command Prompt within a single window, hence the reason for doing it. I decided to try it differently so I opened a new standalone Administrative Command Prompt, without using PowerShell as my entry point, and the key was accepted and everything worked as planned.

The lesson here is this: if you are entering a command into a PowerShell prompt and it’s not working, try it natively within a Command Prompt as that may just be your problem.

Tesco Hudl 2 Date and Time Repeatedly Incorrect

Since about a week or so ago, the kids’ Tesco Hudl 2 tablets that they got for Christmas have been consistently reporting the wrong date and time. The issue is easily spotted because any time they launch an app, open the Google Play Store or perform any action that depends on an SSL certificate, they are shown a certificate warning due to the inconsistency between the server date and time and the client date and time. Sometimes the tablet can appear just a few hours out of sync but in the main, it seems that the devices reset their date to January 1st 2015.
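To see why a wrong clock produces certificate warnings: every certificate carries a "not before" and a "not after" date, and the client compares its own clock against that window. The sketch below is a minimal illustration of that comparison. The validity dates are hypothetical and chosen only to show the effect of a clock that has jumped back to 1st January.

```python
from datetime import datetime, timezone


def certificate_is_acceptable(device_now, not_before, not_after):
    """Return True when the device clock falls inside the validity window."""
    return not_before <= device_now <= not_after


if __name__ == "__main__":
    # Hypothetical validity window for a server certificate.
    not_before = datetime(2015, 1, 10, tzinfo=timezone.utc)
    not_after = datetime(2016, 1, 10, tzinfo=timezone.utc)

    correct_clock = datetime(2015, 1, 25, 9, 0, tzinfo=timezone.utc)
    reset_clock = datetime(2015, 1, 1, 0, 0, tzinfo=timezone.utc)  # clock reset to 1st January

    print(certificate_is_acceptable(correct_clock, not_before, not_after))  # True
    print(certificate_is_acceptable(reset_clock, not_before, not_after))    # False
```

Any certificate issued after the date the tablet thinks it is will look "not yet valid", which would explain warnings across many unrelated apps at once.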

Yesterday, I noticed for the first time that my Hudl 2 tablet had started exhibiting the same behaviour, which led me to look online to see if this is a widespread issue, as I couldn’t believe that all four of our Hudl 2 tablets could show the same symptoms and problems within two weeks of each other, especially considering I bought my tablet about a month after we bought the kids theirs, so they would likely be from different manufacturing batches.

Searching online, I came across a thread on the Modaco forums at http://www.modaco.com/topic/373796-misreported-time-and-other-things/ where other users are reporting the same issue and that it only seems to manifest after circa one month of using the device: an interesting observation given that I first powered up the kids’ tablets the week before Christmas to configure them and I got mine the week after Christmas.

Several users have tried contacting Tesco Technical Support and are advised to hard reset the devices or to exchange them in a local store but the issue continues to return and it appears from one commenter that Tesco is now working on a firmware update to address the issue. To me, this says that the current firmware build clearly has an issue relating to the CPU clock and tracking the time in relation to the CPU clock.

I reached out to Tesco on Twitter today to try and find out if it’s possible to contact their support via email or Twitter as opposed to phone, as I don’t want to have to call them to add four new serial numbers to the list of affected devices that they are tracking. If I get a response, I’ll update the post here, but in the meantime, if you have a Hudl 2 from Tesco and are experiencing the same date and time reset issue, it’s not you. It appears to be a known problem they are working on, but please do report it to Tesco.

The more people that report the issue, the faster Tesco are likely to work on the firmware update and get it released.

 

Project Home Lab: Planning for Recovery

In my last post, Server Surgery Aftermath, I talked about the issues I was having with my home server. Whilst continuing to try and identify the issues after the post, I ran across some more BSODs and I managed to collect useful crash dumps for a number of them. Reviewing the crash dumps with WinDbg from the Windows Debugging Tools, I was able to see that in every instance of the BSOD, the faulting module was network related, with the blame shared equally between Ndis.sys and NdisImPlatform.sys, which means that my previous suspicions about the LSI MegaRAID controller were out of the window.

Included in the trace was the name of another application which is running on the server. I’m not going to name the application in this instance but let’s just say that said application is able to burst ingress traffic as fast as my internet connection can handle it. I decided to intentionally try and make the server crash by starting up the application and generating traffic with it and, sure enough, within a couple of minutes the server experienced a BSOD and restarted. This now started to make sense because the Windows Service for this application is configured for Automatic (Delayed Start), which is why in one instance after a BSOD, the server had another BSOD about 45 seconds later.

For the interim, I have disabled the services for this application and with the information in hand, I started looking more closely into the networking arrangements. I knew that as part of the server relocation, I had switched from my dual port PCIe Intel PRO 1000/PT adapter to the on-board Intel 82576 adapters and both of these adapter ports are configured in a single Windows Server native LBFO team using the Static Team mode which is connected to a Static LAG on my switch.

To keep this story reasonably short, it turns out that the Windows Update provided network driver for my Intel adapters is quite old, yet the driver set 19.5 that Intel advertise as being the latest available for my adapters doesn’t support Windows Server 2012 R2 and will only install on Windows Server 2012. Even booting the server into the Disable Driver Signature Enforcement mode didn’t allow the drivers to install. I quickly found that many other people have had similar issues with Intel drivers due to them blocking drivers on selected operating systems for no good reason.

I found a post at http://foxdeploy.com/2013/09/12/hacking-an-intel-network-card-to-work-on-server-2012-r2/ which really helped me understand the Intel driver and how to hack it to remove the Windows Server 2012 R2 restrictions to allow it to be installed. The changes I had to make differed slightly due to me having a different adapter model but the process remained the same.

Because my home server is considered production in my house, I can’t just go right ahead and test things like hacked drivers on it. Luckily, my single hardware architecture vision came out on top because I’ve installed the hacked and updated Intel driver on the Lab Storage Server and the Hyper-V server with no ill effects. I’ve even tested putting load between the two of them over the network and there have been no issues either, so this weekend I will be taking the home server’s life in my hands and replacing the drivers, and hopefully that will be the fix.

If you want to read my full story behind the Intel issue troubleshooting, there is a thread I started on the Intel Communities (with no replies, I may add) and all the background detail is there at https://communities.intel.com/thread/58921?sr=stream.

Project Home Lab: Server Surgery Aftermath

So it seems that my last post about relocating the Home Server into the new chassis spoke a little too soon. Over New Year, a few days after the post, I started to have some problems with the machine.

It first happened when I removed the 3TB drive from the chassis to replace it with the new 5TB drive. This caused a Storage Spaces rebuild and all of the drives started to chatter away copying blocks around, and about half-way through the rebuild, the server stopped responding to pings. I jumped on to the IPMI remote console expecting to see that I was using so much I/O on the rebuild that it had decided to stop responding on the network but, in actual fact, the screen was blank and there was nothing there. I tried to start a graceful shutdown using the IPMI but that failed so I had to reset the server.

When Windows came back up, it greeted me with the unexpected shutdown error. I checked Storage Spaces and the rebuild had resumed with no lost data or drives and eventually (there’s a lot of data there) it completed and I thought nothing more of it all until New Year’s Day when the same thing happened again. This time, after restarting the server and realising this was no one-off event, I changed the Startup and Recovery settings in Windows to generate a Small Memory Dump (256KB), otherwise known as a Minidump, and I also disabled the automatic restart option as I wanted to try and get a chance to see the Blue Screen of Death (BSOD) if there was one.
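The Startup and Recovery dialog is a front end for a handful of registry values, so the same change can be scripted. The sketch below is an illustration, not the method used on this server: it assumes the settings live under the CrashControl key, with CrashDumpEnabled set to 3 for a small memory dump and AutoReboot set to 0 to stop the automatic restart, and it only prints the reg.exe commands rather than running them. The helper names are hypothetical.

```python
CRASH_CONTROL_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\CrashControl"

# CrashDumpEnabled values as commonly documented for Windows.
DUMP_TYPES = {"none": 0, "complete": 1, "kernel": 2, "small": 3}


def crash_control_values(dump_type="small", auto_restart=False):
    """Return the DWORD values matching the Startup and Recovery choices."""
    return {
        "CrashDumpEnabled": DUMP_TYPES[dump_type],
        "AutoReboot": 1 if auto_restart else 0,
    }


def as_reg_commands(values):
    """Render the values as reg.exe commands (printed here, not executed)."""
    return [
        f'reg add "{CRASH_CONTROL_KEY}" /v {name} /t REG_DWORD /d {data} /f'
        for name, data in values.items()
    ]


if __name__ == "__main__":
    for line in as_reg_commands(crash_control_values()):
        print(line)
```

Printing the commands first makes it easy to review them before pasting them into an elevated prompt; a restart is typically needed before a changed dump setting takes effect.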

Nothing happened on this front until yesterday. The server hung again and I restarted it but within two minutes of restarting, it did the same thing again. I decided to leave the server off for about five minutes to give it a little rest and then power it back up. Since then I’ve had no issues, but I have used the time wisely and gathered a lot of data and information.

I used WinDbg from the Windows Debugging Tools in the Windows SDK to read the Minidump file and the resultant fault code was WHEA Uncorrectable Error with a code of 0x124. To my sadness, this appears to be one of the most vague error messages in the whole of Windows. This code means that a hardware fault occurred which Windows could not recover from but because the CPU is the last device to be seen before the crash, it looks as if the fault is coming from the CPU. The stack trace includes four arguments for the fault code and the first argument is meant to contain the ID of the device which was seen by the CPU to have faulted but you guessed it, it doesn’t.

So knowing that I’ve got something weird going on with my hardware, I considered the possibilities. The machine is using a new motherboard so I don’t suspect that initially. It’s using a used processor and memory from eBay, which are suspects one and two, and it’s using the LSI MegaRAID controller from my existing build. The controller is a suspect due to the fact that on each occasion the crash has occurred, there has been a fair amount of disk I/O taking place (a Storage Spaces rebuild the first time and multiple Plex Media Server streams on the other occasions).

The Basic Tests

First and foremost, I checked against Windows Update and all of my patches are in order which I already knew but wanted to verify. Next, I checked my drivers as a faulting driver could cause something bad to get to the hardware and generate the fault. All of the components in the system are using WHQL signed drivers from Microsoft which have come down through Windows Update except for the RAID Controller. I checked the LSI website and there was a newer version of the LSI MegaRAID driver for my 9280-16i4e card available as well as a new version of the MegaRAID Storage Manager application so I applied both of these requiring a restart.

I checked the Intel website for drivers for both the Intel 82576 network adapters in the server and the Intel 5500 chipset and even though the version number of the Intel drivers is higher than those from Windows Update, the release date on the Windows Update drivers is later so upon trying to install them, Windows insists that the drivers installed are the best available so I’ll leave these be and won’t try to force drivers into the system.

Next up, I realised that the on-board Supermicro SMC2008 SAS controller (an OEM embedded version of the LSI SAS2008 IR chipset) was enabled. I’m not using this controller and don’t have any SAS channels connected to it so I’ve disabled the device in Windows to stop it from loading for now but eventually I will open the chassis and change the pin jumper to physically disable the controller.

Earlier, I mentioned that I consider the LSI controller to be a suspect. The reason for this is not reliability of any kind as the controller is amazing and frankly beyond what I need for simple RAID0 and RAID1 virtual drives but because it is a very powerful card, it requires a lot of cooling. LSI recommend a minimum of 200 Cubic Feet per Minute (CFM) of cooling on this card and with the new chassis I have, the fans are variably controlled by the CPU temperature. Because I have the L5630 low power CPU with four cores, the CPU is not busy in the slightest on this server and as a result, the fan speed stays low.

According to the IPMI sensors, the fan speed is at a constant 900 RPM with the current system and CPU temperatures. The RAID controller is intentionally installed in the furthest possible PCI Express 8x slot from the CPU to ensure that heat is not bled from one device into the other, but a byproduct of this is that the heat on the controller is likely not causing a fan speed increase. Using the BIOS, I have changed the fan configuration from the default setting of most efficient, which has a minimum speed of 900 RPM, to the Better Cooling option, which increases the lower limit to 1350 RPM.
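As a rough check on what that BIOS change buys, the fan affinity laws say that airflow scales more or less linearly with fan speed and static pressure with the square of it. The absolute CFM figure for these fans isn't known here, so the sketch below only works out the ratios between the old and new minimum speeds. Treat it as an idealised estimate: a real chassis full of drives and cables won't follow the laws exactly.

```python
def airflow_ratio(old_rpm, new_rpm):
    """Airflow scales roughly linearly with fan speed (first affinity law)."""
    return new_rpm / old_rpm


def pressure_ratio(old_rpm, new_rpm):
    """Static pressure scales roughly with the square of fan speed."""
    return (new_rpm / old_rpm) ** 2


if __name__ == "__main__":
    print(airflow_ratio(900, 1350))   # 1.5
    print(pressure_ratio(900, 1350))  # 2.25
```

In other words, the higher floor should move about half as much air again past the controller at idle, whatever the fans' actual CFM rating turns out to be.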

Lastly, I raised a support incident with LSI to confirm if there is a way to monitor the processor temperature on the controller. They have informed me that only the more modern dual core architecture controllers have the ability to report the processor temperature, either via the WebBIOS or via the MSM application. If I have more problems going forwards, I have a USB temperature probe which I could temporarily install in the chassis, although this isn’t going to be wholly accurate. In the meantime, the support engineer at LSI has taken an LSIGET dump of all of the controller and system data and is going to report back to me if there are any problems he can see.

The Burn Tests

Because I don’t want on-going reliability problems, and I want to drive the server to crash on my own schedule and see the problems happening live so that I can try to resolve them, I decided to perform some burn tests.

Memory Testing

Memory corruption and issues with memory are a common cause of BSODs in any system. I am using ECC buffered DIMMs, which can correct memory bit errors automatically, but that doesn’t mean we want errors to occur in the first place, so I decided to do a run of Memtest86.

Memtest86 Memory Speed Screenshot

I left this running for longer than the screenshot shows, but as you can see, there are no reported errors in Memtest86 so the memory looks clear of blame. What I really like about these results is that they show you how incredibly fast the L1 and L2 caches are on the processor, and I’m even quite impressed with how fast the DDR3-10600R memory in the DIMMs themselves is.

CPU Testing

For this test, I used a combination of Prime95 and Real Temp to both make the CPU hurt and also to allow me to monitor the temperatures vs. the Max TDP of the processor. I left the test running for over an hour at 100% usage on all four physical cores and here are the results.

RealTemp CPU Temperature

 

As you can see, the highest the temperature got was 63 degrees Celsius, which is 9 degrees short of the Max TDP of the processor. When I log in to the server normally while there are multiple Plex Media Server transcoding sessions occurring, the CPU isn’t utilised as heavily as in this test, so the fact that it can run at full load and the cooling is sufficient to keep it below Max TDP makes me happy. As a result of the CPU workload, the fan speed was automatically raised by the chassis. Here’s a screenshot of the IPMI Sensor output for both the system temperatures and the fan speed, remembering that the normal speed is 1350 RPM after my change.

IPMI Fan Speed Sensors

IPMI Temperature Sensors

 

To my surprise, the system temperature is actually lower under load than it is at idle. The increased airflow from the fans at the higher RPM is pushing so much air that it’s able to cool the system to two degrees below the normal idle temperature, neither of which is high by any stretch of the imagination.

Storage I/O Testing

With all of the tests thus far causing me no concern, I was worried about this one. I used ioMeter to test the storage and because ioMeter will fill a volume with a working file to undertake the tests, I created a temporary Storage Space in the drive pool of 10GB and I configured the drive with Simple resiliency level and 9 columns so that it will use all the disks in the pool to generate as much heat in the drives and on the controller as possible.
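The reason 9 columns matters is that a Simple space stripes data across its columns, so enough consecutive I/O ends up touching every disk. The sketch below is a simplified round-robin striping model and not Storage Spaces' actual allocator. It assumes an interleave of 256 KiB per column, which is an assumption here rather than a value taken from the server, and the function name is hypothetical.

```python
KIB = 1024


def columns_touched(offset, length, columns=9, interleave=256 * KIB):
    """Return the sorted column indexes hit by an I/O in a simple stripe model.

    Assumption: stripe units of `interleave` bytes are laid out round-robin
    across `columns` disks, starting with column 0 at offset 0.
    """
    first_unit = offset // interleave
    last_unit = (offset + length - 1) // interleave
    if last_unit - first_unit + 1 >= columns:
        return list(range(columns))  # the I/O wraps around every column
    return sorted({unit % columns for unit in range(first_unit, last_unit + 1)})


if __name__ == "__main__":
    print(columns_touched(0, 4 * KIB))               # [0]
    print(columns_touched(256 * KIB, 64 * KIB))      # [1]
    print(columns_touched(0, 9 * 256 * KIB))         # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Under this model a single small request lands on one disk, so it is the spread of many requests across the working file that keeps all nine drives busy at once.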

I ran three tests: 4KB block 50% Read, 64KB block 50% Read and lastly 256KB block 50% Read. Visiting the garage to look at the server in the rack while this was happening, I was greeted with an interesting light show on the drive access indicator lights. After ten minutes of the sustained I/O, nothing bad happened so I decided to stop the test. Whilst I want to fix any issues, I don’t want to burn out any drives in the process.

Conclusion

At the moment, I’m really none the wiser as to the actual cause of the problem but I am still in part convinced that it is related to the RAID controller overheating. The increased baseline fan speed should hopefully help with this by increasing the CFM of airflow in the chassis to cool the controller. I’m going to be leaving the server be now until I hear from LSI with the results from their data collection. If LSI come up with something useful then great. If LSI aren’t able to come up with anything then I will probably run another set of ioMeter tests but let it run for a longer period to really try and saturate some heat into the controller.

With any luck, I won’t see the problems again and if I do, at least I’m prepared to capture the dump files and understand what’s happening.