Posts from October 2014

Two Weeks of Dell, VMware and TechEd

It’s been a while since I’ve worked with VMware in any serious capacity, but for the last two weeks I’ve been working with a customer to deploy vSphere 5.5 on a new Dell VRTX chassis. I’ve seen the Dell VRTX on display at VMUG conferences gone by and it is certainly an interesting proposition, but this is the first time I’ve had a chance to work with it in the real world.

All in all, the Dell VRTX is a really nice system and everything seems to be well planned and thought out. The web interface for managing the chassis works, although it can be slow to open pages and refresh information; bearable, but noticeable. The remote KVM console to the blades is Java based, so your mileage may vary as to whether it works at all; I really dislike Java-based systems and wish more vendors would move to HTML5 for their interfaces.

There is an apparent lack of information on the Dell website about the VRTX system. There is a wealth of configuration guides and best practice documents, but all of them seem to be pitched at such a high level that they lack actual technical detail. Another issue is that the Dell parts catalogue doesn’t really acknowledge the existence of the VRTX. I was talking to someone about extending the system with Fibre Channel HBAs for FC storage connectivity, but of all the FC HBAs for sale on the Dell website, only a single-port 4Gbps HBA is listed as supported, which I can’t believe for one minute given that the PCIe slots in the VRTX are, well, PCIe slots.

Disk performance on the Shared PERC controller is pretty impressive, but networking needs to be approached with caution. The half-height PowerEdge M620 blade only exposes two 1GbE Ethernet interfaces to the internal switch plane on the chassis, whereas the full-height PowerEdge M520 blade exposes four 1GbE Ethernet interfaces. I would really have liked to see all four interfaces on the half-height blade, especially when building virtualization solutions with VMware vSphere or Microsoft Windows Server Hyper-V.

I haven’t really worked with VMware much since vSphere 5.0 and, working with vSphere 5.5, not an awful lot has changed. After talking with the customer in question, we opted to deploy the vCenter Server Appliance (vCSA). In previous releases of vSphere the vCSA was a bit lacklustre in its configuration maximums, but in 5.5 this has been addressed and it can now be used as a serious alternative to a Windows Server running vCenter. The OVA virtual appliance is 1.8GB on disk, deploys really quickly, and the setup is fast and simple. vSphere Update Manager (VUM) isn’t supported under Linux or on the vCSA, so you do still need a Windows Server for VUM, but as not everyone opts to deploy VUM, that’s not a big deal really. One caveat with the vCSA: if you plan to use local authentication rather than the VMware SSO service with Active Directory integration, I would still consider the Windows Server. The reason is that with the vCSA you cannot provision and manage users and passwords via the vSphere Web Client; instead you have to SSH onto the appliance and manage the users from the VI CLI. With the Windows Server we can obviously do this with the Users and Groups MMC console, which is much easier if you are of the Microsoft persuasion. If you are using the VMware SSO service with Active Directory integration then this will not be a problem for you.

Keeping on the VMware train, I’m looking forward to a day out at the UK VMware User Group (VMUG) conference in Coventry in two weeks’ time. I’ve been for the last three years and have had a really good and informative day every time.

Being so busy on the customer project with my head buried in VMware, I’ve been really slow on the uptake of TechEd Europe news, which bothers me. Fear not though: thanks to Channel 9, I’ve got a nice list of sessions to watch and enjoy from the comfort of my sofa, although with there being so many sessions that I’m interested in, it’s going to take me a fair old chunk of time to plough through them.

Thoughts on Windows Server 2003 End of Life

A post by me has just been published over on the Fordway blog at http://www.fordway.com/blog-fordway/windows-server-2003-end-of-life/.

This was written in parallel to my earlier post, Windows Server 2003 End of Life Spreadsheet, which reproduced the spreadsheet for documenting your Windows Server 2003 environment originally posted by Microsoft. In this new post on the Fordway blog, I talk about some of the areas where we need to focus our attention and offer up some food for thought. If you have any questions then please feel free to get in touch, either with myself or with someone at Fordway who will be happy to help you.

Monitoring SQL Server Agent Jobs with SCOM Guide

Late last night, I published a TechNet Guide that I have been working on recently entitled “Monitoring SQL Server Agent Jobs with SCOM”. Here’s the introduction from the document.

All good database administrators (DBAs) create jobs, plans and tasks to keep their SQL Servers in tip-top shape, but a lot of the time, insight into the status of these jobs is either left unturned like an age-old stone or is gained by configuring SQL Database Mail on your SQL Servers so that email alerts are generated, which means additional configuration on every server and yet another thing to manage.

In this guide, I am going to walk you through configuring a System Center Operations Manager 2012 R2 environment to extend the monitoring of your SQL Servers to include the health state of your SQL Server Agent Jobs, allowing you to keep an eye on not just the SQL Server platform but also on the jobs that run to make the platform healthy.

You can download the guide from the TechNet Gallery at https://gallery.technet.microsoft.com/SQL-Server-Agent-Jobs-with-f2b7d5ce. Please rate the guide to let me know whether you liked it or not using the star system on TechNet. I welcome your feedback in the Q&A.

Windows Server 2003 End of Life Plan Spreadsheet

Last week, the folks over at Microsoft published another entry in their blog post series Best Practices for Windows Server 2003 End-of-Support Migration (http://blogs.technet.com/b/server-cloud/archive/2014/10/09/best-practices-for-windows-server-2003-end-of-support-migration-part-4.aspx?wc.mt_id=Social_WinServer_General_TTD&WT.mc_id=Social_TW_OutgoingPromotion_20141009_97469473_windowsserver&linkId=9944146), which included a visually appealing spreadsheet template to help you keep track of and plan your Windows Server 2003 migrations. To my shock, though, they didn’t provide the actual Excel file for that design (shame on them).

I’ve copied the design and turned it into an Excel spreadsheet, set up with Conditional Formatting in the relevant cells so that when you add your numeric values and X’s, the cells are automatically coloured to keep things as pretty as intended. After all, we need a bit of colour and happiness to help us with Windows Server 2003 migrations, right?

Click the screenshot of the Excel file below for the download. As a note, make sure you use the Excel desktop application and not the Excel Web App to view or use this file, as the Web App appears to break some of the formatting and layout.

Server 2003 Migration Spreadsheet
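As an aside, if you wanted to recreate this kind of colouring in your own tracking sheet programmatically rather than through the Excel UI, here is a rough sketch of the idea using Python and openpyxl. The workbook name, cell range and fill colour are purely illustrative assumptions, not taken from my spreadsheet.

# Minimal sketch: colour cells containing an "X" in a migration tracking sheet.
# Assumes openpyxl is installed; the file name, range and colour are hypothetical.
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.formatting.rule import CellIsRule

wb = load_workbook("server2003-migration-plan.xlsx")  # hypothetical file name
ws = wb.active

green_fill = PatternFill(start_color="C6EFCE", end_color="C6EFCE", fill_type="solid")

# Highlight any cell in the tracking range whose value is exactly "X".
ws.conditional_formatting.add(
    "C2:J50",  # hypothetical tracking range
    CellIsRule(operator="equal", formula=['"X"'], fill=green_fill),
)

wb.save("server2003-migration-plan.xlsx")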

UPDATE: If you want to read more about Windows Server 2003 End of Life, a post by me has been published on the Fordway blog at http://www.fordway.com/blog-fordway/windows-server-2003-end-of-life/.

Explaining NUMA Spanning in Hyper-V

When we work in virtualized worlds with Microsoft Hyper-V, there are many things we have to worry about when it comes to processors. Most of these things come with acronyms which people don’t really understand but know they need, and one of them is NUMA Spanning. I’m going to try to explain it here, convey why we want to avoid NUMA Spanning where possible, and do it all in fairly simple terms to keep the topic light. In reality, NUMA architectures may be more complex than this.

NUMA Spanning relates to NUMA, or Non-Uniform Memory Access, an architecture supported by features that Intel and AMD introduced into their processors and chipsets; Intel implemented it with QuickPath Interconnect (QPI) in 2007 and AMD with HyperTransport in 2003. NUMA uses a construct of nodes in its architecture. As the name suggests, NUMA refers to system memory (RAM) and how we use memory, and more specifically, how we determine which memory in the system to use.

Single NUMA Node

Single NUMA Node

In the simplest system, you have a single NUMA node. A single NUMA node is achieved either in a system with a single processor socket or with a motherboard and processor combination that does not support NUMA. With a single NUMA node, all memory is treated as equal and a VM running on a hypervisor on this configuration would use any memory available to it without preference.

Multiple NUMA Nodes

Two NUMA Nodes

In a typical system today, with multiple processor sockets and a processor and motherboard configuration that supports NUMA, we have multiple NUMA nodes. NUMA nodes are determined by the arrangement of memory DIMMs in relation to the processor sockets on the motherboard. Take a hugely oversimplified sample system with two CPU sockets, each loaded up with a single-core processor and six DIMM slots per socket, each slot populated with an 8GB DIMM (12 DIMMs total). In this configuration we have two NUMA nodes, and each NUMA node consists of one CPU socket and its directly connected 48GB of memory.

The reason for this relates to the memory controller within the processor and the interconnect paths on the motherboard. The Intel Xeon processor, for example, has an integrated memory controller which is responsible for the addressing and resource management of the six DIMMs attached to the DIMM slots linked to this processor socket. For this processor to access this memory, it takes the quickest possible path, directly between the processor and the memory; this is the uniform, local case.

For this processor to access memory in a DIMM slot linked to our second processor socket, it has to cross the interconnect on the motherboard and go through the memory controller on the second CPU. All of this takes mere nanoseconds, but it is additional latency that we want to avoid in order to achieve maximum system performance. We also need to remember that if we have a good virtual machine consolidation ratio on our physical host, this may be happening for multiple VMs all over the place, and that adds up to a lot of nanoseconds over time. This is NUMA Spanning at work: the processor is breaking out of its own NUMA node to access non-uniform memory in another NUMA node.

Considerations for NUMA Spanning and VM Sizing

NUMA Spanning has a bearing on how we should size the VMs that we deploy to our Hyper-V hosts. In my sample server configuration above, I have 48GB of memory per NUMA node. To minimize the chances of VMs spanning these NUMA nodes, we need to deploy our VMs with sizes that fit this layout. If I deployed 23 VMs with 4GB of memory each, that equals 92GB: the 48GB of memory in the first NUMA node could be entirely allocated to VM workloads and 44GB of memory allocated to VMs in the second NUMA node, leaving 4GB of memory for the parent partition of Hyper-V to operate in. None of these VMs would span NUMA nodes, because 48GB divided by 4GB is 12, meaning 12 whole VMs fit per NUMA node.

If I deployed 20 VMs but this time with 4.5GB of memory each, this would require 90GB of memory for virtual workloads and leave 6GB for the parent partition of Hyper-V. The problem here is that 4.5GB doesn’t divide evenly into 48GB; we have leftovers and uneven numbers. Ten of our VMs would fit entirely into the first NUMA node and nine would fit entirely within the second, but our 20th VM would be in no man’s land, left with its memory split across the two NUMA nodes.

Good design practice is to size our VMs to match the NUMA architecture. With my sample configuration of 48GB per NUMA node, we should use VMs with memory sizes of 2GB, 4GB, 6GB, 8GB, 12GB, 24GB or 48GB, sizes that divide evenly into the node. Anything else runs a real risk of being NUMA spanned.
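To make the arithmetic above concrete, here is a small sketch, in Python purely for illustration, that checks whether a proposed VM memory size packs cleanly into a NUMA node of a given size. The node and VM sizes are just the figures from my example; substitute your own hardware’s values.

# Minimal sketch: check how VM memory sizes pack into a NUMA node.
# Figures below match the worked example (48GB per node); adjust for your hosts.

def numa_fit(node_size_gb: float, vm_size_gb: float) -> None:
    vms_per_node = int(node_size_gb // vm_size_gb)
    leftover = node_size_gb - (vms_per_node * vm_size_gb)
    divides_evenly = leftover == 0
    print(f"{vm_size_gb}GB VMs: {vms_per_node} per {node_size_gb}GB node, "
          f"{leftover}GB left over, "
          f"{'no spanning risk' if divides_evenly else 'risk of NUMA spanning'}")

numa_fit(48, 4)    # 12 VMs per node, no leftover
numa_fit(48, 4.5)  # 10 VMs per node, 3GB left over, so a VM may end up spanning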

Considerations for Disabling NUMA Spanning

So now that we understand what NUMA Spanning is and the potential decrease in performance it can cause, we need to look at it through a virtualization lens, as this is where its effect is felt most. The hypervisor understands the NUMA architecture of the host by detecting the hardware within it. When a VM starts and the hypervisor allocates memory for it, it will always try to get memory within the NUMA node of the processor being used for the virtual workload, but sometimes that may not be possible because other workloads are already occupying that memory.

For the most part, leaving NUMA Spanning enabled is totally fine, but if you are really trying to squeeze performance from a system, a virtual SQL Server perhaps, NUMA Spanning is something we would like to have turned off. NUMA Spanning is enabled by default in both VMware and Hyper-V. It is enabled at the host level, but we can override the configuration both per hypervisor host and per VM.

I am not for one minute going to recommend that you disable NUMA Spanning at the host level, as this might impact your ability to run your workloads: if NUMA Spanning is disabled for the host and the host cannot accommodate the memory demand of a VM within a single NUMA node, the power-on request for that VM will fail and you will be unable to turn on the machine. However, if you have some VMs with NUMA Spanning disabled and others with it enabled, your host can work like a memory-based jigsaw puzzle, fitting things in where it can.

Having SQL Servers and other performance-sensitive VMs run with NUMA Spanning disabled is advantageous to their performance, while leaving NUMA Spanning enabled on VMs which are not performance sensitive allows them to use whatever memory is available and cross NUMA nodes as required. This gives you the best combination: maximum performance for your intensive workloads and the flexibility to find resources for those that are not.
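To illustrate the jigsaw-puzzle behaviour described above, here is a toy Python simulation of the placement decision. It is not the actual Hyper-V algorithm, just a first-fit sketch under the assumption that a VM with spanning disabled must fit entirely within one node while a VM with spanning enabled can be split across nodes.

# Toy placement model: NUMA-spanning-disabled VMs need a single node with room;
# spanning-enabled VMs can be split across nodes. Not the real hypervisor logic.

def place_vm(nodes_free_gb: list[float], vm_gb: float, spanning: bool) -> bool:
    if not spanning:
        for i, free in enumerate(nodes_free_gb):
            if free >= vm_gb:
                nodes_free_gb[i] -= vm_gb
                return True
        return False  # power-on would fail: no single node can hold the VM
    remaining = vm_gb
    for i, free in enumerate(nodes_free_gb):
        take = min(free, remaining)
        nodes_free_gb[i] -= take
        remaining -= take
    return remaining == 0

nodes = [48.0, 48.0]                        # two 48GB NUMA nodes
print(place_vm(nodes, 24, spanning=False))  # True, fits in node 0
print(place_vm(nodes, 40, spanning=False))  # True, fits in node 1
print(place_vm(nodes, 30, spanning=False))  # False, no single node has 30GB free
print(place_vm(nodes, 30, spanning=True))   # True, split 24GB + 6GB across nodes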

Using VMM Hardware Profiles to Manage NUMA Spanning

VMM Hardware Profile NUMA Spanning

So, assuming we have a Hyper-V environment that is managed by Virtual Machine Manager (VMM), we can make this really easy to manage without having to bother our users or systems administrators with understanding NUMA Spanning. When we deploy VMs, we can base them on Hardware Profiles. A VMM Hardware Profile exposes the NUMA Spanning option, so we would simply create multiple Hardware Profiles for our workload types: some for general purpose servers with NUMA Spanning enabled, and others configured specifically for performance-sensitive workloads with NUMA Spanning disabled in the profile.

The key thing to remember is that if you have VMs already deployed in your environment, you will need to update their configuration. Hardware Profiles in VMM are not linked to the VMs we deploy, so once a VM is deployed, any changes to the Hardware Profile it was deployed from do not filter down to the VM. The other thing to note is that the NUMA Spanning configuration is only applied at VM start-up and during Live or Quick Migration, so if you want your VMs to pick up the setting after you have changed it, you will need either to stop and start the VM or to migrate it to another host in your Hyper-V Failover Cluster.

Gartner Magic Quadrant Unified Communications

Well, here’s one you wouldn’t have expected to see: Gartner have placed Microsoft and Lync ahead of Cisco in their Unified Communications Magic Quadrant.

Gartner have put Cisco and Microsoft level for Ability to Execute; however, Microsoft have been placed ahead on Completeness of Vision. You can read the full article at http://www.gartner.com/technology/reprints.do?id=1-1YWQWK0&ct=140806&st=sb. Well done Microsoft. Now, if work can be done to address the cautions that Gartner have identified, the position will be even stronger.

System Center Service Manager 2012 R2 Data Warehouse Reports Unavailable

Late last week, I had the pleasure of deploying and configuring a System Center Service Manager 2012 R2 Data Warehouse. Today I was informed that none of the reports were available in the Reporting tab in SCSM, so I had a look at what the problem might be.

With the SCSM Data Warehouse, the most important job during setup is one of the Data Warehouse jobs, named MPSyncJob. The MPSyncJob has the purpose of deploying all of the management packs from SCSM into the report folders in SQL Server Reporting Services (SSRS).

When I looked at this job in the Data Warehouse Jobs tab under Data Warehouse in the SCSM console, 175 of the 181 management packs had the status Associated, but the remaining six were stuck at Pending Association, and these were all reporting management packs. Viewing the Management Packs tab under Data Warehouse in the SCSM console, I could see that the same six management packs had a Deployment Status of Failed, which is obviously not good.

I logged on to the SCSM Data Warehouse server and looked into the Operations Manager event log, which is where SCSM records all its events. There were a number of critical events in the log with the event source Deployment, and the messages went along the lines of insufficient permissions to complete the requested operation, so I knew immediately there was a permissions issue with SSRS. I headed over to the SSRS Report Manager URL, which normally looks like https://SERVERNAME.domain.suffix/Reports_InstanceName, and logged in as myself.

Viewing the permissions on the System Center and Service Manager report folders, I could quickly see that the account I had specified during the setup of the SCSM Data Warehouse was missing; the installer had not properly assigned the permissions to the account.

I manually added the permissions for the account and restarted the deployment of the management packs in the failed state, and the Operations Manager log now reports that they have been deployed successfully. Happy days. Now I just need to wait for SCSM to complete all of the other jobs in the appropriate order to get the full functionality through from our Data Warehouse.


Active Directory and DFS-R Auto-Recovery

I appreciate this is an old subject, but it is one that I’ve come across a couple of times recently, so I wanted to share it and highlight its importance. This will be one of a few upcoming posts on slightly older topics that are nonetheless important and need to be addressed.

How Does DFS-R Affect Active Directory

In Windows Server 2008, Microsoft made a big change to Active Directory Domain Services (AD DS) by allowing us to use DFS-R as the underlying replication technology for the Active Directory SYSVOL, replacing the File Replication Service (FRS) that had been with us since the birth of Active Directory. DFS-R is a massive improvement on FRS; you can read about the changes it brings and the benefits at http://technet.microsoft.com/en-us/library/cc794837(v=WS.10).aspx. If you have upgraded your domains from Windows Server 2003 to Windows Server 2008 or Windows Server 2008 R2 and you haven’t completed the FRS to DFS-R migration, I’d really recommend you look at it. It is easily overlooked, because you have to complete this part of the migration manually in addition to upgrading or replacing your domain controllers with Windows Server 2008 servers, and there are no prompts or reminders to do it. There is a guide available on TechNet at http://technet.microsoft.com/en-us/library/dd640019(v=WS.10).aspx to help you through the process.
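If you are not sure whether a domain has completed the migration, the built-in dfsrmig tool reports the current migration state. Below is a small Python sketch that simply shells out to it as a quick check; run it on a domain controller with appropriate rights, and treat the output parsing as something to adapt to your environment.

# Minimal sketch: check SYSVOL FRS-to-DFS-R migration progress on a domain controller.
# dfsrmig.exe ships with Windows Server 2008 and later; requires suitable permissions.
import subprocess

result = subprocess.run(
    ["dfsrmig.exe", "/getmigrationstate"],
    capture_output=True,
    text=True,
)
print(result.stdout)
# The end goal is a message saying all domain controllers have reached the
# 'Eliminated' global state; anything else means the migration is incomplete.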

Back in January 2012, Microsoft released KB2663685, which changes the default behaviour of DFS-R replication, and it affects Active Directory. Prior to the hotfix, when a DFS-R replication group member experienced a dirty shutdown, the member would perform an automatic recovery when it came back online; after the hotfix, this is no longer the case. The behaviour change means that a DFS-R replication group member halts replication after a dirty shutdown and awaits manual intervention. Your intervention choices range from manually activating the recovery to decommissioning and replacing the server, depending on the nature of the dirty shutdown. What we need to understand, however, is that a dirty shutdown can happen more often than you think, so it’s important to be aware of this.

Identifying Dirty DFS-R Shutdown Events

Dirty shutdown events are logged to the DFS Replication event log with the event ID 2213, as shown in the screenshot below, and the event advises you that replication has been halted. If you have virtual domain controllers and you shut a domain controller down using the Shutdown Guest Operating System option in vSphere or in Hyper-V, this will actually trigger a dirty shutdown state. Similarly, if you have an HA cluster of hypervisors and a host failure causes the VM to restart on another host, yep, you guessed it, that’s another dirty shutdown. The lesson here, first and foremost, is to always shut down domain controllers from within the guest operating system to ensure it is done cleanly and not forcefully via a machine agent. Event ID 2213 is quite helpful in that it gives us the exact command to recover replication, so a simple copy and paste into an elevated command prompt will recover the server; no need to edit to taste. Once you’ve entered the command, another event is logged with the event ID 2214 to indicate that replication has recovered, as shown in the second screenshot.

AD DS DFS-R Dirty Shutdown 2213  AD DS DFS-R Dirty Shutdown 2214
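If you want to spot these halts without trawling the event log by hand, something as simple as the sketch below will do; it just shells out to the built-in wevtutil tool to pull recent 2213 events from the DFS Replication log. It is only a quick local check, run on the domain controller itself; fold the idea into whatever monitoring you already have.

# Minimal sketch: look for recent DFS-R dirty shutdown (event ID 2213) entries.
# Uses the built-in wevtutil tool; run locally on the domain controller.
import subprocess

query = subprocess.run(
    [
        "wevtutil.exe", "qe", "DFS Replication",
        "/q:*[System[(EventID=2213)]]",  # only dirty shutdown events
        "/f:text",                       # human-readable output
        "/rd:true",                      # newest first
        "/c:5",                          # last five matches
    ],
    capture_output=True,
    text=True,
)

if query.stdout.strip():
    print("DFS-R dirty shutdown events found; replication may be halted:")
    print(query.stdout)
else:
    print("No recent event ID 2213 entries found.")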

Changing DFS-R Auto-Recovery Behaviour

So now that we understand the behaviour change and the event IDs that let us track this issue, how can we get back to the previous behaviour so that DFS-R can recover itself automatically? Before you make this change, you need to realise that there is a risk: if you allow automatic recovery of DFS-R replication groups and the server coming back online really is dirty, it could have an impact on the integrity of your Active Directory Domain Services SYSVOL directory.

Unless you have a very large organisation, or unless you are making continuous changes to your Group Policy Objects or the files stored in SYSVOL, this shouldn’t really be a problem, and I believe the risk is outweighed by the advantages. If a domain controller restarts and you don’t pick up on the event ID 2213, you have a domain controller which is out of sync with the rest of the domain controllers. The consequence is that domain members and domain users will get out-of-date versions of Group Policy Objects if they use this domain controller, because it will still be active and servicing clients whilst its DFS-R replication group is in an unhealthy state.
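For reference, the behaviour is controlled by the StopReplicationOnAutoRecovery value under the DFSR service’s Parameters registry key; setting it to 0 restores automatic recovery. Below is a cautious Python sketch that reads, and only if you uncomment the last lines sets, that value. Treat it as an illustration, test it in a lab first, and check Microsoft’s current guidance before rolling it out.

# Minimal sketch: inspect (and optionally change) the DFS-R auto-recovery setting.
# StopReplicationOnAutoRecovery = 1 halts replication after a dirty shutdown;
# 0 restores the pre-KB2663685 automatic recovery behaviour. Test in a lab first.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\DFSR\Parameters"
VALUE_NAME = "StopReplicationOnAutoRecovery"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
    try:
        value, _ = winreg.QueryValueEx(key, VALUE_NAME)
        print(f"{VALUE_NAME} is currently {value}")
    except FileNotFoundError:
        print(f"{VALUE_NAME} is not set; the operating system default applies")

# To re-enable automatic recovery, uncomment the lines below and run elevated:
# with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE) as key:
#     winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, 0)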

Effects Beyond Active Directory

DFS-R is a technology originally designed for replicating file server data, so this change to DFS-R auto-recovery impacts not only Active Directory, the scope of this post, but also file services. If you are using DFS-R to replicate your file servers then you may want to consider this change for those servers too. Whilst an out-of-date SYSVOL can be an inconvenience, an out-of-date file server can be a major problem: users will be working with out-of-date copies of documents, or may not even be able to find a document if the one they are looking for is new and hasn’t yet been replicated to their target server.

My take on this, though, would be to consider the change very carefully for a file server. Whilst a corrupt Group Policy can fairly easily be fixed, recovered from a GPO backup or re-created if the policy wasn’t too complex, asking a user to re-create their work because you allowed a corrupt copy of it to be brought into the environment might not go down quite so well.

SQL Server Maintenance Solution

Earlier this year, I posted about a tool from Brent Ozar called sp_Blitz and how it gives you amazing insight into configuration problems with your SQL Servers. Well, today I am here to tell you about another great SQL tool available for free online: the SQL Server Maintenance Solution by Ola Hallengren, a Swedish database administrator who was awarded Microsoft MVP for the first time this year.

You can download his tool from https://ola.hallengren.com/, and the site has full documentation for all of its features, including the most common configuration examples, so you can get up and running with it really quickly.

The SQL Server Maintenance Solution is a .sql file that you download and run, and it installs itself as a series of stored procedures in your master database. The tool works by invoking these stored procedures as SQL Agent jobs, and by default it will create a number of these jobs unless you opt out during the install by changing one of the lines in the .sql file.

I opted not to install the default jobs but to create my own, so I could configure how and what I wanted the scripts to do, but it really is so simple that no SQL administrator has any excuse not to be performing good routine maintenance. I am using Ola’s scripts both to perform routine DBCC CHECKDB consistency checks and to perform index defragmentation on databases, which is where their real power lies.

The reason Ola’s scripts beat a SQL maintenance plan for index defragmentation, and the main reason I wanted to use them, is that Ola gives us the flexibility to perform different actions according to the level of fragmentation. For example, I could do nothing if fragmentation in an index is below 10%, reorganise an existing index if fragmentation is between 10% and 30%, and completely rebuild the index if it is over 30%. Compare this to a SQL maintenance plan, where your only option is to reorganise or rebuild everything regardless of fragmentation level, and you can see the advantage.
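As a rough illustration of wiring those thresholds up outside of a SQL Agent job, the sketch below calls Ola’s IndexOptimize procedure from Python via pyodbc with the 10%/30% boundaries from my example. The server name, ODBC driver and database selection are assumptions for the sketch, and the parameter names come from Ola’s documentation, so check them against the version you download.

# Minimal sketch: run Ola Hallengren's IndexOptimize with tiered fragmentation actions.
# Server name, driver and database selection below are placeholders; adjust to suit.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 11 for SQL Server};"  # assumed driver name
    "SERVER=SQLSERVER01;"                      # hypothetical server
    "DATABASE=master;"
    "Trusted_Connection=yes;",
    autocommit=True,
)

conn.execute(
    """
    EXECUTE dbo.IndexOptimize
        @Databases = 'USER_DATABASES',
        @FragmentationLow = NULL,                  -- below 10%: do nothing
        @FragmentationMedium = 'INDEX_REORGANIZE', -- 10% to 30%: reorganise
        @FragmentationHigh = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE', -- over 30%: rebuild
        @FragmentationLevel1 = 10,
        @FragmentationLevel2 = 30
    """
)
conn.close()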

So now, thanks to the community and to Brent and Ola, we can check the configuration of our SQL Servers to make sure they are happy and safe, as well as easily configure our daily and weekly checks and maintenance on databases to keep our servers and our databases happy, and we all know that happy databases mean happy software.

In another post coming up soon, I will show you how we can update the configuration of our SCOM Management Pack for SQL Server so that we can receive alerts for failed SQL Server Agent Jobs, allowing us to centralise our knowledge, reporting and alerting for SQL maintenance tasks.