Microsoft SLA Uptime Service Credits: Decoding the Fine Print
An IT professional meets a cloud services provider at a party and asks, "So, you're available? How available?" It sounds like a cheesy pickup line, but with today's vague and sometimes complicated SLAs (service level agreements), it is exactly the question you should be asking about your online services.
With regard to its cloud-based solutions, Microsoft clarified its SLA calculations with the March 2016 release of the Microsoft Volume Licensing Service Level Agreement for Microsoft Online Services. The document details, product by product, the applicable terms, the formula used to calculate availability of the service, the uptime percentages, and the service credit that a customer has the right to claim in the event that the service is not available. (It is important to note that these credits are only available to customers who have purchased services via a volume license agreement. Customers who purchased open licenses or Office 365 small business solutions are credited in the form of service time as opposed to service fees.)
While some mission-critical applications call for high availability, a certain amount of downtime may be perfectly acceptable for some applications. The key is calculating the downtime in practical hours and minutes so that you understand exactly what the SLA covers. So, exactly how available are some of the more popular online services from Microsoft?
The New Microsoft Dynamics AX
The New Microsoft Dynamics AX, formerly known as AX 7, is currently provided as an online-only solution. (Other versions, such as Dynamics AX 2012, would be hosted in a chosen cloud environment and fall under a different SLA category such as Azure's Infrastructure as a Service (IaaS) Virtual Machines.) For the new AX, downtime is defined as, "Any period of time when end users are unable to login to their Active Tenant, due to a failure in the unexpired Platform or the Service Infrastructure as Microsoft determines from automated health monitoring and system logs." The document also defines a few of these other terms such as active tenant and service infrastructure. They effectively mean a high availability production deployment of the AX components and the additional authentication, compute, and storage resources that are combined to make the service.
This is a favorable SLA, as Microsoft is saying that if any component for which it is responsible fails, that equates to an outage. This is different from other SLAs that we often see in the marketplace, which "silo" the various services to create "stacked SLAs".
The formula for the monthly uptime percentage for the new AX is:
(User Minutes - Downtime) / User Minutes x 100
User minutes are defined as the total number of minutes in a month, less all scheduled downtime, multiplied by the total number of users.
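As a rough sketch of the arithmetic, the Python below computes user minutes as defined above and then an uptime percentage. It assumes the percentage takes the form (user minutes - downtime) / user minutes x 100, with downtime also expressed in user minutes; the function names are illustrative and do not come from the SLA document. Whether an outage is counted for one user or for every affected user is likewise an assumption here, so both cases are shown.

```python
def user_minutes(days_in_month, users, scheduled_downtime_minutes=0):
    """Total user minutes: (minutes in month - scheduled downtime) * users."""
    minutes_in_month = days_in_month * 24 * 60
    return (minutes_in_month - scheduled_downtime_minutes) * users


def monthly_uptime_percentage(total_user_minutes, downtime_user_minutes):
    """Assumed form: (user minutes - downtime) / user minutes * 100."""
    return (total_user_minutes - downtime_user_minutes) / total_user_minutes * 100


# 31-day month, 50 users, no scheduled downtime (illustrative inputs).
total = user_minutes(days_in_month=31, users=50)

# One hour of downtime counted for a single user (60 user minutes).
one_user_hour = monthly_uptime_percentage(total, 60)

# One hour of downtime counted for all 50 users (3,000 user minutes).
all_users_hour = monthly_uptime_percentage(total, 60 * 50)

print(total)
print(round(one_user_hour, 4))
print(round(all_users_hour, 4))
```

The gap between the two results shows how much the outcome depends on how many users an incident is counted against, so it is worth checking how your SLA defines downtime.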
The service credit you receive is based on a chart in the SLA document that maps monthly uptime percentage to a service credit percentage.
Now, to truly understand an outage in hours and minutes, some calculations are in order. New AX has a minimum user threshold of 50 users. Estimating on a 744-hour month (31 days), that equates to 2,232,000 user minutes per month. In this scenario, an hour of downtime counted as 60 user minutes (that is, affecting a single user) would reflect an availability percentage of 99.997. In fact, based on this formula and 50 users, in order to receive a service credit of 25% the organization would have to experience more than 18.5 hours of outage across all 50 users (55,500 user minutes) in a calendar month. While this SLA may be acceptable for some organizations, it may not be for others. By standard SLA terms, that type of outage is typically considered a 97.5% SLA. By comparison, a standard 99.5% uptime guarantee allows for 3h 39m of actual downtime in a month (based on an average 730-hour month).
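To turn an uptime percentage back into a downtime budget, the calculation can be inverted. The sketch below is illustrative only: the helper names are made up for this example, and the 97.5% figure is the one used in the paragraph above rather than a value confirmed from the SLA chart.

```python
def downtime_budget_minutes(total_minutes, threshold_pct):
    """Minutes of downtime at which uptime equals threshold_pct."""
    return total_minutes * (100 - threshold_pct) / 100


def as_hours_and_minutes(minutes):
    """Split a minute count into (whole hours, remaining minutes), rounded."""
    return divmod(round(minutes), 60)


users = 50                       # minimum user count from the text
total = 31 * 24 * 60 * users     # user minutes in a 744-hour month

# User minutes of downtime needed to reach 97.5% (assumed threshold).
budget = downtime_budget_minutes(total, 97.5)

# The same budget expressed as an outage hitting all 50 users at once.
hours, minutes = as_hours_and_minutes(budget / users)
print(budget, hours, minutes)

# A plain 99.5% guarantee over an average 730-hour month, for comparison.
std_hours, std_minutes = as_hours_and_minutes(downtime_budget_minutes(730 * 60, 99.5))
print(std_hours, std_minutes)
```

Expressing the budget both in user minutes and as a whole-tenant outage makes it easier to compare against a conventional wall-clock guarantee.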
Microsoft Dynamics CRM Online
Dynamics CRM follows the same formula as new AX above; however, there is no minimum number of users. In addition, the uptime threshold below which the first service credit applies is set higher, at 99.9%.
Many CRM deployments require more users, so for this example, we'll double the number of users from our AX calculation to 100 users. Again, estimating on a 744-hour month (31 days), 100 users equates to 4,464,000 user minutes per month. In this scenario, an hour of downtime counted as 60 user minutes would reflect an availability percentage of 99.998. In fact, you would need more than 70 user hours of downtime based on this formula to receive the 25% service credit: 70 user hours (4,200 user minutes) still equates to 99.91% uptime, and the 99.9% threshold is crossed only at roughly 74 user hours. For many organizations, a brief CRM outage will not halt day-to-day business. But it is important to understand the hours and minutes difference between a “three nines” SLA and a “four nines” SLA.
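The CRM figures, and the gap between three nines and four nines, can be checked with a few lines of Python. This is a sketch under the same assumed formula (uptime = (user minutes - downtime) / user minutes x 100); the constant and function names are hypothetical and chosen only for this example.

```python
USERS = 100                      # example user count from the text
MINUTES_IN_MONTH = 31 * 24 * 60  # 744-hour month
TOTAL_USER_MINUTES = MINUTES_IN_MONTH * USERS


def uptime_pct(downtime_user_minutes, total_user_minutes=TOTAL_USER_MINUTES):
    """Uptime percentage for a given amount of downtime in user minutes."""
    return (total_user_minutes - downtime_user_minutes) / total_user_minutes * 100


def budget_user_hours(threshold_pct, total_user_minutes=TOTAL_USER_MINUTES):
    """User hours of downtime at which uptime equals threshold_pct."""
    return total_user_minutes * (100 - threshold_pct) / 100 / 60


seventy_user_hours = uptime_pct(70 * 60)   # uptime after 70 user hours down
three_nines = budget_user_hours(99.9)      # budget at 99.9%
four_nines = budget_user_hours(99.99)      # budget at 99.99%

print(round(seventy_user_hours, 3))
print(round(three_nines, 2), round(four_nines, 2))
```

Each additional nine divides the downtime budget by ten, which is why the difference matters more than the headline percentage suggests.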
Microsoft Azure Virtual Machines
Azure also offers Infrastructure as a Service, with many options for virtual machines (VMs) that can be interconnected with virtual networks or connected back to on-premises networks via VPN or the ExpressRoute solution. Customers deploy VMs to run traditional rich client applications and server-based workloads. There are templates for all supported versions of Windows operating systems, as well as many Linux distributions. Some of the VMs are even pre-staged with SQL Server or SharePoint installed on them.
Downtime for Azure VMs is defined as the total accumulated minutes that are part of maximum available minutes that have no external connectivity. Microsoft defines maximum available minutes to be “the total accumulated minutes during a billing month for all internet facing virtual machines that have two or more instances deployed in the same availability set.” (An availability set in Azure is two or more machines deployed across different fault domains to avoid a single point of failure. In order to qualify for the SLA, you must deploy your workload in an availability set.) Maximum available minutes is measured from when at least two VMs in the same availability set have both been started, resulting from action initiated by you as the administrator to the time you have initiated an action that would result in stopping or deleting the virtual machines.
Microsoft has made it clear that the uptime guarantee is based on the time you actually use the VMs, which is a nice feature. This likely clears up past confusion, where a customer tried to calculate the SLA based on the total number of minutes in the month even though the machines were only in service for a few days or weeks.
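To see why the measurement window matters, here is a minimal sketch that compares the same outage measured against in-service minutes and against a full calendar month. It assumes the uptime percentage takes the form (maximum available minutes - downtime) / maximum available minutes x 100 and, as a simplification, treats the availability set as a single unit rather than summing minutes per VM. The dates, the two-hour outage, and the function names are made up for illustration.

```python
from datetime import datetime


def maximum_available_minutes(started, stopped):
    """Minutes between the start and stop actions for the availability set."""
    return (stopped - started).total_seconds() / 60


def vm_uptime_pct(available_minutes, downtime_minutes):
    """Assumed form: (available - downtime) / available * 100."""
    return (available_minutes - downtime_minutes) / available_minutes * 100


# Hypothetical deployment: VMs started March 1 and stopped March 11 (10 days).
started = datetime(2016, 3, 1, 0, 0)
stopped = datetime(2016, 3, 11, 0, 0)
in_service = maximum_available_minutes(started, stopped)

full_month = 31 * 24 * 60   # minutes in a 31-day month
downtime = 120              # hypothetical two hours without external connectivity

against_in_service = vm_uptime_pct(in_service, downtime)
against_full_month = vm_uptime_pct(full_month, downtime)

print(round(against_in_service, 4))
print(round(against_full_month, 4))
```

The same two hours look considerably worse when measured against the ten days the machines actually ran, which is the basis the SLA describes.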
Azure is a popular solution due to its flexibility and its fit for a wide variety of use cases. What causes concern, however, is a possible misunderstanding about the meaning of “no External Connectivity.” In the SLA document, Microsoft describes external connectivity as “bi-directional network traffic over supported protocols such as HTTP, and HTTPS that can be sent and received from a public IP address.” If the SLA only guarantees external connectivity, then Microsoft isn't guaranteeing that your workloads will continue to run.
Again, this solution is a great choice for many use cases. But beware that if your use case is mission-critical and the machines don’t respond, you risk uncovered downtime.
Reading the fine print
Planning for acceptable downtime is essential to your user experience and cost control. Unfortunately, many people see a 99.9% uptime guarantee headline and don't navigate the fine print of what is actually guaranteed, or understand the ramifications in practical hours and minutes. Your users' perception when one of these services goes down is that they experienced an outage, whether the SLA covers it or not. It is important to dive into the details of your SLA and plan for acceptable risks. For more fun with numbers, check out my previous article where we dissect stacked SLAs.