Comparing Service Level Agreements: Stacked SLAs, Reindeer Games and Defining the Cost of an Outage
I was working with a mid-market customer several years ago who found (after an outage) that an hour of downtime to their operation represented a loss of $8,000. This was a combination of the lost revenue, staff and facilities costs. They were a services firm that processed claims, and as such their product (human time) was perishable and once lost, couldn’t be replaced. Shortly after their downtime incident, they implemented completely redundant networks to prevent such an outage from happening again.
So what can you do today to help your organization prepare for and prevent outages? A good starting point is understanding the differences in availability levels and service level agreements. From there, define and align service level agreements within your organization for your various systems.
You’ll see many different categories and guarantees of availability, from 95% to 99.999% (“five nines”). For example, a 99.99% (“four nines”) uptime SLA allows for 4 min 23 sec of outage per month, or 8.6 sec per day, compared with a 99.9% (“three nines”) SLA, which allows for up to 43 min 49 sec per month.
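The arithmetic behind these allowances is easy to reproduce for any uptime percentage. The following is a minimal sketch, assuming an average month of 365.25 / 12 days; the function names are illustrative and a provider's contract may define the measurement period differently.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 365.25 / 12  # assumption: average month length


def allowed_downtime_seconds(uptime_percent, period_days):
    """Seconds of downtime an uptime percentage permits over a period."""
    return (100 - uptime_percent) / 100 * period_days * SECONDS_PER_DAY


def format_duration(seconds):
    """Render a number of seconds as minutes and seconds."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)} min {secs:.1f} sec"


for sla in (99.9, 99.99, 99.999):
    monthly = allowed_downtime_seconds(sla, DAYS_PER_MONTH)
    daily = allowed_downtime_seconds(sla, 1)
    print(f"{sla}%: {format_duration(monthly)} per month, "
          f"{format_duration(daily)} per day")
```

Under that month-length assumption, the results land within about a second of the figures quoted above; small differences come down to rounding and how a month is defined.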
For some applications, a three nines coverage is sufficient, but for others you need the four nines. To determine this you need to understand the cost for that 99.99%, and how it compares to the revenue earned or the outage costs of that system.
Beyond the uptime percentage, not all SLAs offer the same protection. Some companies like to play reindeer games with their uptime marketing. Beware of a “Stacked SLA”: a service level agreement written around a service that has multiple components.
Examples: a cloud service that has separate SLAs for storage, compute and network services, or a solution that provides multiple connectivity options like a rich client and a web client. The challenge with these stacked SLAs is that you may experience an outage of one of the components, but because the other components are still available, it doesn’t trigger a violation of the agreement. Your user was unable to work, and as far as they experienced it, there was an outage. Unfortunately, the way the language is written in these agreements, you don’t have a right to damages based on only one component of the service being unavailable.
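One way to see why per-component guarantees can overstate what a user experiences is to combine them. The sketch below assumes the user needs every component to be up at once and that components fail independently, which is a simplification; the three 99.9% figures are hypothetical and not taken from any provider's agreement.

```python
from math import prod


def composite_availability(component_percents):
    """Availability (in percent) of a service that needs ALL components up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(p / 100 for p in component_percents) * 100


# Hypothetical stack: storage, compute and network, each at "three nines".
stack = [99.9, 99.9, 99.9]
print(f"Composite availability: {composite_availability(stack):.4f}%")
```

Under these assumptions, three components that each promise 99.9% combine to roughly 99.7% for the user who depends on all of them, which is why it is worth asking what the agreement measures, not just what percentage it quotes.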
Do You Know How Critical Your Systems Are?
So how do you decide what SLA is right for your applications? As organizations migrate more workloads to the cloud, or outsource any business process for that matter, it’s essential to determine how important each of their systems is.
- How critical is it to the organization?
- What is affected if it becomes unavailable?
- What are my expectations, or the expectations of the individuals who use the system, for availability?
- What’s the impact of an outage? Both financial and intangible damage to my brand? Or the cost of a potential lost customer?
Defining the Cost of an Outage
This is a simplistic example, but it describes a real world problem for many industries. If a company knows that it wants to earn $1 million in revenue and each object they sell earns $1,000, then they know they need to sell 1,000 objects. For this company to sell their goods, they require retail, order processing and/or ecommerce systems to be in place. If some of these systems are unavailable, the company would be hindered in making the necessary number of sales within the given time period. Depending on the industry, they may or may not be able to make up for this lost time.
If only the impact to your applications were as simple to calculate as multiplying a few sales. However, that doesn’t mean that you can’t gain a good understanding of an outage’s impact. Start by inventorying your processes and systems to understand all of the steps necessary from lead to order fulfilment (or revenue, depending on your industry and process). This will give you a visual aid for identifying and understanding your business applications. A great method for systems thinking can be found at www.drawtoast.com. I recently stepped through Concerto’s quoting and workload sizing process and was amazed at how many steps and tasks made up the actions that my team and I do every day.
Once you have your list, try to define the amount of time spent or the revenue generated by each process. From there, you can begin reviewing the availability needs of the applications. Do these systems require redundancy? What is the outage cost, and are the changes needed to implement higher availability cost-justified?
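To make that last question concrete, here is a rough sketch that compares the yearly outage cost at two availability levels with the price of moving between them. It borrows the $8,000 per hour figure from the customer story above purely as an illustration, assumes the full SLA allowance is actually consumed as downtime, and uses hypothetical upgrade prices; the function names are made up for this example.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours in an average year


def annual_outage_cost(uptime_percent, cost_per_hour):
    """Worst-case yearly outage cost if all allowed downtime is used."""
    downtime_hours = (100 - uptime_percent) / 100 * HOURS_PER_YEAR
    return downtime_hours * cost_per_hour


def upgrade_is_justified(current, target, cost_per_hour, extra_annual_cost):
    """True if the outage cost avoided exceeds the added yearly spend."""
    savings = (annual_outage_cost(current, cost_per_hour)
               - annual_outage_cost(target, cost_per_hour))
    return savings > extra_annual_cost


cost_per_hour = 8_000  # illustrative, from the anecdote above
for sla in (99.9, 99.99):
    print(f"{sla}%: ${annual_outage_cost(sla, cost_per_hour):,.0f} per year")

# Hypothetical price tag for the extra redundancy.
print(upgrade_is_justified(99.9, 99.99, cost_per_hour, extra_annual_cost=50_000))
```

Real outage costs are rarely this linear, since brand damage and lost customers don't scale neatly by the hour, but even a rough number like this gives you a baseline for the conversation about redundancy.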
After going through a cost exercise, you’ll discover how important it is to align your service level agreements and ensure that your internal service levels, as well as those of your cloud service providers, meet your business needs.
What interesting ways has your organization defined or tracked SLAs within your company? I’d love to hear about your journey to the cloud in the comments.