The Delta outage: 650 cancelled flights, more than 1200 delayed flights, thousands of frustrated customers, tens of millions of dollars in damages – plus untold reputational damage to one of the world’s most trusted airlines. All due to a catastrophic, cascading technical failure that apparently started with a “small fire” in Delta’s datacenter.
Multiple news outlets have relayed this story about the fire, but I can’t speak to how Delta has designed and deployed its IT network. I can, however, say three things for sure.
First, our hearts go out to Delta for having to go through the mother of all business disruptions. It’s a tribute to the organization’s leadership, tenacity and resourcefulness that just a few days later, they were back online and operating normally again.
Second, if what I’m reading is true, this entire mess may have been avoidable, or at least more easily contained.
Third, I was one of the Delta travellers who were inconvenienced by the outage last week. It wasn’t fun.
Since our inception in 2011, we’ve been promoting cloud services as a means to decrease an organization’s risk. Much of the current cloud conversation is around cybersecurity and how, in our datacenters, we deploy state-of-the-art security measures by employing world-class security experts who have a command of best practices, the digital threat landscape and compliance standards.
But what we, and I dare say other cloud service providers, do not talk about nearly as often is disaster prevention. The term disaster prevention goes beyond disaster recovery (DR) and data backups, and yet most companies aren’t prepared for the unexpected. We consider high availability in multiple datacenters to be “table stakes” in the modern cloud/infrastructure world. This outage is proof that it’s not.
According to the Disaster Recovery Preparedness Benchmark, more than 60% of those who took the survey do not have a fully documented DR plan. Another 40% admitted that the DR plan they currently have did not prove very useful when it was called on to respond to their worst disaster recovery scenario.
Unfortunately, floods, tornadoes, storms, earthquakes, blackouts, and yes, fires happen. Theft and sabotage are security concerns, too. When a datacenter gets physically compromised, very expensive hardware (not to mention the sensitive data that resides on it) has a way of walking out the door. And in cases of in-house (on-premises) datacenters, entire servers have been wiped at the hands of disgruntled IT staff.
And so, the lessons to learn from the Delta outage are:
1) Ensure physical security, safety and personnel security measures are in place, including appropriate background checks and security clearance for employees, partners and vendors.
2) Ensure there are rigorously tested, proven failover protocols in place. If you are working with a cloud provider, clearly understand their failover offerings. For Concerto environments, automatic failover to another data center is included for mission-critical applications. Many providers sell this as an add-on service.
3) Compare your own organization’s SLA with that of a proven cloud provider. Too many companies that manage their own datacenters do so without a defined SLA for their organization. Determine what is appropriate for your computing workloads and your risk should disaster strike.
4) IT leaders must balance the uptime requirements and risk across a myriad of applications, and I respectfully suggest you treat Delta’s story as a call to action. It may be time to conduct a comprehensive audit of your datacenter security and disaster protocols, just to be sure. And if/when your organization wants to reduce its risk with an uncompromising “four nines” SLA and disaster prevention services, we’ll be here to help you find higher ground.
5) Have a solid communication plan in place for after something bad happens. Hey, things will go wrong. If everyone knows what to do (and what to say) and how to make it up to customers, it will help minimize the impact.
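To put an SLA figure such as “four nines” in perspective, it can help to convert the availability percentage into the downtime it actually allows per year. The sketch below is only an illustration: the function name is hypothetical, it assumes a 365-day year, and real SLAs often measure availability per month and exclude scheduled maintenance, so check the fine print of any specific agreement.

```python
# Illustrative only: converts an availability percentage into a yearly downtime budget.
# Assumption: a 365-day year; actual SLAs may use monthly windows and exclusions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime allowed per year, in minutes, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for label, pct in [("two nines", 99.0), ("three nines", 99.9), ("four nines", 99.99)]:
        minutes = allowed_downtime_minutes(pct)
        print(f"{label} ({pct}%): about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, “four nines” (99.99%) works out to a little under an hour of downtime per year, while 99% allows more than three and a half days. Running the same calculation against your own in-house uptime record is a quick way to start the comparison suggested in lesson 3.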