How to protect your data pipeline against the next cloud outage
Every year, businesses grow increasingly dependent on cloud infrastructure to power their data stack. This year, cloud services are already set to process an incredible 94% of workloads. This massive growth is not surprising. Cloud computing offers many advantages such as greater flexibility, increased speed and agility, lower latency, cost optimization and more.
Yet although reliance on cloud services is expanding, the infrastructure businesses depend on is not always as reliable as many assume. Even major cloud providers, including AWS, Azure and Google, experience service outages and failures. And for modern, data-powered organizations, any “data downtime” translates into real costs.
Many businesses are choosing to accept the risks of cloud outages. However, without a strategy or solution in place to minimize the impacts of these outages, your business can end up losing data, users and revenue. In this article, we’ll look at three major cloud outages from the past year, the potential cost of a data outage, and how to safeguard your business against future outages.
Three of the biggest cloud outages of 2020
Data center outages are unpredictable, and even the most advanced technology is not failproof. Frequent causes of data center outages include human error, network failure, power outages and natural disasters. In 2020, three prominent cloud outages stood out:
Azure outage: March, 2020
Microsoft’s Azure, one of the fastest-growing clouds, went down on March 3 for six hours in its most popular region, US East, limiting the availability of Azure’s cloud services for a number of North American customers. The outage was caused by overheating in the data center after building controls malfunctioned. The cloud computing platform then experienced another outage on March 24, which was attributed to increased traffic from the COVID-19 pandemic.
AWS outage: November, 2020
Amazon Web Services (AWS), the most widely used cloud computing service in the world, experienced a 14-hour outage when “all servers in the fleet exceeded the maximum number of threads allowed by an operating system configuration”. The outage impacted thousands of businesses, ranging from Roku to The Washington Post. Apart from disruptions and data loss, businesses had to deal with frustrated customers trying to understand why their services weren’t working as expected.
GCP outage: December, 2020
Google Cloud experienced a widespread outage affecting many of its services, including BigQuery, Google Cloud Storage and Kubernetes. Google identified the cause as an accidental reduction of capacity on its central identity management system, which made requests that required authentication fail. This affected Google Workspace services, including Gmail, Calendar, Docs and Drive, as well as the Google Cloud Platform.
Even though the outage lasted for only 47 minutes, it was enough to show that even the biggest tech companies in the world are not immune to errors. Google experienced at least two other significant outages in 2020, but the December outage was notable because it impacted organizations running business-critical services, as well as personal applications.
Although none of these outages proved to be catastrophic, they do remind users that cloud outages are not isolated incidents, and many end up leaving businesses dealing with unexpected costs.
The cost of data downtime on your business
A data center outage can have a ripple effect on your business depending on the duration of the outage and affected services. According to Gartner, data downtime costs an average of $5,600 per minute, depending on factors such as business size, industry and business model. Let’s look at four business areas data outages can affect and the possible implications of each.
When a cloud service provider experiences an outage, data collected by the business using the cloud provider might get dropped before it lands in its storage target. What’s more, the provider might not be able to collect any data at all, which could lead to data that is lost for good.
Data loss has a direct impact on the quality of your data, thus affecting the quality of your data products as well. This is especially disruptive for businesses relying on the data for reporting or key use cases such as machine learning applications, recommendation engines, marketing attribution and analytics products. For example, following an AWS outage in 2011, content analytics company Chartbeat had to inform customers that they would lose approximately 11 hours of historical data and experience gaps in their timeline.
Modern businesses build their applications and services on top of large data sets hosted in data centers. If the data center goes down and the data pipeline fails, the applications go down with it, crippling the business. During the outage, your website and applications are no longer accessible, which can translate into lost data, sales and customers, and negatively impact your brand’s reputation.
Impact on brand reputation
Application downtime not only affects sales and revenue, but can also impact a brand’s reputation. Users might overlook one or two service disruptions, but frequent incidents can cause your brand to lose credibility in the eyes of your customers, who may turn to competitors with more reliable services. Once a brand’s reputation takes a hit, it can be hard to win back.
Lost revenue: every minute counts
The combination of service disruptions, data loss and a damaged brand reputation translates into lost revenue because of the impact these factors have on your sales and customer retention. Depending on the business model, size and vertical, the average cost can range from $140,000 to $540,000 per hour.
Industries considered most at risk include retail, media, healthcare, banking/finance and transportation. When it comes to company size, one survey found that for large enterprises with over 1,000 employees, a single hour of downtime per year costs over $100,000, while 81% of organizations report that the cost exceeds $300,000. Even more significantly, downtime costs 33% of enterprises $1 million or more per hour.
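To make these figures more concrete, here is a rough back-of-the-envelope sketch in Python. It applies the Gartner average of $5,600 per minute quoted earlier to the durations of the three 2020 outages described above. The function name and the idea of multiplying a flat per-minute average by outage duration are our own simplifying assumptions; real costs vary widely by business, and not every customer was affected for the full duration.

```python
# Illustrative only: a flat per-minute average applied to outage duration.
# The $5,600 figure is the Gartner average quoted in this article; the
# function name and the linear cost model are assumptions for this sketch.
GARTNER_AVG_COST_PER_MINUTE_USD = 5_600


def estimate_downtime_cost(minutes, cost_per_minute=GARTNER_AVG_COST_PER_MINUTE_USD):
    """Return a rough downtime cost in USD for an outage of `minutes` minutes."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# Durations taken from the outage descriptions above.
outages_in_minutes = {
    "Azure (March 2020)": 6 * 60,
    "AWS (November 2020)": 14 * 60,
    "Google Cloud (December 2020)": 47,
}

for name, minutes in outages_in_minutes.items():
    print(f"{name}: ${estimate_downtime_cost(minutes):,}")

# One hour at the average rate, for comparison with the hourly range above.
print(f"Per hour: ${estimate_downtime_cost(60):,}")
```

At the average rate, one hour of downtime works out to $336,000, which sits inside the $140,000 to $540,000 hourly range mentioned above.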
Cloud infrastructure outages are unpredictable and costly, and any business relying on data should have a strategy in place to minimize these risks. The next section covers different approaches you can take to protect your business from data loss, even if your cloud provider experiences a severe outage.
Protecting your business from data outages
Many businesses are starting to realize the risks associated with cloud infrastructure outages. The good news is that several approaches exist to safeguard against unexpected outages.
Taking a multi-cloud approach means leveraging cloud services from more than one provider. For example, a single business could use a mix of AWS, GCP and Azure cloud services depending on which cloud is the best fit for each workload. You can also protect yourself against an outage by setting up workloads so they can move between two clouds: if one provider goes down, you can immediately shift the workload to the other.
Other benefits of this approach include lower costs, higher performance, greater flexibility and not getting locked into a particular service or ecosystem. However, one drawback of taking a multi-cloud approach is that once you split your workloads across many clouds, you need to put additional effort into managing security, governance and resources. To find out more about the benefits and challenges of going multi-cloud, you can download our guide here.
A hybrid cloud approach is similar to multi-cloud, with one key difference. Instead of leveraging only public cloud services, you add a private cloud to your data infrastructure. This mix of a multi-tenant (public) and single-tenant (private) cloud strategy also gives your business greater flexibility to move workloads between clouds as costs or needs fluctuate.
What’s more, a hybrid cloud approach helps protect your business against data downtime by replicating business-critical data to the cloud and ensuring scalability in the event of a massive spike in demand.
An added benefit of a hybrid cloud approach is that it gives businesses more control over data security, making it possible to house business-critical, sensitive data on their private, on-premises servers while offloading less sensitive data and applications to the public cloud.
Architecting multi-region pipelines within the same cloud is another means of achieving highly available systems in the face of regional downtime. Multi-region works like multi-cloud, but instead of moving workloads between clouds, you move workloads between different regions within the same cloud. This means your pipeline can span multiple distinct regions, so in the event of a regional failure (or even a multi-region failure) data will still be able to land in a safe place, allowing your business to avoid costly data loss.
Multi-region is easier to implement than multi-cloud, because moving a workload to another region is less difficult than moving it to a separate cloud, which uses different APIs, SDKs and workflows. As with the multi-cloud approach, high availability and redundancy come at a price, so you will likely need someone dedicated full-time to managing your architecture and making sure it works in the most cost-effective way.
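As a minimal sketch of the multi-region idea, the Python snippet below routes each event to the first healthy region in a priority list. Everything in it is hypothetical and chosen for illustration: the function names (choose_region, route_event), the injected is_healthy and send callables, and the region names. It is not how any particular cloud provider or vendor implements failover; in practice, health checks and routing are usually handled by DNS or load-balancing services.

```python
from typing import Any, Callable, Optional, Sequence


def choose_region(regions: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in regions:
        if is_healthy(region):
            return region
    return None


def route_event(
    event: Any,
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
    send: Callable[[str, Any], None],
) -> str:
    """Send `event` to the first healthy region and return that region's name."""
    region = choose_region(regions, is_healthy)
    if region is None:
        raise RuntimeError("no healthy region available")
    send(region, event)
    return region


# Simulated outage in the main region (hypothetical region names).
regions_by_priority = ["us-east-1", "us-west-2"]
regions_down = {"us-east-1"}
delivered = []

used_region = route_event(
    {"event_id": 1},
    regions_by_priority,
    is_healthy=lambda region: region not in regions_down,
    send=lambda region, event: delivered.append((region, event)),
)
print(used_region)
```

Keeping the health check and the send function as parameters makes the routing logic easy to test without touching real infrastructure.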
Snowplow Outage Protection
At Snowplow, we believe outage protection should be built into your data collection. From our experience working with hundreds of data-powered businesses, we know any loss of data can have significant and costly consequences.
We saw firsthand how November’s AWS outage affected customers, and in response, we decided to build an optional Outage Protection solution into Snowplow BDP. With this multi-region approach, pipelines are set up in “backup” regions. If a severe, unexpected and prolonged outage hits the main region your pipeline runs in, traffic is immediately re-routed to a backup region, minimizing the risk of data loss.
Snowplow outage-protected pipelines in backup regions are deliberately minimally specced to keep additional costs low, but they scale up quickly in the event of an outage so that data loss is minimized. For example, during November’s AWS outage, Outage Protection would have cut your downtime from several hours to a few seconds.