Use site reliability engineering to address cloud instability


Cloud platforms, as remotely managed services, come with a service-level agreement (SLA) that guarantees an uptime percentage or your money back. These SLAs, together with the shift of responsibility for infrastructure maintenance from your organisation or colocation provider to your cloud providers, have prompted an expectation that cloud services will “just work”, even though reality often falls short of that.
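To put an uptime percentage in perspective, it helps to convert it into the downtime it still permits. The following is a minimal sketch, not any provider's actual SLA calculation: the function name `allowed_downtime_minutes`, the 30-day billing period and the example targets are all assumptions chosen for illustration.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime an uptime percentage permits over a period.

    Hypothetical helper: real SLAs define their own measurement periods
    and exclusions, which this arithmetic ignores.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60  # minutes in the assumed period
    return total_minutes * (1 - uptime_percent / 100)


if __name__ == "__main__":
    # Example targets only; they are not taken from any specific SLA.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per 30 days")
```

Under these assumptions, a 99.9% target still leaves room for roughly 43 minutes of outage in a 30-day month, which is one reason an SLA on its own does not mean a service will “just work”.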

Computing infrastructure has become faster and cheaper over time, but a server today is not meaningfully more reliable than a decade ago, because the root causes of outages are often environmental or the result of third-party error.

Some outages over the past two years have been eyebrow-raising in their origin, effect or circumstances.

The fire that destroyed OVHcloud’s Strasbourg SBG2 facility in March 2021 was the result of a faulty repair to an uninterruptible power supply. Cooling systems failed to keep pace with the London heatwave in July 2022, leading to outages at Google Cloud Platform and Oracle Cloud Infrastructure. Although not cloud-specific, the 2020 Nashville bombing damaged a significant amount of telecoms equipment, leading to regional outages.

Given a rise in global temperatures owing to climate change – and a rise in political temperatures – the potential for climate- or extremism-related outages is real.

Of course, comparatively mundane factors also lead to outages, such as bad software deployments, software supply chain problems, power failures and networking issues ranging in severity from tripped-over cables to fibre cuts. Naturally, no discussion of outages would be complete without a mention of DNS- and BGP-related outages, which were cited as the root cause of incidents at Microsoft Azure, Salesforce, Facebook and Rogers Communications over the past two years.

Engineer like a storm is coming

You need site reliability engineering

Assume the worst, but hope for the best
