Service outages are a fact of life for enterprise IT. In the age of cloud computing and connected networks, vendors and consumers rely on resilient systems, backups, and a range of disaster mitigation measures to limit the risk of an outage. Despite this, some of the largest network outages have struck the pioneers of cloud infrastructure technology, whose platforms power the majority of the world’s technology-driven businesses. Here are nine of the biggest network outages of the decade:

1. Dyn Cyberattack:

On October 21, 2016, a series of distributed denial-of-service (DDoS) attacks targeted Dyn’s DNS services. The attacks made major Internet platforms and services unavailable to large numbers of users in Europe and North America. The hacktivist collectives Anonymous and New World Hackers claimed responsibility, but there was little evidence to back up their claims. As a DNS provider, Dyn translates an Internet domain name into its matching IP address, for example when a user enters the name into a web browser. The attackers flooded Dyn with DNS lookup requests from tens of millions of IP addresses. The attacks are thought to have been carried out by a botnet of Internet-connected devices infected with the Mirai malware, such as printers, IP cameras, residential gateways, and baby monitors.
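To make the name-to-address translation concrete, here is a minimal Python sketch of a DNS lookup using the standard library. The helper name `resolve` and its injectable `lookup` parameter are assumptions made for illustration; they are not part of Dyn’s service. When the resolver cannot answer, as many users experienced during the attack, the function simply returns an empty list.

```python
import socket


def resolve(hostname, port=443, lookup=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for a hostname.

    `lookup` defaults to the system resolver; it is a parameter only so
    the function can be exercised without network access (an assumption
    of this sketch, not a standard pattern required by the library).
    """
    try:
        results = lookup(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The name could not be resolved, e.g. the DNS provider is unreachable.
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in results})


if __name__ == "__main__":
    # Requires network access; the addresses returned will vary.
    print(resolve("example.com"))
```

A site whose domain was served only by the affected provider would have seen the empty-list case for many users, even though its own servers were still running.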

2. British Airways IT Failure:

On August 7, 2019, a problem with its check-in systems forced British Airways to cancel over 100 flights from Heathrow and Gatwick (117 and 10 flights respectively) and delayed more than 200 flights across Britain’s major airports. The breakdown is believed to have affected two of BA’s systems, one for check-in and the other for flight departures. Check-in had to be done manually, resulting in lengthy delays. British Airways paid more than £8 million in compensation to customers stranded or otherwise affected by the outage. This was not the first time British Airways had experienced an IT problem. In 2018, the airline was the victim of a cyber-attack that exposed the personal information of approximately 500,000 people, far more than was initially assumed. In 2017, an IT outage over the May bank holiday weekend cost IAG (the airline’s parent company) about £80 million.

3. Amazon Web Services:

A year after the huge AWS S3 outage in February 2017, AWS customers such as Atlassian, Slack, and Twilio, which supply crucial enterprise IT solutions, faced disruption in March 2018. High error rates in Amazon’s S3 web-based storage service caused partial or complete outages on several popular websites, apps, and devices. Quora, Business Insider, and Slack were among the sites affected. The problem is believed to have started in the US-East-1 region and might have been averted if developers had spread their applications across multiple facilities. This was not an isolated incident. In September 2015, the service experienced similar interruptions owing to software issues at a data center in Northern Virginia.
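The idea of spreading an application across facilities can be sketched as a simple failover loop: try the preferred region first and fall back to the next one when it fails. This is a minimal illustration, not how any of the affected companies built their systems. The helper `fetch_with_failover` and the `fetch` callable are hypothetical; the region names are used only as labels.

```python
from typing import Callable, Sequence, Tuple


def fetch_with_failover(
    regions: Sequence[str],
    fetch: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Try each region in order and return (region, data) from the first success.

    `fetch` is a hypothetical callable that retrieves the data from one
    region and raises OSError (e.g. ConnectionError) when it is unavailable.
    """
    failed = []
    for region in regions:
        try:
            return region, fetch(region)
        except OSError:
            # This region is unavailable; remember it and try the next one.
            failed.append(region)
    raise RuntimeError(f"all regions failed: {failed}")


if __name__ == "__main__":
    def demo_fetch(region: str) -> bytes:
        # Simulate an outage in the primary region only.
        if region == "us-east-1":
            raise ConnectionError("region unavailable")
        return b"payload from " + region.encode()

    print(fetch_with_failover(["us-east-1", "us-west-2"], demo_fetch))
```

Failover like this only helps if the data is actually replicated to the second region ahead of time, which is where most of the real cost and design effort lies.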

4. Vodafone Data Centre:

Businesses spend millions of dollars each year to protect their data servers from cyber-attacks. Given the ever-changing nature of cyber threats, this is money well spent. Some businesses, however, may underestimate the importance of physical security. In 2011, Vodafone discovered this the hard way when burglars armed with sledgehammers broke into its Hampshire data center. The intrusion caused significant service disruptions and prompted a barrage of angry customer complaints.

5. Microsoft Azure:

The summer of 2018 was the warmest on record in the Nordics, an area popular with cloud service providers because of the nearly endless free cooling available for data center hardware. When the temperature reached 18°C in the Ireland region, Microsoft Azure had an outage. The unusually warm weather caused a water shortage for locals and left Microsoft with insufficient cooling to keep its Dublin data center resources operating at optimal temperatures. As a result, the data center service was unavailable for approximately 9 hours, affecting Azure and Office 365 subscribers in Northern Europe.

6. Google Cloud:

Google, like many major cloud service providers, has struggled to supply infrastructure services to an ever-growing user base. Customers like Spotify, Discord, the Pokemon Go app, and Snapchat rely on these cloud networking services to reach a global audience. In July 2018, approximately 87 percent of customers had issues with the App Engine, HTTPS Load Balancer, or TCP/SSL Proxy Load Balancer solutions during an outage that lasted about 30 minutes. The Google App Engine, Stackdriver, Dialogflow, and Global Load Balancers were among the services affected. As is standard, cloud vendors issued credit refunds to affected customers in line with the Service Level Agreement (SLA). However, according to a Ponemon Institute study, the actual cost of data center downtime, which averaged roughly $750,000 in 2015, greatly exceeded the compensation offered.

7. O2 Outage:

Customers of O2 mobile services in the UK were affected by the most extensive network outage in terms of scope. The outage, which began in the early hours of December 6, 2018, left 30 million customers without Internet access. It lasted the entire day and was triggered by a malfunction in Ericsson networking equipment that served multiple carriers worldwide. Given the scale of the problem, Ericsson moved quickly to fix it and later decommissioned the defective software. According to a thorough investigation, the root cause was an expired certificate in the software versions installed with the affected customers, notably O2. Telecommunication services use security certificates to validate the legitimacy of network traffic routing and security decisions made at different tiers of the communication infrastructure. Once the source of the problem was identified, all users’ services were swiftly restored. O2 had also experienced a network outage in October 2018 that affected millions of subscribers, but that one lasted only 40 minutes.
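Certificate expiry is one of the few outage causes that can be monitored with a simple date check. The sketch below uses Python’s standard `ssl.cert_time_to_seconds` to compute how many days remain before a certificate’s `notAfter` date. The helper names, the 30-day warning threshold, and the example dates are assumptions chosen for illustration; they do not reflect the certificate involved in the O2 incident.

```python
import ssl
import time

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after, now=None):
    """Return the number of days until a certificate expires.

    `not_after` uses the format found in the 'notAfter' field of the
    dictionary returned by SSLSocket.getpeercert(), for example
    "Jan 11 00:00:00 2030 GMT". A negative result means the certificate
    has already expired.
    """
    if now is None:
        now = time.time()
    expires_at = ssl.cert_time_to_seconds(not_after)
    return (expires_at - now) / SECONDS_PER_DAY


def is_expiring_soon(not_after, warn_days=30, now=None):
    """True if the certificate expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


if __name__ == "__main__":
    # Hypothetical expiry date, checked against the current time.
    print(is_expiring_soon("Jan 11 00:00:00 2030 GMT"))
```

Running a check like this on a schedule, with an alert well before the threshold is reached, gives operators time to renew a certificate before dependent services start rejecting it.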

8. CenturyLink:

Millions of users could not call 911, make ATM withdrawals, access sensitive patient healthcare records, use Verizon mobile broadband services, or even participate in lottery drawings during the CenturyLink outage, the most significant network outage of 2018. Following the incident, the FCC launched an investigation into the “unacceptable” outage, which had disrupted services as critical as 911 emergency calls. The outage lasted two days and was caused by a single malfunctioning network management card, which was found to be sending malformed packets across the network.

9. Deutsche Telekom:

On November 27, 2016, network failures hit as many as 900,000 Deutsche Telekom customers in Germany, around 4.5 percent of the company’s 20 million fixed-line subscribers. The internet disruptions lasted until the next day, when the number of affected customers began to drop rapidly. According to Thomas Tschersich, Deutsche Telekom’s head of IT security, the failures appeared to be linked to a bungled attempt to turn a large number of customers’ routers into part of the Mirai botnet. Mirai is a malicious program that transforms network equipment into remotely controlled “bots” that can be used to launch large-scale network attacks.

Over the last decade, the dramatic expansion of mobile usage and Wi-Fi connectivity has created new connectivity needs and problems for businesses and network providers alike. Much has been done to secure network infrastructure and develop software that can protect against emerging attacks, but comparable failures could still occur in the future. Such events are uncommon, yet their sudden and unpredictable nature can expose even the smallest network or security weaknesses. It is therefore critical that organizations and suppliers plan for them. Most firms already carry out risk assessment, planning, and provisioning, but to be truly prepared they must ensure that their IT architecture has flexibility, security, and resilience built in.
