One errant keystroke to blame
SEATTLE — A simple typo caused Tuesday's internet meltdown, which affected thousands of websites.
Amazon’s Simple Storage Service (S3) — a cloud storage provider dubbed the backbone of the internet because more than 150,000 websites, including giants like Netflix and Pinterest, use it to store video, data and other content — went down for more than four hours, taking many of its clients down with it.
Amazon said on Thursday that it has gotten to the bottom of what caused the blackout: a small typing error.
“Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended,” the Seattle-based company said in a statement.
Amazon explained that its team was looking into an issue that caused its S3 billing system to run slowly. To figure out what caused the slowdown, Amazon allowed its staffers to take a small number of its servers offline — a move that wouldn’t have been a big deal because the few selected servers don’t host crucial operational mechanisms.
But an Amazon employee accidentally entered a typo into the command to take down the servers, and a whole lot more were knocked offline. At least two of those mistakenly shuttered servers contained vital systems: One held information about the location of East Coast customers’ data and the other supported storage for the massive number of files that S3 hosts.
To get the mistakenly dismantled servers back online, Amazon restarted the systems, a process that took more than four hours and left many S3-supported websites in the dark.
A blackout about every two years
Since 150,714 websites use S3, the consequences of the blackout were far-reaching.
News site Business Insider, GIF-maker Giphy, question-and-answer site Quora, blog platform Medium and newsletter provider Sailthru all had glitches. Teams couldn’t share files in Slack, the popular workplace communication tool. Even some smart light bulbs and thermostats went down.
Still, such blackouts are not unheard of: one seems to make major headlines about every other year.
In August 2015, an S3 server in Virginia glitched, leaving much of the Northeast without access to Netflix and Reddit for about five hours. The early-morning outage also affected movie-information website IMDb, smart home company Nest and a handful of other sites.
A 2013 outage took down social media sites Vine and Instagram.
‘Incredibly reliable’ but still room for improvement
In the wake of the 2017 blackout, Amazon said that it’s working on system changes that will allow servers to reboot more quickly. If a server goes down in the future, it should be able to recover faster, potentially shaving hours off the outage time.
The company also vowed to increase safety checks and run more regular maintenance on S3. Before the outage, some of the servers had gone several years without a proper reboot, Amazon admitted.
And Amazon is cracking down on typos. To ensure that staffers don’t accidentally do damage with an errant keystroke, Amazon has rewritten its code so that engineers can’t remove a server if doing so would take S3’s operations down with it.
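For readers curious what such a safeguard might look like, here is a minimal Python sketch of the general idea: refuse any removal request that would leave a system with fewer servers than it needs to keep running. It is an illustration only, not Amazon's actual code. The names `Subsystem`, `check_removal` and `UnsafeRemovalError`, and the notion of a fixed `min_required` floor, are all assumptions made for the example.

```python
from dataclasses import dataclass


class UnsafeRemovalError(Exception):
    """Raised when a removal would leave a subsystem under-provisioned."""


@dataclass(frozen=True)
class Subsystem:
    name: str
    active_servers: int  # servers currently in service
    min_required: int    # hypothetical floor needed to keep the subsystem running


def check_removal(subsystem: Subsystem, servers_to_remove: int) -> int:
    """Return how many servers would remain, or raise if the removal is unsafe."""
    if servers_to_remove < 0:
        raise ValueError("servers_to_remove must be non-negative")

    remaining = subsystem.active_servers - servers_to_remove
    if remaining < subsystem.min_required:
        # Reject the request outright instead of trusting the operator's input.
        raise UnsafeRemovalError(
            f"removing {servers_to_remove} servers from {subsystem.name} "
            f"would leave {remaining}, below the minimum of {subsystem.min_required}"
        )
    return remaining
```

Under these assumptions, a system with 100 servers and a floor of 80 would accept a request to remove 5 but reject a mistyped request to remove 50, so a single wrong digit fails loudly rather than taking the service down.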
“We want to apologize for the impact this event caused for our customers,” Amazon said in its statement. “We will do everything we can to learn from this event and use it to improve our availability even further.”
But don’t take the outage — and Amazon’s promise to improve — as signs that cloud storage isn’t safe. Instead, the rare failure should highlight how consistent service has been, said Dave Bartoletti, an analyst for business management consultant Forrester.
“I don’t think it fundamentally changes how incredibly reliable the S3 service has been,” he said.