Yesterday, Amazon.com, arguably one of the world’s most popular websites, went down for a period of time. There are no indications (yet) that other websites were affected.
As many network engineers know, Amazon maintains massive data centers around the world in order to keep its website up and running. Even just a few minutes of downtime has massive financial implications and results in thousands of dollars in lost revenue. It only makes sense, therefore, that Amazon would also sell Cloud Computing services to other companies.
Unknown to many, Amazon is actually one of the world’s largest Web Hosting and Cloud Computing providers. Thousands upon thousands of companies host their websites on Amazon’s platform because Amazon is known as one of the most reliable companies in the web hosting business for high-capacity needs.
While no other websites went down during yesterday’s outage (that we know of), Amazon did experience a massive network outage beginning on April 21, 2011, which affected thousands of its customers but not its own website. Some of these large websites were down for up to 3 days over that infamous weekend!
Other large companies have experienced outages recently. In September of last year, GoDaddy’s DNS servers experienced a 6-hour outage, which affected potentially millions of its customers. On April 3rd, CNBC published an article indicating that websites for huge banks had gone down for hundreds of hours in recent weeks.
There are normally just a few reasons why large websites fail:
- Websites fail because of internal mistakes
- Websites fail because of external attacks
Websites fail because of internal mistakes
According to the Visible Ops Handbook, on average, almost 80% of outages related to Information Technology are self-inflicted. According to Amazon’s write-up from their network outage in 2011, the cause was an employee’s mistake. Similarly, GoDaddy admitted that their DNS outage from last fall was due to internal issues.
All computer systems and networks are built by fallible humans. No human is perfect, and no computer network is perfect. As a result, companies have spent millions (billions?) of dollars building redundant systems and safeguards to minimize the risk of an outage actually taking place.
But one of the best ways to prevent internal mistakes from taking down a system, according to ITIL (Information Technology Infrastructure Library) best practices, is to implement strict change management policies.
Another way to prevent internal mistakes from crashing your website, or worse, is to not stretch your employees too thin. According to an article from 2012, workplace fatigue causes up to $31 billion in extra costs. Similar arguments could be made for employees who are forced to multi-task too much.
As a nonprofit, your resources are limited. But can you really afford for your website (or office network, even) to go down for 3 days or longer? Our goal is to simplify information technology (including web hosting) for our nonprofit clients, so that their employees can focus on their core competencies. A web developer is often not a server administrator or network engineer, just as an accountant is not a mechanic. You wouldn’t want your accountant to fix your car engine, would you?
Try to let your employees focus on their skills, and leave the rest to other employees or third-party providers who specialize in the needed skill.
Websites fail because of external attacks
If 80% of IT outages are self-inflicted, then where do the other 20% of outages come from? Even the most careful planning cannot prevent all IT outages from occurring. Between natural disasters, power outages, and other external forces such as black-hat attackers (bad hackers), there are many forces you must reckon with to keep the network and server that host your website operational.
As mentioned before, there are certainly steps you can take to mitigate the impact of these unknown external forces, such as:
Building Redundant Systems in Geographically Diverse Locations
Integrating servers and networks so that they automatically “fail over” to a backup system when the primary system goes offline is a widely accepted practice. In many cases, it is standard practice.
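For readers who want to see the idea in concrete terms, here is a minimal sketch of the fail-over decision in Python. It is only an illustration under stated assumptions, not how any particular hosting provider implements it: the hostnames, the pick_active_server function, and the simulated health check are all hypothetical. A real setup would typically rely on a load balancer or DNS fail-over service rather than a script like this.

```python
from typing import Callable, Optional, Sequence


def pick_active_server(
    servers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy server in priority order, or None if all are down."""
    for server in servers:
        try:
            if is_healthy(server):
                return server
        except Exception:
            # A health check that raises an error counts as "down";
            # move on to the next server in the list.
            continue
    return None


# Hypothetical hostnames in two different locations, primary first.
SERVERS = ["primary.example.org", "backup.example.org"]

# Simulated health state (assumption for this sketch). A real check
# might request a status page over HTTP with a short timeout.
status = {"primary.example.org": False, "backup.example.org": True}

active = pick_active_server(SERVERS, lambda host: status[host])
print(active)  # prints "backup.example.org" because the primary is marked down
```

The health check is passed in as a function so that the selection logic stays the same whether the check is a simple ping, an HTTP request, or something more thorough.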
Clearly documenting your infrastructure, Internet Service Provider information, and network topology
In the event that your website comes under a Distributed Denial of Service (DDoS) attack, the last thing you want to do is scramble to find your account numbers and technical support information for your website’s hosting provider and/or your internet service provider. Once a DDoS reaches your web server, it is too late. Unless you also maintain a robust network (including a hardware firewall) in front of your web server, you have no way to bring your website back online by yourself (unless the DDoS attack goes away). In events such as a DDoS, you must contact your web hosting provider and/or your internet service provider, and they must be willing to spend the time to mitigate the attack. Your server does not have enough resources by itself to mitigate a DDoS, and it never will.
Be ready to call the experts, and have your account numbers, telephone numbers, and other pertinent information clearly documented.
It is a fact of life that websites and networks can, and do, fail. No website is perfect, and as we saw yesterday, even some of the most successful websites are prone to outages.
These are just some of the things your nonprofit organization can do to prepare for unwanted disaster.