The principals at Lantern Three have been collaborating on a newsletter for the past month. We intended to send it out this morning, but instead I discovered that Constant Contact was down (incidentally, a terrible 404 page; this is where fail whales and exploding robots can pay off in spades):
Shortly after 10am Pacific, it appeared their marketing website, applications, and APIs were down; I skimmed their whois record:
Then I notified their admin email (which, in their shoes, I would appreciate):
I never received a reply, which I understand. As it turned out, their website and services were crippled by a car that crashed into a power pole near their data center. This email marketing firehose — née spam cannon — costs its customers no less than $15/month!
The outage continued throughout the day, well into the evening:
As a customer, Project Manager, and career IT professional, I appreciate the regular updates through their blog and Twitter, but I don’t think they were sufficient. Constant Contact should have tested that redundant systems failed over properly, and drilled their DR plan to ensure they could recover quickly in just such an event. They owe their customers a full root-cause failure analysis, delivered publicly within 30 days. They should also consider a pro-rated refund of April service charges to each subscriber.
Learn from Constant Contact’s oversight
- Have a disaster recovery plan that covers the spectrum from neutron bomb to car vs power pole
- Update your disaster recovery plan no less than quarterly
- Drill your disaster recovery plan no less than once a year
- If you do drop the ball, keep your customers up to date and resolve the issue to their satisfaction
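The review and drill cadences above are easy to let slip, so it can help to check them automatically. What follows is only a minimal sketch, assuming you record the dates of your last plan review and last drill somewhere. The function name `dr_plan_overdue`, the 92-day approximation of a quarter, and the sample dates are all hypothetical and chosen for illustration.

```python
from datetime import date, timedelta
from typing import List

# Assumed thresholds based on the cadences in the list above:
# review at least quarterly (approximated as 92 days), drill at least yearly.
MAX_REVIEW_AGE = timedelta(days=92)
MAX_DRILL_AGE = timedelta(days=365)


def dr_plan_overdue(last_review: date, last_drill: date, today: date) -> List[str]:
    """Return the DR activities that are overdue (empty list if none are)."""
    overdue = []
    if today - last_review > MAX_REVIEW_AGE:
        overdue.append("review")
    if today - last_drill > MAX_DRILL_AGE:
        overdue.append("drill")
    return overdue


if __name__ == "__main__":
    # Hypothetical dates, for illustration only.
    for item in dr_plan_overdue(date(2014, 1, 6), date(2013, 3, 1), date(2014, 4, 19)):
        print(f"Disaster recovery {item} is overdue")
```

Run from a scheduled job, a check like this could nag the team before the plan goes stale, though it says nothing about whether the plan itself is any good.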
Constant Contact updated their blog with additional details on the outage:
UPDATE: 4/19/2014 at 5:32 p.m. EDT – Friday morning around 10:32 am ET, our primary service site experienced a major power disruption. Many of the redundant systems that should have kicked in immediately failed to do so. We do not yet know why but are working with our data center provider to get to the bottom of this. The power outage caused our systems, as well as the systems of other companies hosted at the site, to shut down. Based on having 90 minutes of unstable power and the abruptness of the way our systems shut down, we had to completely restart all systems. We did this to ensure the integrity of our customers’ data, and because methodically restarting all applications was the best way to make sure we got everything running in a safe and stable way. We were able to restore our website first. The additional work of shutting down all other applications, restarting them, and verifying their status took us until 1 am Saturday morning. At all times, your account information and data was fully secure. We are actively working with the data center facility to learn what went wrong and plan a full assessment of our own systems to ensure that this does not happen again. We anticipate having more information in the coming days to share with you. We appreciate your patience.