Uptime is Everything
For web-centric companies like HootSuite, last Thursday was a worst-case scenario. With major service interruptions to Amazon Web Services, HootSuite was down for approximately 15 hours until our engineers restored service.
In general, we enjoy stellar performance with minimal outages on either HootSuite or Owly (our URL shortener). We now serve over 3 million social networks sending over a million updates per day with almost zero downtime. We have also held strong during very active posting periods, including the Japanese earthquake.
We know how important uptime is for you and truly appreciate the kind words from our users who missed using HootSuite. Many of you rely on HootSuite for your business and we take your trust seriously. As such, we’re taking every step we can to prevent future mishaps.
What follows are notes about what we’re doing to “make it right” plus a technical breakdown about “what happened.”
Making it Right
Our Terms of Service state that we’ll provide refunds after a 24-hour outage. While this outage was significantly shorter, we acknowledge users were inconvenienced and we want to make things right.
With this in mind, we are offering a 50 Point credit (value $50) for the HootSuite Social Analytics tool to all users. Redeem your credit by May 13th, 2011 by using coupon code: HOOTREPORT and use the report credit within a month. For Pro and Enterprise customers, we’ll reach out via email with an additional coupon.
Note: After redeeming your coupon code, you’ll see it itemized on your invoice but the total won’t be updated until the billing date.
We are taking steps to increase the redundancy of our services and data across multiple geographic regions. This was an unusual outage that is highly unlikely to occur again, but we’ll be even more prepared for future emergencies.
Thanks for your continued support,
Ryan Holmes, CEO HootSuite
And now for technical updates…
It’s important to note that we enjoy a great relationship with Amazon Web Services and that HootSuite was able to grow quickly due to their cloud computing offerings. However, technology can fail, and in this case the cloud zone hosting HootSuite went down (see Amazon Health chart). We were able to restore service well in advance of the affected zones coming back online.
In brief, HootSuite has backups across multiple availability zones within Amazon’s North Virginia data center. We restored service relatively quickly by rebuilding our infrastructure into a new zone using backups of data.
Whatever happened on Amazon’s end is out of our control but, for your interest, here is a technical post-mortem from HootSuite CTO, Simon Stanlake:
“At approximately 01:00h PDT on Thursday morning (April 21) HootSuite began experiencing issues accessing EBS volumes on several of its AWS hosted instances. Critically, this included our production and slave database servers.
Following Amazon’s recommended best practices, we keep copies of our database across multiple availability zones at their North Virginia data center. Storing across multiple availability zones is meant to keep data always available, since availability zones are engineered to be highly reliable and independent of each other (see Amazon AWS FAQ). However, in this case the outage affected EBS volumes on all availability zones, so we were essentially forced to sit and wait while Amazon worked to restore EBS access.
In the late afternoon on Thursday, EBS access began returning to 3 out of 4 availability zones in the North Virginia region. However, the remaining affected availability zone contained our production database, which was still not responding.
After waiting for several more hours with no sign of improvement we made the decision to cut over to a backup copy of the database in a working zone and re-spun our entire infrastructure there. This brought HootSuite back online at approximately 19:00h PDT but necessitated rolling back any changes made after 22:30h PDT on Tuesday April 19. We hated to make this decision but it turned out to be the best option, as Amazon did not restore service to our production database until 03:57h PDT, Sunday April 24.”
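For readers who want to see the shape of that cut-over decision, here is a minimal Python sketch. It is not HootSuite’s actual tooling: the `Backup` type, the `pick_restore_backup` helper, the zone labels and every backup time other than the Tuesday 22:30h copy are assumptions made up for illustration. It simply picks the newest backup held in a zone that is responding and works out how much data the rollback gives up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Backup:
    zone: str           # hypothetical zone label, not a real AWS identifier
    taken_at: datetime  # when this copy of the database was taken


def pick_restore_backup(backups: Iterable[Backup],
                        healthy_zones: Set[str]) -> Optional[Backup]:
    """Return the newest backup stored in a zone that is currently healthy."""
    candidates = [b for b in backups if b.zone in healthy_zones]
    return max(candidates, key=lambda b: b.taken_at, default=None)


# Illustrative data only: just the April 19, 22:30h copy comes from the post.
backups = [
    Backup("zone-1", datetime(2011, 4, 18, 22, 30)),  # assumed older copy
    Backup("zone-2", datetime(2011, 4, 19, 22, 30)),  # copy described in the post
    Backup("zone-4", datetime(2011, 4, 20, 22, 30)),  # assumed copy in the failed zone
]
healthy_zones = {"zone-1", "zone-2", "zone-3"}  # 3 of 4 zones had recovered

chosen = pick_restore_backup(backups, healthy_zones)

# Service came back at about 19:00h PDT on Thursday, April 21.
restored_at = datetime(2011, 4, 21, 19, 0)
rollback_window = restored_at - chosen.taken_at

print(chosen.zone, rollback_window)
```

The trade-off is visible in the last step: restoring sooner from an older copy widens the rollback window, while waiting for the failed zone keeps the data but extends the downtime.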
Lost in the Tubes
Since we restored from a database backup, there is a period of missing data (from about Tuesday 22:37h PDT until about Thursday 19:20h PDT). We are working to resolve these anomalies as accurately and quickly as possible. If you were working with HootSuite during this time, please note the following:
- Any new users from this period will have their accounts recreated and payment status restored (we’ll also tidy up duplicates). However, you’ll have to re-add social networks, search streams, draft and scheduled messages, etc.
- [UPDATE] Any existing users who changed their payment status should post a ticket to the Help Desk for assistance restoring their lost changes.
- Any messages scheduled to send during the outage were not delivered, and any messages scheduled during the period mentioned above will need to be re-added.
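If you keep your own record of scheduled messages, the rules above can be sketched in a few lines of Python. This is only an illustration: the `message_status` function and its labels are hypothetical, the window boundaries are the approximate times quoted in this post (with Tuesday taken as April 19), and nothing here reflects how HootSuite stores messages internally.

```python
from datetime import datetime

# Approximate windows quoted in this post, all in PDT.
GAP_START = datetime(2011, 4, 19, 22, 37)    # start of the missing-data period
GAP_END = datetime(2011, 4, 21, 19, 20)      # end of the missing-data period
OUTAGE_START = datetime(2011, 4, 21, 1, 0)   # first EBS access issues
OUTAGE_END = datetime(2011, 4, 21, 19, 0)    # service back online


def message_status(created_at: datetime, send_at: datetime) -> str:
    """Classify a scheduled message by when it was created and when it was due."""
    if GAP_START <= created_at <= GAP_END:
        # Created during the missing-data period, so it was lost in the rollback.
        return "re-add"
    if OUTAGE_START <= send_at <= OUTAGE_END:
        # Existed before the rollback point but was due while the service was down.
        return "not delivered"
    return "unaffected"


print(message_status(datetime(2011, 4, 20, 9, 0), datetime(2011, 4, 25, 9, 0)))
print(message_status(datetime(2011, 4, 18, 9, 0), datetime(2011, 4, 21, 12, 0)))
print(message_status(datetime(2011, 4, 18, 9, 0), datetime(2011, 4, 25, 9, 0)))
```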
We appreciate your patience and invite you to post a ticket at the Help Desk if you require assistance with your account.
Keeping you Informed
Finally, this outage provided an opportunity to test our Emergency Messaging Procedure. To keep you informed, we posted updates via multiple Twitter accounts (remember we’re international) using a staging server instance of HootSuite. We also tracked progress on the HootSuite Facebook Page, Help Desk and blog.
Here are a few more articles about the outage:
- CNN – Why Amazon’s cloud Titanic went down
- Forbes – Lessons Learned From Amazon’s Cloud Outage
- Forbes – Will Amazon Outage Stop GovCloud?
- RWW – Amazon Web Services Starting to Come Back Online but Problems Persist & Questions Arise
Thanks to those who noticed our messaging strategy. We’re happy to hear your feedback about how best to keep you informed during unexpected outages.