Announcement
July 22, 2013
As most of you know, Modern Retail experienced an outage on May 7, 2013. This outage was the worst in our 15-year history. As previously reported, three Modern Retail employees immediately flew down to San Antonio to meet with Rackspace, perform a postmortem, and figure out what could be done to prevent this sort of problem in the future. I'm going to attempt to explain the problem and the steps taken to mitigate future issues. I'm also going to explain some additional improvements made to our server farm as a result of our investigation.
PROBLEM
For years, Modern Retail has operated a "Standby Database Server". This Standby Database Server automatically takes over in the event of a catastrophic hardware failure of the primary Database Server. It comes online instantaneously or, in some cases, within a few minutes when an engineer needs to intervene.
On May 7th, we had a failure of the primary database which required an engineer to get involved. Unfortunately, due to human error, the main data store became corrupted and neither database could be brought online.
All data is backed up, and this backup was used to restore the database servers. However, due to the size of the database, the restore took several hours.
SOLUTION
The corruption of the database is not something Modern Retail or Rackspace ever planned on. It is highly unlikely for this data to become corrupt and for us to be unable to fail over to the Standby Database Server. Unfortunately, machines still listen to humans, and the main data store became corrupt. To prevent this from happening again, we have upgraded our infrastructure to include a completely "Redundant Database Server and Data".
As orders, products and other transactions happen, they are written to a second Redundant Database Server in real time. No longer is it possible to corrupt the data, because there's a redundant copy of the data sitting on a completely different database. Of course, we're also making constant backups of both databases, both locally and onto tape, just as before. We believe this new structure provides the ultimate security of your website data.
OTHER INFRASTRUCTURE IMPROVEMENTS
Modern Retail and Rackspace used our time together in San Antonio to rip apart our entire infrastructure and question everything. A team of 30 people dissected every aspect of our server farm to make sure it would stand up to scrutiny. As a result of these meetings and countless follow-up meetings, we performed dozens of upgrades to our infrastructure. The most notable are:
- Addition of 50% more capacity to our server farm.
- AlertLogic Intrusion Detection System (IDS) now "talks" to the Firewall, allowing attackers to be blocked in near real time.
- Networking upgrades to provide higher bandwidth and handle more traffic.
All in all, over 1,000 man-hours were spent on these upgrades, and I'm happy to report everything is done and in place. As always, Modern Retail's goal is to provide over 99.99% uptime of your website and email, and we believe these new changes will help us reach that goal.
We know you rely on Modern Retail, and we're committed to doing all we can to ensure your online success. If you have any questions, please Submit a Request in Store Manager or call us at (800) 640-1826 option #2 for Support. We're here to help.
Thank you,
Todd Myers
Modern Retail