Jin
20-10-2014, 11:20 PM
Firstly, a massive thank you to everyone involved in getting us back up and running. At this stage we are all a bit tired and weary after the recovery efforts of the last 36 hours.
We are still investigating the root cause of the downtime. We have determined that we lost normal service at around 5am GMT on Sunday 19th October. Recovery attempts began promptly, but at first we were unable to connect to the server remotely, so an engineer had to reboot the server physically.
During this reboot some of our tables became badly corrupted, and we had several issues getting our MySQL server to start. Through perseverance, Tom resolved one error after another before loading the last available backup.
Unfortunately, our most recent backup, covering the 18th -> 19th, had been unable to complete before the errors began, so we were forced to use the 17th -> 18th backup.
However, upon loading the backup it became clear that there had been an ongoing issue with the backup script: in order to complete, it had been skipping over some critical files, in particular files essential to our storage engine. This led to the painstaking manual piecing together of our database; essentially, the data was there but was, for some reason, inaccessible to the database server.
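That kind of silent skip can be caught at backup time with a completeness check: compare the file list of the live data directory against the contents of the archive, and fail the job loudly if anything is missing. A minimal sketch (the paths, file names, and the deliberately omitted ibdata1 file are illustrative, not our actual script; the fake data directory just makes the demo self-contained):

```shell
#!/bin/sh
# Sketch: verify that a file-level backup archive actually contains every
# file in the data directory, so a skipped storage-engine file (e.g. an
# InnoDB-style ibdata1) fails the run instead of going unnoticed.
set -e

DATADIR=$(mktemp -d)          # stand-in for the MySQL data directory
BACKUP=$(mktemp -u).tar.gz    # stand-in for the backup archive

# Fake data directory: one table definition plus a critical engine file
touch "$DATADIR/users.frm" "$DATADIR/ibdata1"

# Simulate the buggy backup script: the archive misses ibdata1
tar -czf "$BACKUP" -C "$DATADIR" users.frm

# Completeness check: every file in the datadir must appear in the archive
missing=$( (cd "$DATADIR" && ls) | while read -r f; do
  tar -tzf "$BACKUP" | grep -qx "$f" || echo "$f"
done )

if [ -n "$missing" ]; then
  echo "backup INCOMPLETE, missing: $missing"
else
  echo "backup ok"
fi

rm -rf "$DATADIR" "$BACKUP"
```

Had a check like this run after each nightly backup, the skipped files would have shown up as a hard failure weeks ago rather than during a restore.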
As a result, we will be taking the following steps over the next week:
Modifications to our backup procedures.
A feasibility study of hot-swapping servers (virtual instances).
Root cause analysis.
We thank you for your patience whilst we dealt with a difficult situation, and we apologise for the outage.