MODX Cloud’s paas2.ams platform experienced a prolonged outage starting Friday, Jan 18. The following takeaways and action plan are provided so customers can understand what occurred and how we will respond to similar events in the future.
We sincerely apologize for this outage, fully understand its critical importance, and welcome your feedback via support tickets from your MODX Cloud Dashboard.
We ultimately determined that a RAID controller failure caused the outage, an unfortunate and statistically rare event. Reaching that conclusion took multiple rounds of back-and-forth with IBM Cloud over approximately 16 hours, after which we were able to initiate our disaster recovery plan.
During this outage, some customers elected to self-recover by using the Dashboard to migrate sites to other data center locations or platforms in MODX Cloud.
Once we had usable hardware back from IBM Cloud’s Data Center and Support teams, we reprovisioned the server and restored the platform from backups.
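For readers curious about the mechanics, the sketch below shows the general shape of that kind of restore: working through a set of site backups with a bounded number of restores running in parallel (the same parallel restoration noted in the 4:56 AM timeline entry below). It is illustrative only, not our actual recovery tooling; the restore_cloud.sh command, backup paths, and concurrency limit are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: restore many site backups in parallel.

The restore_cloud.sh command and the backup directory are hypothetical
placeholders; MODX Cloud's real recovery scripts are internal.
"""
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

BACKUP_DIR = Path("/backups/paas2.ams")  # hypothetical backup location
MAX_PARALLEL = 8                         # bound concurrency so the new server is not overloaded


def restore_one(archive: Path) -> tuple[str, bool]:
    """Run the (hypothetical) restore command for a single Cloud backup."""
    result = subprocess.run(["./restore_cloud.sh", str(archive)],
                            capture_output=True, text=True)
    return archive.stem, result.returncode == 0


def main() -> int:
    archives = sorted(BACKUP_DIR.glob("*.tar.gz"))
    failures = []

    # Restore several Clouds at once instead of one at a time.
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        futures = {pool.submit(restore_one, a): a for a in archives}
        for future in as_completed(futures):
            name, ok = future.result()
            print(f"{'OK  ' if ok else 'FAIL'} {name}")
            if not ok:
                failures.append(name)

    if failures:
        print(f"{len(failures)} restores failed and need manual attention: {failures}")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```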
10:37 AM - Monitoring alerted us to a site outage on p2a (paas2.ams)
10:39 AM - Tier 1 support investigation begins
11:10 AM - Escalated to Tier 2 support / MODX Cloud DevOps team
11:25 AM - Data loss identified (the partition was possibly recoverable with fsck, but confidence was low)
11:40 AM - Escalated to infrastructure provider, IBM Cloud Support
11:44 AM - MODX Cloud development team begins the disaster recovery process to rebuild Clouds on a blank server in the worst-case scenario, and requests a quote for a redundant Bare Metal Server
11:48 AM - fsck operation begins to attempt to repair bad sectors in the disk array
12:07 PM - IBM Cloud reboots the server in the middle of the fsck operation, fsck operation begins again
1:15 PM - fsck determined to be frozen
2:10 PM - IBM Cloud identifies a problem when attempting to boot into the RAID controller interface, transfers the ticket to the data center (DC) team to physically inspect the hardware
3:08 PM - IBM Cloud DC team begins inspection, physically reseating hardware and attempting a reboot
3:58 PM - DC team transfers ticket back to IBM Cloud Support
4:31 PM - IBM Cloud Support begins the investigation
6:48 PM - MODX Cloud DevOps team determines the RAID controller has failed, despite it being at under 10% of its expected life based on its MTBF, and opens a ticket with IBM Cloud Support
7:44 PM - IBM Cloud recommends replacing the drives in the RAID array, confirming complete data loss
8:49 PM - IBM Cloud transfers to DC team for drive replacement
8:58 PM - DC team begins drive replacements and HW integrity checks
11:22 PM - DC team confirms RAID card failure, replaces RAID card and initiates server rebuild
2:39 AM (Saturday, Jan 19) - OS reload completed, RAID arrays confirmed visible and optimal, transferred back to IBM Cloud Support to confirm the server configuration
3:29 AM - IBM Cloud Support confirms that the server is configured correctly
4:56 AM - MODX Cloud continues revising the disaster recovery process and scripts for parallel execution of Cloud creation and backup restoration
11:19 AM - MODX Cloud starts final tests for rebuilding data from backups
12:24 PM - Disaster recovery process initiated, Production Clouds begin to be rebuilt from backups
4:38 PM - All Production Clouds restored and online, restoration of Development Clouds begins
9:11 PM - Misconfiguration identified that caused sites to return 50x errors, server rebooted
10:39 PM - Partition mount problem identified, data sync to the new partition begins and impacted Production Clouds start coming back online (see the sketch after this timeline)
4:11 AM (Sunday, Jan 20) - Data sync to the new partition completed, all previously restored Clouds back online
4:11 AM - Disaster recovery process re-initiated for remaining Development Clouds
9:28 AM - All remaining Development Clouds restored and online
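For context on the 10:39 PM partition mount problem above, the pattern involved is checking that the intended partition is actually mounted before syncing data onto it. The sketch below illustrates that idea only; the paths and the use of rsync are assumptions for illustration, not our actual scripts.

```python
#!/usr/bin/env python3
"""Illustrative sketch only: verify the target partition is mounted before syncing data to it.

The paths and the rsync invocation are hypothetical, not MODX Cloud's actual tooling.
"""
import os
import subprocess
import sys

SOURCE = "/var/www/clouds/"   # hypothetical: data restored to the wrong location
TARGET = "/mnt/data/clouds/"  # hypothetical: the correct, newly mounted partition


def main() -> int:
    # Guard against writing onto the root disk if the new partition is not mounted.
    if not os.path.ismount("/mnt/data"):
        print("Refusing to sync: /mnt/data is not a mounted partition.", file=sys.stderr)
        return 1

    # rsync preserves permissions, ownership, and attributes, and can be re-run safely if interrupted.
    result = subprocess.run(["rsync", "-aHAX", "--delete", SOURCE, TARGET])
    return result.returncode


if __name__ == "__main__":
    sys.exit(main())
```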