Platform Outage for Amsterdam 2 (paas2.ams.modxcloud.com)
Incident Report for MODX Cloud

MODX Cloud’s paas2.ams platform experienced a prolonged outage starting Friday, Jan 18. The following takeaways and action plan are provided so customers can understand what occurred and how we will respond to similar events in the future.

We sincerely apologize for this outage, fully understand its critical importance, and welcome your feedback via support tickets from your MODX Cloud Dashboard.

Primary Issue and Actions Taken

The outage was ultimately traced to a RAID controller failure, a rare hardware fault. Reaching this determination with IBM Cloud took multiple rounds of back and forth over approximately 16 hours, after which we were able to initiate our disaster recovery plan.

During this outage, some customers elected to self-recover by using the Dashboard to migrate sites to other data center locations or platforms in MODX Cloud.

After getting the usable hardware back from IBM Cloud’s Data Center and Support teams, we reprovisioned the server and performed a recovery of the platform from backups.

Related Issues

  • All sites were rebuilt with PHP 7.1, causing 50x errors on the front end of a subset of sites. Resolved by toggling the PHP version to 7.1 and back to 5.6 in the Cloud Dashboard.
  • Customers with Let’s Encrypt certs who manually moved to alternate MODX Cloud platforms were unable to re-issue SSL certificates due to a security constraint in the Cloud API.
  • Dev Clouds in free legacy Lab accounts do not have automatic backups enabled from which to restore, requiring one-off restoration from a Snapshot if no backup was available.
  • Only files in web root are currently included in backups. See corresponding action item below.

Action Items

  • MODX Cloud will refund any domain charges for customers who manually moved sites to self-recover this week.
  • Establish a policy and agreement with IBM Cloud to improve hardware failure escalation.
  • Update MODX account profile at IBM Cloud to require a phone call to MODX support team lead assigned as the incident lead when downtime resolution requires hardware replacement. This should occur immediately upon completion of hardware maintenance.
  • Improve communication frequency at https://status.modxcloud.com during long-running actions, like hardware replacements or validations.
  • Update disaster recovery automation to take each Cloud’s PHP version into account.
  • Address Let’s Encrypt provisioning/API for improved site portability.
  • Accommodate out-of-webroot file backups, which is currently a documented caveat.
  • Accommodate crontab entries, which were not part of the backups.
  • Accelerate release of a new DR Cloud option to prevent complete site outage in the event of total hardware (or data center) failure.
  • Review age/health of all disks and RAID controllers with IBM DC teams, and proactively replace as appropriate.
  • Reach out to legacy Lab account owners to remind them about the Backup policy.
  • Reach out to all Cloud users to encourage upgrading software and Extras for PHP compatibility due to pending PHP 5.6 EOL.
  • Practice disaster recovery twice a year to validate full compatibility with all updated features and products.
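The action item on PHP versions could be sketched roughly as follows. This is a minimal illustration, not MODX Cloud's actual automation: it assumes each backup carries metadata recording the Cloud's PHP version, and that the rebuild falls back to a platform default only when that metadata is missing. The names `CloudBackup`, `build_restore_plan`, and `DEFAULT_PHP_VERSION` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical platform default. The report notes that sites were
# rebuilt with PHP 7.1 during this incident.
DEFAULT_PHP_VERSION = "7.1"


@dataclass(frozen=True)
class CloudBackup:
    """Hypothetical metadata stored alongside each backup."""
    cloud_id: str
    php_version: Optional[str]  # None when the version was never recorded


def build_restore_plan(backups: List[CloudBackup]) -> List[Tuple[str, str, bool]]:
    """Return (cloud_id, php_version, needs_review) for each backup.

    A Cloud is rebuilt with the PHP version recorded in its backup.
    Clouds without a recorded version get the default and are flagged
    so someone can check them for 50x errors after the restore.
    """
    plan = []
    for backup in backups:
        if backup.php_version is None:
            plan.append((backup.cloud_id, DEFAULT_PHP_VERSION, True))
        else:
            plan.append((backup.cloud_id, backup.php_version, False))
    return plan
```

Flagging Clouds with no recorded version, rather than silently applying the default, keeps the failure mode from this incident visible during the restore instead of after it.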

Timeline

January 18, 2018 (all times CST)

10:37 AM - Monitoring alerted us to site outage on p2a

10:39 AM - Tier 1 support investigation begins

11:10 AM - Escalated to Tier 2 support / MODX Cloud DevOps team

11:25 AM - Data loss identified (possibility existed that the partition might be recoverable with fsck, but low confidence)

11:40 AM - Escalated to infrastructure provider, IBM Cloud Support

11:44 AM - MODX Cloud development team begins the disaster recovery process to rebuild Clouds on a blank server as a worst-case scenario, and requests a quote for a redundant Bare Metal Server

11:48 AM - fsck operation begins to attempt to repair bad sectors in the disk array

12:07 PM - IBM Cloud reboots the server (in middle of fsck operation), fsck operation begins again

1:15 PM - fsck determined to be frozen

2:10 PM - IBM Cloud identifies problem attempting to boot to the RAID controller interface, transfers to data center (DC) team to physically inspect hardware

3:08 PM - IBM Cloud DC team begins inspection, physically reseating hardware and attempting a reboot

3:58 PM - DC team transfers ticket back to IBM Cloud Support

4:31 PM - IBM Cloud Support begins the investigation

6:48 PM - MODX Cloud DevOps team determines that the RAID controller has failed, despite it being under 10% of its expected life based on its MTBF, and opens a ticket with IBM Cloud Support

7:44 PM - IBM Cloud recommends replacement of drives in RAID array with confirmation of complete data loss

8:49 PM - IBM Cloud transfers to DC team for drive replacement

8:58 PM - DC team begins drive replacements and HW integrity checks

11:22 PM - DC team confirms RAID card failure, replaces RAID card and initiates server rebuild

January 19, 2018

2:39 AM - OS reload completed, RAID arrays confirmed visible and optimal, transferred back to IBM Cloud Support to confirm the server configuration

3:29 AM - IBM Cloud support confirms that server is configured correctly

4:56 AM - MODX Cloud continues revising disaster recovery process/scripts for parallel execution of Cloud creation and backup restoration

11:19 AM - MODX Cloud starts final tests for rebuilding data from backups

12:24 PM - Disaster recovery process initiated, Production Clouds begin to be rebuilt from backups

4:38 PM - All Production Clouds restored and online, restoring Development Clouds begins

9:11 PM - Misconfiguration identified that caused 50x errors on sites, server rebooted

10:39 PM - Partition mount problem identified, data sync operation to the new partition begins, impacted Production Clouds start coming online again

January 20, 2018

4:11 AM - Data sync completed on the new partition, all previously restored Clouds back online

4:11 AM - Disaster recovery process re-initiated for remaining Development Clouds

9:28 AM - All remaining Development Clouds restored and online
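The parallel Cloud creation and backup restoration mentioned in the 4:56 AM entry on January 19, combined with the production-first ordering seen in the timeline, might look something like the sketch below. It is an assumption-laden illustration rather than the scripts actually used: `restore_cloud` is a hypothetical placeholder for the real create-and-restore step, and the `kind` field and worker count are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def restore_cloud(cloud):
    """Hypothetical placeholder for creating a Cloud and restoring its backup."""
    return f"restored {cloud['id']}"


def restore_in_phases(clouds, restore=restore_cloud, max_workers=4):
    """Restore Production Clouds in parallel first, then Development Clouds.

    Each phase finishes completely before the next one starts, so
    production sites are never queued behind development ones.
    """
    results = []
    for phase in ("production", "development"):
        batch = [c for c in clouds if c["kind"] == phase]
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # pool.map yields results in input order, even though the
            # individual restores run concurrently.
            results.extend(pool.map(restore, batch))
    return results
```

Threads are a reasonable fit here on the assumption that each restore is dominated by network and disk I/O; a CPU-bound restore step would call for a different executor.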

Posted 27 days ago. Jan 24, 2018 - 14:02 CST

Resolved
All Clouds should now be restored to their last backups. Please open a ticket from the MODX Cloud Dashboard if you require assistance with anything related to the paas2.ams platform. We will release a postmortem on this incident early next week and be reaching out to all customers with instances on the paas2.ams platform.
Posted about 1 month ago. Jan 20, 2018 - 18:22 CST
Update
At this time, production sites on this platform should be back online. If they are not, please contact support. We will continue restoring development clouds next.
Posted about 1 month ago. Jan 20, 2018 - 04:23 CST
Update
At this point, partition copy is at 95%.
Posted about 1 month ago. Jan 20, 2018 - 03:52 CST
Update
At this time, the transfer is about 90% complete.
Posted about 1 month ago. Jan 20, 2018 - 03:38 CST
Update
We apologize for the lack of updates. Due to a misconfiguration, we had to take the services offline to migrate from one partition to another. At this time, the migration is about 85% complete.
Posted about 1 month ago. Jan 20, 2018 - 03:16 CST
Update
All production Clouds should now be restored; non-production Clouds without domains assigned are now being restored. We expect this to continue for the next ~18 hours. If you have any questions about sites that are not working as expected, please let us know by opening a support ticket in the MODX Cloud Dashboard. We will continue monitoring progress through the weekend.
Posted about 1 month ago. Jan 19, 2018 - 18:29 CST
Monitoring
Production sites with domains are being restored from the most recent backups to paas2.ams. We expect this to continue for the next few hours, after which sites without domains will start restoring. Please open a ticket from the Dashboard if you have any questions or require assistance.
Posted about 1 month ago. Jan 19, 2018 - 13:49 CST
Update
Over the last 8 hours, IBM Cloud data center engineers performed multiple rounds of hardware replacement including all drives and a RAID controller (which is supposed to handle redundant local storage). Due to the severity of the failures, our team had to reinstall the operating system and configure it as new.

The server is now back online and we have begun the process of restoring all customer sites from backups.

Sites with custom domains will be restored first, followed by development Clouds (with no custom domains activated).

We sincerely apologize for the unusual nature of the downtime. If you have any questions, please use the Help button from the lower right of the MODX Cloud dashboard to ask for assistance.
Posted about 1 month ago. Jan 19, 2018 - 10:49 CST
Update
We have worked with data center technicians overnight to replace all affected hardware including all disks and the RAID controller. We are preparing the platform for recovering sites and site functionality.

At this time we do not have an estimated time of completion.

To get your site back online quickly, we recommend you create a new Cloud at another location such as London, Frankfurt or Amsterdam 1, and restore your most recent backup or Snapshot into the newly created Cloud. Then add any custom domains and point your A Record to direct traffic to the new location. Once that is done you can install a free SSL certificate if needed and copy over your original web rules.
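After pointing the A Record at the new location, it can help to confirm that the change has taken effect before installing the SSL certificate. The following is a minimal sketch using Python's standard `socket` module; the function name `a_record_points_to` is hypothetical, and the domain and IP address in the usage note are documentation placeholders, not real MODX Cloud values.

```python
import socket


def a_record_points_to(domain, expected_ip, resolve=socket.gethostbyname_ex):
    """Return True if one of the IPv4 addresses for `domain` is `expected_ip`.

    `resolve` is injectable so the check can be exercised without
    network access; by default it performs a real lookup through the
    system resolver.
    """
    try:
        _hostname, _aliases, addresses = resolve(domain)
    except OSError:
        # Covers socket.gaierror and socket.herror (lookup failures).
        return False
    return expected_ip in addresses
```

For example, `a_record_points_to("www.example.com", "203.0.113.10")` returns True once the system resolver sees the new record. Cached DNS answers may delay this until the old record's TTL expires.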

We sincerely apologize for the extended outage and will continue to update this incident as we have more information.
Posted about 1 month ago. Jan 19, 2018 - 05:39 CST
Update
Datacenter technicians are on site reviewing hardware functionality, including the RAID controller, to confirm operation or issue a replacement.
Posted about 1 month ago. Jan 18, 2018 - 16:10 CST
Identified
We continue to run disk recovery operations with IBM. Due to the nature of this process, we do not have an ETA at this point.

To get your site back online quickly, we recommend you create a new Cloud at another location, and restore your most recent backup or Snapshot into the newly created Cloud. Then add any custom domains and point your A Record to direct traffic to the new location. Once that is done you can install a free SSL certificate if needed and copy over your original web rules.

We sincerely apologize for this outage and will continue to update this incident as we have more information.
Posted about 1 month ago. Jan 18, 2018 - 14:24 CST
Update
We're still working with IBM Cloud engineers to restore service to the Amsterdam 2 Platform. At this time we expect the outage to continue for some time as we are working to confirm the integrity of hardware and data and identify the cause.
Posted about 1 month ago. Jan 18, 2018 - 12:47 CST
Update
We are working with our infrastructure partner, IBM Cloud, to restore service to our Amsterdam 2 Platform as soon as possible.
Posted about 1 month ago. Jan 18, 2018 - 11:51 CST
Update
We continue to work to bring the Amsterdam 2 Platform back online.
Posted about 1 month ago. Jan 18, 2018 - 11:31 CST
Investigating
We're currently investigating the cause of an outage on our Amsterdam 2 platform that affects sites whose Cloud URL contains paas2.ams.modxcloud.com.

We're working with our upstream partner, IBM Cloud, to identify the source of the issue and recover normal operations.
Posted about 1 month ago. Jan 18, 2018 - 11:04 CST