Issues with Amsterdam 1 server (paas1.ams)
Incident Report for MODX Cloud

Root cause

  • RAID controller hardware failed, affecting multiple arrays, segments and drives

Steps we took to identify the issue

  • 8:48 AM - our monitoring started alerting us to unresponsive sites
  • Mike (MODX Cloud developer) noticed disk I/O errors and slowdowns while investigating
  • Escalated to Elizabeth (MODX Cloud ops) who determined RAID issues
  • Escalated to IBM/SoftLayer (server and networking provider) support
  • 10:28 AM - an IBM/SoftLayer data center server build technician performed a physical diagnosis of the hardware

Who and what services were affected

  • Customers with Clouds on the paas1.ams server were affected
  • Websites were intermittently available for some time, then completely offline while the hardware problem was being resolved

How did it get fixed

  • The RAID controller hardware and cables were replaced
  • Multiple arrays, segments and drives were automatically rebuilt by the RAID controller. This was done offline to reduce stress on the controller and decrease the likelihood of data loss. This operation took the most time (about 5.5 hours).
  • A final segment of one array was rebuilt with the server online

What did we learn, how will that affect our process, product and strategy

  • Hardware failures do happen, even very, very unlikely ones.
  • In this type of failure, we are very susceptible to unacceptably long outages, particularly on our very largest and most powerful servers.
  • We have already proactively started building our servers smaller, such as the ones in the Frankfurt, London and Sydney data centers. This both minimizes the number of customers impacted by this kind of event, and makes RAID rebuilding operations take less time if they do occur. This event has changed how we will be planning and prioritizing changes in our Amsterdam and Texas data centers. (For example, these larger servers may ultimately be decommissioned sooner as we place customers onto a higher number of smaller servers.)
  • While we do have a process for moving a site from one server or datacenter to another, it involves many steps, and may not be easy to accomplish quickly, particularly during an emergency. There are a few things we will consider improving here, from documentation, to tools that we use ourselves to accomplish administrative tasks, to putting additional automation tools in the MODX Cloud dashboard to assist in moving sites around more easily.
  • Though we strive to be transparent, honest and truthful in representing the multi-tenant, single-server nature of the MODX Cloud product, we want to do better. We will be reviewing our marketing copy and looking at our customer interactions particularly during sign-up and onboarding. We do not want to give customers a false impression that they are buying a fully redundant, automatic failover type of cloud hosting.
  • While we do offer solutions for failover and high availability on the MODX Cloud infrastructure today, they are typically created for large sites. These are more costly and complex solutions that do not fit in the budget or scope of most customers. We are accelerating the research and development (already underway) of a more affordable solution, or range of solutions designed to keep sites online should a server go offline due to hardware failure or network interruption.
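
The point above about building smaller servers can be illustrated with some rough arithmetic. The following is a minimal sketch, not a description of our actual hardware: the customer counts, array capacities and rebuild rate are hypothetical values chosen only for illustration, and it assumes a constant rebuild rate, which real RAID controllers do not guarantee.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    customers: int            # hypothetical number of customers on the server
    array_capacity_tb: float  # hypothetical RAID capacity that must be rebuilt


def rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Estimate rebuild time in hours, assuming a constant rebuild rate."""
    capacity_mb = capacity_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return capacity_mb / rebuild_mb_per_s / 3600


def summarize(server: Server, rebuild_mb_per_s: float) -> dict:
    """Report how many customers an outage hits and how long a rebuild might take."""
    return {
        "name": server.name,
        "customers_affected": server.customers,
        "rebuild_hours": round(
            rebuild_hours(server.array_capacity_tb, rebuild_mb_per_s), 2
        ),
    }


# All figures below are made up for illustration.
large = Server("large-example", customers=400, array_capacity_tb=4.0)
small = Server("small-example", customers=100, array_capacity_tb=1.0)
ASSUMED_REBUILD_MB_PER_S = 250.0

for server in (large, small):
    print(summarize(server, ASSUMED_REBUILD_MB_PER_S))
```

Under these assumptions, a server with a quarter of the capacity and a quarter of the customers rebuilds in roughly a quarter of the time and affects a quarter as many customers in a single hardware failure.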
Posted about 2 months ago. Jul 27, 2017 - 22:51 CDT

Resolved
No further issues have surfaced while monitoring the paas1.ams server since it was put back into production at 4:43 PM U.S. Central time.
Posted about 2 months ago. Jul 26, 2017 - 22:35 CDT
Monitoring
The paas1.ams server is back online, serving websites.

Details: The new RAID controller on this server has completed rebuilding the inconsistent array. We have done additional checks, particularly for consistency across all database tables. There has been no data loss.

We will continue to monitor the server closely, and will follow up with a postmortem report on today's outage.
Posted about 2 months ago. Jul 26, 2017 - 16:51 CDT
Update
The RAID arrays are now back online. We're running checks on our data and environments and hope to have the platform back online soon.

We apologize, as prolonged downtime events such as this are obviously distressing and inconvenient for you and for your clients and users. We take uptime very seriously and we will do better in the future.
Posted about 2 months ago. Jul 26, 2017 - 16:29 CDT
Update
Our current ETA for the completion of rebuilding/verifying the RAID disk array is 2-4 hours. There is a possibility of further downtime once that rebuild is done, if server build technicians determine that a physical disk needs to be replaced.

In many cases, moving your production site(s) to another server may be the faster option for getting back online. Please contact support, and we will do our best to assist. We also have a guide for doing this here: https://support.modx.com/hc/en-us/articles/220259027

We apologize, as prolonged downtime events such as this are obviously distressing and inconvenient for you and for your clients and users. We take uptime very seriously and will be learning from this incident so that we will do better in the future.
Posted about 2 months ago. Jul 26, 2017 - 12:45 CDT
Update
We continue to work with IBM/SoftLayer's data center team to resolve this issue. The data center team has placed the server back in the rack, the operating system is up, and the disks are rebuilding/restoring.
Posted about 2 months ago. Jul 26, 2017 - 11:27 CDT
Update
We continue to work with IBM/SoftLayer's data center team to resolve this issue. The data center team has replaced the RAID card and cables, and the disk array is rebuilding.
Posted about 2 months ago. Jul 26, 2017 - 10:58 CDT
Update
The issue has been identified as a hard drive failure, and IBM/SoftLayer's data center team is working to resolve it.
Posted about 2 months ago. Jul 26, 2017 - 09:30 CDT
Identified
We've traced the cause to a failing hard drive in the RAID array. We've contacted our upstream provider to resolve the issue ASAP.
Posted about 2 months ago. Jul 26, 2017 - 09:17 CDT
Investigating
We are investigating I/O issues on the paas1.ams server. This is affecting the uptime of websites hosted on the platform, so identifying and resolving it is receiving our urgent attention.
Posted about 2 months ago. Jul 26, 2017 - 09:03 CDT