Texas 2 Platform Intermittent Outages
Between December 31, 2021 and January 3, 2022, some customers with sites hosted on our Texas 2 platform experienced intermittent availability, with sites returning 502 and 504 errors. Some of these episodes persisted for hours at a time.
While we monitor all platforms and a variety of individual websites, our monitoring showed no metrics in a severe condition, although server load and database IO were higher than previous levels.
We also noticed some customer sites experiencing intermittent downtime during their own backups. Our Engineering Team was unable to identify any specific site or sites causing the issues. Our best estimate was therefore that the cause was a combination of the platform's age and older configuration and individual site performance issues, which are fairly common.
As such, we developed a plan and started the process to migrate these sites to one of our new Gen2 platforms to mitigate the individual site issues. The new Gen2 platforms offer significantly improved hardware, an updated OS, higher IO, and the latest software stacks including PHP 8.
As our Support Team worked with customers to migrate impacted websites to Gen2 platforms, our Engineering Team looked for ways to optimize the configuration of the server to mitigate the issues and did find some improvements.
On January 4th, as the number of sites on paas2.tx was reduced, our Engineering Team was eventually able to detect and isolate sites with unusual activity, narrowing it down to an individual site that had a configuration not suited for multi-tenant environments. Once we removed the problem site, we immediately saw platform operations improve to well within previous norms. We are confident that this has resolved the issues affecting customer sites.
Individual Clouds do have constraints on certain types of resources so that they do not interfere with other Clouds. However, some database activity, under the wrong conditions, can negatively impact the performance of the entire platform, as was the case here.
To prevent this from happening in the future, we are:
* Revising the Terms of Service and Acceptable Use Policy to prohibit certain technologies on MODX Cloud.
* Implementing newer monitoring to catch specific sites that negatively impact neighboring sites on a platform.
* Continuing work on decommissioning older Gen1 platforms in favor of our new Gen2 platforms, available in various locations such as NA-Central, NA-East, and EU.
We are currently working on Dashboard tools that will let customers self-migrate sites to the new platforms (or any other platform in MODX Cloud). When we begin the official migration campaign, customers will use these tools to move to the new platforms. If you would like help migrating your site now or have any specific questions about this report or the outage, please contact [email@example.com](mailto:firstname.lastname@example.org). We do not anticipate any further issues on this platform.