On 28-04-2020 from ~02:00 AM - 02:30 AM CET Checks failed to schedule due to downstream problems. Particularly the AWS SNS in the us-west-1 region seems to have been the issue. Checks were retried but still failed. This reports the check as a failure for that region.
Around 200 checks failed to run and reported an error. This was across all customers. The error message was similar to
503: null at Request.extractError (/var/task/node_modules/aws-sdk/lib/protocol/query.js:55:29) at Request.callListeners
The root cause seems to be very slow responses and eventual 503 errors returning from calls to AWS SNS.
No specific trigger.
The issue resolved itself.
Our logging reported elevated errors in the alerting daemon because it was missing specific check data. We did not detect the error directly.
What Are We Doing About This?
- Handle infrastructure errors differently from "user" errors, e.g. scripts with programming mistakes.
- Start alerting on elevated error rates for this scheduling logic to alert us on similar issues.
- 02:04 AM - cron daemon starts retrying and and failing to schedule checks
- 02:24 AM - cron daemon reverts to normal behaviour. Three customers notice strange behaviour and notify us through support channels.
- 05:58 AM - First employee in CET timezone sees emails and manually triggers alert.
- 06:05 AM - Primary does not acknowledge due to missing phone/voice alerting Secondary is paged.
- 06:30 AM - Analysis shows the issue has solved itself.
- 08:12 AM - Customers are notified by a retro active incident report.