Post mortem: failing checks us-west-1

On 28-04-2020 from ~02:00 AM - 02:30 AM CET Checks failed to schedule due to downstream problems. Particularly the AWS SNS in the us-west-1 region seems to have been the issue. Checks were retried but still failed. This reports the check as a failure for that region.

Impact
Around 200 checks failed to run and reported an error. This was across all customers. The error message was similar to

503: null
at Request.extractError (/var/task/node_modules/aws-sdk/lib/protocol/query.js:55:29)
at Request.callListeners

Root Causes
The root cause seems to be very slow responses and eventual 503 errors returning from calls to AWS SNS.

Trigger
No specific trigger.

Resolution
The issue resolved itself.

Detection
Our logging reported elevated errors in the alerting daemon because it was missing specific check data. We did not detect the error directly.

What Are We Doing About This?

Handle infrastructure errors differently from "user" errors, e.g. scripts with programming mistakes.
Start alerting on elevated error rates for this scheduling logic to alert us on similar issues.

Timeline

28-04-2020

02:04 AM - cron daemon starts retrying and and failing to schedule checks
02:24 AM - cron daemon reverts to normal behaviour. Three customers notice strange behaviour and notify us through support channels.
05:58 AM - First employee in CET timezone sees emails and manually triggers alert.
06:05 AM - Primary does not acknowledge due to missing phone/voice alerting Secondary is paged.
06:30 AM - Analysis shows the issue has solved itself.
08:12 AM - Customers are notified by a retro active incident report.