Outage browser check results & alerting

Table of contents

What Are We Doing About This?
Lessons Learned
Timeline (UTC)

Monday 18 May we had an outage in processing browser check results between 15:44 PM UTC and 20:38 PM UTC. This was caused by a bug in our release and deployment software. API checks were not impacted.

Impact

For around 5 hours, we did not record any results for our browser checks. This means we also failed to alert for all failed browser checks (~59 checks across ~31 customers).

Root Causes

We recently made a changes to our deployment pipeline to avoid accidental "canary flags" which we use to run new code in production without exposing it to end users. This change was accompanied by another code change which wasn't included in the original pull request. We shipped a part of the solution, but not the full solution.

We've fixed the code that checks the canary mode. We also already have an open PR with more checks in the deployment process itself to avoid these issues.
No critical deploys after 12:30 (CET): We decided to not do any deployments after lunch hours in Berlin. This won't be a permanent measure but will be in place for the foreseeable future. This will help us find possible failures before since there'll be people on our team using many parts of Checkly & Intercom is monitored.
Add more Checkly checks: We'll create API checks for the check results for our own checks that checks the time of the most recent results.
Add specific CloudWatch alarms: We are adding anomaly detection on AWS for this specific pipeline in our infrastructure. A dip in results being published to our pipeline will now trigger an alert.

Lessons Learned

What went well?

Our new deployment pipeline is much faster, so deploying the hot fix was relatively quick.
Person on call was also the one who made the change so the fix was ready relatively fast after it was noticed.

What went wrong?

We had no monitoring/alerting for the case of browser check runners not publishing results
Because the browser check results are a relatively small part of the total results our other monitoring/alarm setups were not triggered either.

Where we got lucky?

Team member noticed a customer chat on Intercom.

Timeline (UTC)

15:44 - Deployment was triggered that included the faulty code
20:32 - Alarm was triggered
20:38 - Hotfix deployment started
20:42 - Issue resolved

Post mortem: outage browser check results & alerting

Impact

Root Causes

Trigger

Resolution

Detection

What Are We Doing About This?