Post mortem: browser checks failing to report results

A bug in our code and a gap in our error handling and alerting caused a very small number of browser checks (to our knowledge just one) to stop reporting results to our processing backend.

Although the impact seems limited, we did a full investigation and are sharing this post mortem.

Impact

As far as we can tell, one customer was impacted, for a single check in their account. We did not receive any other reports, nor could we find any other impacted checks.

Root Causes

An error condition in a Playwright script generated extremely large error logs, which overflowed the error object we normally pass back for further processing. Specifically, the payload exceeded the 256 KB message size limit of the AWS SQS queue we use.

This means the customer sees no results in their charts and logs and is also not alerted. The data is missing and effectively lost forever.
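
To make this concrete, below is a minimal sketch of how a Node.js runner might publish a check result to SQS with the AWS SDK. The client configuration, queue URL and result shape are illustrative assumptions, not our actual runner code.

    // Illustrative sketch only; not our actual runner code.
    import { SQSClient, SendMessageCommand } from '@aws-sdk/client-sqs'

    const sqs = new SQSClient({ region: 'eu-west-1' }) // region is an assumption

    // SQS rejects any message body larger than 256 KB (262,144 bytes).
    async function publishCheckResult (queueUrl: string, result: object) {
      const body = JSON.stringify(result)
      // If the result contains a runaway error log, the body can exceed the
      // SQS limit, the send fails and the check result never reaches the
      // processing backend.
      await sqs.send(new SendMessageCommand({ QueueUrl: queueUrl, MessageBody: body }))
    }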

Trigger

We managed to reproduce this behaviour and are 99.999% certain it was caused by "out of control" error generation. This is harmless most of the time. However, we did not truncate this error correctly, as we do for other payloads we handle. This failure mode had always been present in our code and infrastructure.
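
As a purely hypothetical illustration (this is not the customer's script), a user-submitted Playwright script can inflate a thrown error by embedding large amounts of page content in it:

    // Hypothetical example of "out of control" error generation.
    import { chromium } from 'playwright'

    async function run () {
      const browser = await chromium.launch()
      const page = await browser.newPage()
      await page.goto('https://example.com')
      const html = await page.content()
      await browser.close()
      // Embedding a full page dump (possibly repeated, e.g. in a retry loop)
      // in the thrown error inflates the error object far beyond what a
      // single queue message can carry.
      throw new Error(`element not found, page was:\n${html.repeat(1000)}`)
    }

    run().catch((err) => console.log(`error message is ${String(err).length} characters`))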

Resolution

Once we discovered the root cause and were able to reproduce the issue, we designed a quick fix to truncate the responsible payload.

Detection

We had no effective detection mechanism or alerting in place for this situation. The customer informed us of the unexpected behaviour, i.e. missing check result data in their dashboard for a specific check.

What Are We Doing About This?

  • We've added the appropriate truncation so we no longer hit any size limits (see the sketch after this list).
  • We have done, and are still doing, some refactoring to make it easier for our systems to separate "user errors" from "platform errors". User errors here are simply normal errors that can happen in any user-submitted code.
  • We've fixed our Sentry logging on our runner infrastructure to correctly report these types of issues.
  • We've updated our own alerting to trigger notifications based on these errors.
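
The sketch below shows the kind of truncation guard we mean. The field names, byte limits and helper names are assumptions for illustration, not our exact implementation.

    // Hypothetical truncation guard; names and limits are assumptions.
    const SQS_MAX_MESSAGE_BYTES = 256 * 1024
    const MAX_ERROR_BYTES = 32 * 1024 // keep the error portion well below the queue limit

    function truncateError (message: string, maxBytes = MAX_ERROR_BYTES): string {
      const buf = Buffer.from(message, 'utf8')
      if (buf.byteLength <= maxBytes) return message
      // Byte-level truncation may split a multibyte character; acceptable for logs.
      return buf.subarray(0, maxBytes).toString('utf8') + ' [truncated]'
    }

    function buildResultPayload (result: { error?: string }): { error?: string } {
      const payload = { ...result, error: result.error && truncateError(result.error) }
      const size = Buffer.byteLength(JSON.stringify(payload), 'utf8')
      if (size > SQS_MAX_MESSAGE_BYTES) {
        // Last-resort guard: surface the problem instead of silently dropping the result.
        throw new Error(`payload still too large after truncation: ${size} bytes`)
      }
      return payload
    }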

Lessons Learned

What went well?

We immediately dropped active work and began investigating the root cause. Once we discovered the root cause and were able to reproduce the issue, we quickly designed both a fix and additional logging and alerting measures.

What went wrong?

We did not have effective monitoring in place to alert us of this issue. We had to be informed of it by the customer.

Where did we get lucky?

We got lucky in that it was a very specific issue which affected only one check for one customer.

Timeline (CET)

22-03-2021

08:00 - Customer informs us about missed alerts over the weekend.

08:10 - Start root cause analysis and diagnosing the issue.

11:30 - Sync call with the full engineering team.

16:00 - Successfully reproduced the issue.

23-03-2021

11:00 - Enabled extra Sentry logging.

16:00 - Deployed the error message truncation fix, resolving the issue.

26-03-2021

Full day - Continued work on optimising our handling, logging and alerting on these and related errors.
