Post mortem: checks with async IIFE reporting success incorrectly

Between 05.06.2020 and 12.06.2020, checks using the async IIFE syntax had runs marked as passed even though they had not been executed correctly.

Impact

We identified 18 active checks that were affected.

Root Causes

Changes related to additional security measures for Browser checks changed the default behavior of the runner. These changes affected how we handle promises that are neither awaited nor returned, which is the kind of promise an async IIFE produces when its result is left unhandled.
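To illustrate the pattern in question, here is a minimal sketch of a check script that uses an async IIFE. The names runCheckScript, slowStep and events are hypothetical and only serve to record ordering; this is not the actual runner code.

```javascript
// Records the order in which things happen (illustrative only).
const events = [];

// Hypothetical stand-in for a slow browser action.
const slowStep = () => new Promise((resolve) => setTimeout(resolve, 50));

function runCheckScript() {
  // Async IIFE: the promise it returns is neither awaited nor returned.
  (async () => {
    await slowStep();
    events.push('assertions ran');
  })();

  // Reached while the async body is still waiting on slowStep().
  events.push('execution block finished');
}

runCheckScript();

// Only the synchronous part has completed at this point.
console.log(events);
```

The execution block finishes before the body of the IIFE does, so a runner that treats the end of the block as the end of the check never sees the remaining steps.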

Resolution

Instead of exiting the process as soon as the execution block finishes, we now let the Node.js process exit on its own once all pending promises have been executed.
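The difference between the two behaviors can be sketched by running the same script body in a child process twice, once with an explicit process.exit and once without. This is an illustration under assumptions, not our runner: checkBody and runInChild are hypothetical names, and the 50 ms timer stands in for real browser work.

```javascript
const { spawnSync } = require('child_process');

// Illustrative check body: an async IIFE that logs once its slow step is done.
const checkBody = `
  (async () => {
    await new Promise((resolve) => setTimeout(resolve, 50));
    console.log('assertions ran');
  })();
`;

// Runs a script in a fresh Node.js process and returns its exit status and stdout.
function runInChild(script) {
  const result = spawnSync(process.execPath, ['-e', script], { encoding: 'utf8' });
  return { status: result.status, stdout: result.stdout.trim() };
}

// Faulty behavior: exit as soon as the synchronous execution block is finished.
const exitedEarly = runInChild(checkBody + 'process.exit(0);');

// Fixed behavior: no explicit exit; Node.js exits once no work is pending.
const exitedNaturally = runInChild(checkBody);

console.log(exitedEarly);
console.log(exitedNaturally);
```

In the first run the process reports a clean exit without ever reaching the log line, which matches the symptom the customer described. In the second run the log line is printed before the process ends.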

Detection

A customer contacted us after their checks failed to detect an outage they were experiencing.

What Are We Doing About This?

  • We pushed the fix on Friday, the same day the issue was reported.
  • We added fixtures with async IIFE syntax to our test suite and added checks using that syntax to our staging and production suites.
  • We set up paging for these checks in case a regression slips past our unit tests.

Timeline

05.06.2020

  • 12:00 Security changes were rolled out to half of the regions

08.06.2020

  • 11:00 Security changes were rolled out to all regions

12.06.2020

  • 15:03 A customer informed us that their tests were passing without printing certain logs
  • 15:13 We found the root cause of the issue and offered the customer a workaround
  • 15:30 We implemented a quick fix for the issue
  • 18:30 After talking to the customer, we decided that the issue had to be resolved ASAP
  • 19:00 We started testing the fix
  • 19:20 We pushed the fix to production and started observing the stats
  • 20:00 We declared the incident resolved