
Learning from Outages: How CrowdStrike Can Improve Its Testing

TESTING, QA, TDD

In view of the recent global IT outage caused by a CrowdStrike bug, the company has pledged to improve its testing process; see the BBC article "CrowdStrike to improve testing after 'bug' caused outage".

I was wondering how the incident could happen at a company delivering such mission-critical software, where you would expect rigorous checks rather than the usual "LGTM" PR comment. I am only speculating, of course, but I do know from experience that many teams take testing lightly, often treating it as an afterthought.

There are different phases of a project where varying levels of testing can be justified. For example, in a recent Proof of Concept (PoC) project, we opted for minimal testing to deliver quickly and validate the project's viability. Once we received approval, we revisited our pipelines and significantly improved all aspects of testing: unit testing, integration testing, overall coverage, accessibility, security, end-to-end testing, and more.

In reality, when you move into the "stable" phase of a project, your team should approach testing scenarios much like acceptance criteria are written for user stories. If you are a developer, imagine being a QA or tester for a day and having to test your colleague's ticket; you may well find yourself stuck. To avoid that situation, teams can simply ask themselves up front: "How will we test this functionality?"
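
To make that question concrete, here is a minimal sketch of turning an acceptance criterion into an automated check. The discount feature, the `calculate_discount` function and its thresholds are all invented for illustration, not taken from any real codebase.

```python
import pytest


def calculate_discount(order_total: float, is_returning: bool) -> float:
    """Stand-in implementation so the example runs on its own; in a real
    project this would live in the production code, not in the test file."""
    if is_returning and order_total > 100.0:
        return order_total * 0.10
    return 0.0


# Acceptance criterion (hypothetical): "Given a returning customer with an
# order over 100, when the total is calculated, then a 10% discount applies."
@pytest.mark.parametrize(
    "order_total, is_returning, expected_discount",
    [
        (150.0, True, 15.0),   # over the threshold, returning customer -> 10%
        (150.0, False, 0.0),   # over the threshold, but a new customer
        (80.0, True, 0.0),     # returning customer, but under the threshold
    ],
)
def test_loyalty_discount(order_total, is_returning, expected_discount):
    # Writing the scenarios first forces thresholds and edge cases to be
    # spelled out, just like acceptance criteria on a user story.
    assert calculate_discount(order_total, is_returning) == pytest.approx(expected_discount)
```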

For this reason, teams should design their systems to be testable. This can be achieved by using mocks and stubs to recreate the scenarios that need to be tested, even before a single line of production code is written. I don't mean TDD, which, while useful, is narrowly focused on unit testing and refactoring. I prefer a more holistic approach that begins during the Analysis and Design phase, not the Development phase.
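
As a minimal sketch of what "recreate the scenario before the code exists" can look like, the example below fakes an external dependency with Python's unittest.mock so that failure paths can be specified up front. The update client, the `apply_update` function and its return values are hypothetical, chosen only to illustrate the technique.

```python
from unittest.mock import Mock


def apply_update(client) -> str:
    """Hypothetical unit under design: apply a content update, but degrade
    gracefully instead of crashing when the download fails or is malformed."""
    try:
        payload = client.fetch_latest()
    except ConnectionError:
        return "kept-previous-version"
    if not payload:
        return "kept-previous-version"
    return "updated"


def test_download_failure_keeps_previous_version():
    # Stub the dependency to recreate the failure scenario before the real
    # client (or even the real service) exists.
    client = Mock()
    client.fetch_latest.side_effect = ConnectionError("registry unreachable")

    assert apply_update(client) == "kept-previous-version"


def test_empty_payload_is_rejected():
    client = Mock()
    client.fetch_latest.return_value = b""  # simulate a malformed/empty content file

    assert apply_update(client) == "kept-previous-version"
```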

Perhaps CrowdStrike glossed over some testing aspects, as many teams do. I often hear, "It's too difficult to test" or "It's a pain" simply because the design of the feature does not allow for simple testing or because teams didn't invest enough time in building the capability to test certain components. Some examples? Sure, I hear the following excuses all the time:

  • "We cannot test this change locally."

  • "We have no end-to-end tests because the team didn't have the expertise at the time, and now it is too big a commitment."

  • "The VPN makes it difficult to test in the pipeline."

  • "We can only test in X environment because we don't have test data available anywhere else."

And so on. In reality there is probably a solution to every one of these, but each requires effort that would have been far easier to factor in at the beginning of the project.
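
As one small illustration of the last excuse, a test-data fixture can remove the dependency on a shared environment altogether. This is only a sketch, assuming an application that reads customers from a SQL database; the table, the data and the `count_customers` helper are invented for the example.

```python
import sqlite3

import pytest


@pytest.fixture
def seeded_db():
    """Spin up an in-memory database with known test data, so the test does
    not depend on data that only exists in one shared environment."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.commit()
    yield conn
    conn.close()


def count_customers(conn) -> int:
    # Hypothetical query under test, kept inline so the example is self-contained.
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]


def test_customer_count(seeded_db):
    assert count_customers(seeded_db) == 2
```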

As far as CrowdStrike goes, I can only imagine that some of these issues also apply. I may be wrong, but it is not hard to attribute the cause of the problem to a failure to test the software appropriately.

Potential Testing-Related Issues:

Next, let's look at some hypothetical scenarios in which testing-related issues may have contributed to the global IT outage.

  • Environment Configuration Discrepancies

    • Scenario: Differences between testing environments and production environments could lead to undetected issues that only manifest under real-world conditions.

    • Lesson: Ensuring that test environments closely mirror production environments helps identify configuration-related issues early; a small parity-check sketch follows this list.

  • Integration Testing Gaps

    • Scenario: Potential issues in the integration between different components or services might have been missed.

    • Lesson: Comprehensive integration testing is essential to ensure that all components work seamlessly together, especially in complex systems involving multiple services and dependencies.

  • Inadequate Failover and Redundancy Testing

    • Scenario: Failover mechanisms and redundancy plans might not have been thoroughly tested or might have failed under real-world conditions.

    • Lesson: Regularly testing failover procedures and redundancy mechanisms ensures that the system can recover gracefully from component failures. This typically also comes down to a lack of coverage, where only the happy-path scenarios are tested.

  • Deployment and Rollback Issues

    • Scenario: Failure in rollback mechanisms, or lack thereof, might have contributed to the outage.

    • Lesson: Ensuring that rollback mechanisms are well tested can help mitigate issues arising from new releases (or Windows updates, in this case); a small rollback sketch also follows this list.

  • Monitoring and Alerting Gaps

    • Scenario: There might have been delays in detecting and responding to the outage due to inadequate monitoring and alerting.

    • Lesson: Implementing comprehensive monitoring and alerting systems helps in early detection and quicker resolution of issues, minimising downtime.

  • Neglecting Non-functional Testing

    • Scenario: Issues related to performance, scalability, or security might have been overlooked during the testing phase.

    • Lesson: Non-functional testing, including performance, scalability, and security testing, is as important as functional testing to ensure overall system robustness.
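
To make the environment-parity point concrete, here is a minimal sketch of the kind of check mentioned in the first lesson above. It assumes configuration kept in per-environment JSON files; the file names and keys are invented for the example, and only the shape of the configuration is compared, so no secrets end up in the test.

```python
import json
from pathlib import Path


def config_keys(path: Path) -> set:
    """Return the set of top-level configuration keys in a JSON file."""
    return set(json.loads(path.read_text()).keys())


def test_staging_config_matches_production_shape(tmp_path):
    # In a real project these would be checked-in files such as
    # config/staging.json and config/production.json (hypothetical paths).
    staging = tmp_path / "staging.json"
    production = tmp_path / "production.json"
    staging.write_text(json.dumps({"db_url": "s", "feature_flags": {}}))
    production.write_text(json.dumps({"db_url": "p", "feature_flags": {}}))

    # A key present in one environment but missing in the other is exactly
    # the kind of discrepancy that otherwise only shows up in production.
    assert config_keys(staging) == config_keys(production)
```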

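The rollback lesson can be exercised in the same spirit. The sketch below assumes a hypothetical `deploy_with_rollback` helper and a post-deploy health check; none of these names come from a real system, they only illustrate testing the unhappy path of a release.

```python
def deploy_with_rollback(current_version: str, new_version: str, health_check) -> str:
    """Hypothetical deployment step: activate the new version, verify it,
    and fall back to the previous version if the health check fails."""
    active = new_version
    if not health_check(active):
        active = current_version  # roll back rather than leave a broken release live
    return active


def test_failed_health_check_triggers_rollback():
    # Simulate a release that passes CI but fails its post-deploy check.
    def always_unhealthy(version):
        return False

    assert deploy_with_rollback("1.0.3", "1.0.4", always_unhealthy) == "1.0.3"


def test_healthy_release_stays_active():
    def always_healthy(version):
        return True

    assert deploy_with_rollback("1.0.3", "1.0.4", always_healthy) == "1.0.4"
```
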
By addressing these areas (and I am sure they will), CrowdStrike can improve its resilience against outages, and the rest of us can also learn from this incident and reflect on our own teams' testing practices.
