What really happened at CrowdStrike and why their proposed plan won't guarantee this can't happen again.

Yesterday, CrowdStrike released a post that contained a preliminary analysis of the technical causes of last week’s outage.

That post finally sheds some light on how this incident occurred. More importantly, it lays out how CrowdStrike is planning to make sure this can’t happen again.

CrowdStrike’s plan is totally inadequate in my assessment.

In this post, I make sense of CrowdStrike’s explanation of what happened here and explain why their plan going forward is insufficient and what they should be doing instead.

The first part of this post is essentially an expansion of a 280-character tweet I wrote earlier today:

What we already knew

We already knew that CrowdStrike is a cyber security company, one of whose products is Falcon. Falcon is software that runs on the computers of CrowdStrike clients, looking for signs of cyber attack.

In the case of computers running Microsoft Windows, Falcon is incredibly tightly integrated into the most powerful and dangerous part of Windows: the kernel, the core of the operating system. The Windows kernel is dangerous insofar as a bug in it can crash the computer entirely, producing the dreaded Blue Screen of Death (BSOD). Contrast this with ordinary application programs like Adobe Reader or Google Chrome: a bug in Adobe Reader can cause Adobe Reader to crash, but is unlikely to interfere with other applications or with Windows itself.

We already knew that on July 19, CrowdStrike automatically deployed an update to Falcon that triggered a bug. Because Falcon runs as part of the Windows kernel, that bug crashed the kernel and so produced the BSOD on the roughly 8.5 million Windows computers running Falcon around the world. Worse, the bug re-triggered on every reboot, so these computers had no way to reliably restart without manual intervention.

CrowdStrike had told us already that the update was not a code change. Nor was it a kernel driver (software code that runs as part of the kernel). So what was it? CrowdStrike have characterised it as a dodgy data file. Their newest post sheds further light.

What went wrong, plainly

IDS Rules (Rapid Response Content)

CrowdStrike’s new post confirms that it was indeed a dodgy data file that caused the problem. But saying that that data file was not code is somewhat misleading. Let me explain why.

Falcon does various things. One of its core functions, however, is to act as an Intrusion Detection System (IDS). To do that, it runs inside the Windows kernel which allows it (in theory) to monitor everything that is happening on the computer on which it is installed. After all, the kernel controls everything that happens on a computer. So by running inside the kernel, Falcon has the ability to see everything if it wishes to.

IDSs work by monitoring data streams (e.g. what files are being opened, or what other computers your computer is talking to over the Internet), looking for signs of suspicious activity. They do so by applying detection rules which tell them what to look out for. For example, in the case of network intrusion detection, a detection rule might say to look out for any communication with other computers on the Internet known to be controlled by attackers. You may have seen the recent advisory from the Australian Signals Directorate and partners about APT40. These kinds of advisories often contain information about Internet computers used by these attackers. In the case of APT40, reporting by the US Cybersecurity and Infrastructure Security Agency (CISA) notes several Internet computers (domain names) used by that group.
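To make the idea concrete, here is a minimal sketch of a network detection rule of this kind. The domain names and the function name are invented for illustration; real IDS rules are far richer than a simple blocklist lookup.

```python
# Toy network IDS detection rule: flag connections to domains known to
# be controlled by attackers. The domains below are made up.
KNOWN_BAD_DOMAINS = {"malicious.example.net", "c2.example.org"}

def matches_rule(observed_domain: str) -> bool:
    """Return True if an observed connection matches the blocklist rule."""
    return observed_domain.lower() in KNOWN_BAD_DOMAINS

# Applying the rule to a stream of observed connections:
observed = ["cdn.example.com", "c2.example.org"]
alerts = [d for d in observed if matches_rule(d)]
```

An advisory like the APT40 one effectively supplies new entries for a set like `KNOWN_BAD_DOMAINS`, which is why IDS vendors push rule updates so frequently.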

What I can deduce from the CrowdStrike post is that the outage was caused by a new IDS rule that they deployed on July 19. The post uses the term “Rapid Response Content”, which seems to be CrowdStrike’s version of IDS rules. It’s worth quoting from the CrowdStrike post:

Rapid Response Content is used to perform a variety of behavioral pattern-matching operations on the sensor using a highly optimized engine. Rapid Response Content is a representation of fields and values, with associated filtering. This Rapid Response Content is stored in a proprietary binary file that contains configuration data. It is not code or a kernel driver.

This tells us that the IDS rules for Falcon are stored in a special binary format, and are internally represented as key/value pairs, much like other simple kinds of declarative programs. They instruct Falcon what to look for in the data streams that Falcon monitors.

To do the monitoring, Falcon acts as a rule interpreter. Essentially, IDS rules can be thought of as instructions. Falcon follows the instructions it is given to look for suspicious activity.
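A rule interpreter of this kind can be sketched very simply. This is not CrowdStrike's proprietary format, just an illustration of the general idea: rules are field/value pairs, and the interpreter checks each observed event against them.

```python
# Minimal sketch of a field/value rule interpreter (format invented for
# illustration). A rule matches when every field it specifies is equal
# to the corresponding value in the observed event.
def event_matches(rule: dict, event: dict) -> bool:
    return all(event.get(field) == value for field, value in rule.items())

# A rule watching for a specific named pipe being opened:
rule = {"operation": "open", "path": r"\\.\pipe\suspicious"}
event = {"operation": "open", "path": r"\\.\pipe\suspicious", "pid": 4242}
```

The point is that the "data file" drives behaviour: change the rule and the interpreter does something different, which is exactly why rules are better thought of as code.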

(Incidentally, therefore, I totally dispute CrowdStrike’s characterisation of Rapid Response Content as (non-code) data. These are IDS rules which are better thought of as code. All code is of course data.)

Rule Keyword Types (Template Types)

Quoting further, this seems to be confirmed:

Rapid Response Content is delivered as “Template Instances,” which are instantiations of a given Template Type. Each Template Instance maps to specific behaviors for the sensor to observe, detect or prevent. Template Instances have a set of fields that can be configured to match the desired behavior.

In other words, Template Types represent a sensor capability that enables new telemetry and detection, and their runtime behavior is configured dynamically by the Template Instance (i.e., Rapid Response Content).

While the layers of jargon have gotten only thicker, I would wager here that “Template Types” are similar in nature to different keyword types in traditional IDS rules. For example, just as (e.g.) the Suricata network IDS rule format has keywords to allow matching on (e.g.) the fingerprint of observed TLS certificates, the different “Template Types” seemingly allow Falcon to match on different kinds of observed behaviour.
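One way to picture the split between Template Types and Template Instances is that the type (shipped as sensor code) declares which fields exist, while an instance (shipped as content) supplies concrete values for those fields. The class and field names below are entirely invented; this is only a sketch of the relationship, not CrowdStrike's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TemplateType:
    """A sensor capability: declares which fields an instance may set."""
    name: str
    fields: tuple

# Hypothetical IPC (named pipe) Template Type:
IPC_TEMPLATE = TemplateType("named_pipe", ("operation", "pipe_name"))

def make_instance(ttype: TemplateType, **values) -> dict:
    """Instantiate a template; only the type's declared fields are allowed."""
    unknown = set(values) - set(ttype.fields)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return {"type": ttype.name, **values}

instance = make_instance(IPC_TEMPLATE, operation="create",
                         pipe_name=r"\\.\pipe\example")
```

On this reading, adding a new Template Type requires a sensor code update (new fields, new interpreter logic), while new Template Instances can be shipped as Rapid Response Content.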

Rolling out the new Template (Keyword) Type

CrowdStrike go on to explain that on February 28, they introduced a new Template Type to allow IDS rules (ahem, Rapid Response Content) to match on specific kinds of observed Inter-Process Communication (IPC) behaviour, specifically Windows named pipes. That of course required a code update to Falcon, to allow its rule interpreter to implement this new behaviour. We can only speculate but it is likely that the code update here would have included code for parsing (or understanding) instructions related to the new Template Type in the binary rule format, as well as code that interacted with the Windows kernel APIs to actually implement monitoring of named pipe activity (e.g. that would be triggered on named pipe creation, and so on).

A week later (March 5), CrowdStrike say they performed “stress testing” of the new IPC Template Type that they had rolled out code changes for a week earlier. This stress testing was presumably designed to make sure that there were no major issues in the new code that had been added on February 28.

I note as an aside that “stress testing” is a term most commonly used when carrying out performance testing rather than testing designed to uncover bugs. However, it is totally unclear what this testing entailed.

I am also baffled that the code change was rolled out before stress testing was performed, whatever it entailed.

Rolling out the rules

On the same day as the stress test, CrowdStrike say they released their first rule that made use of the new IPC Template Type (and so exercised the interpreter code for that rule that had been deployed on February 28).

In April, CrowdStrike then deployed three more new rules that used the new IPC Template Type, all without incident.

Rule Validation

Before getting to the rollout of the dodgy rule that caused so much havoc, it’s worth pausing to note that CrowdStrike mention that they did have a rule validator (or what they call a “Content Validator”).

Template Instances are created and configured through the use of the Content Configuration System, which includes the Content Validator that performs validation checks on the content before it is published.

The job of this rule validator seems to be to check that new rules that are being deployed (like those deployed in March and April that used this new feature) were valid. Exactly what “valid” means is not super clear but at a minimum it would typically include that the rule was not malformed, did not miss any required information or contain contradictory information, and so on.
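The kinds of checks a minimal rule validator might perform can be sketched as follows. The field names and the notion of validity here are assumptions for illustration; the real Content Validator operates on a proprietary binary format.

```python
# Sketch of a minimal "Content Validator" over a dict-based rule
# representation (invented for illustration). It checks that a rule is
# not malformed: no required fields missing, no nonsensical values.
REQUIRED_FIELDS = {"type", "operation", "pipe_name"}
KNOWN_OPERATIONS = {"create", "connect", "open"}

def validate_rule(rule: dict) -> list:
    """Return a list of validation errors; an empty list means the rule passes."""
    errors = []
    missing = REQUIRED_FIELDS - set(rule)
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    if "operation" in rule and rule["operation"] not in KNOWN_OPERATIONS:
        errors.append(f"unknown operation: {rule['operation']!r}")
    return errors
```

Crucially, a validator like this is only as good as the checks it contains: a rule with a defect the validator never looks for will sail straight through.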

Of course, we now know the validator was faulty, because if it had worked, this incident wouldn’t have occurred.

The dodgy rule

Indeed CrowdStrike confirms so:

On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

In other words, one of their rules was dodgy. But that wasn’t picked up by the rule validator, because the validator was dodgy too.

CrowdStrike also say that they had concluded that everything (the interpreter code for the new IPC Template Type, the rule validator, and the rules themselves) must have been OK, based on their problem-free usage up to that point.

It goes without saying that this was an incredibly naive assumption.

How can a dodgy rule cause a problem like this?

We don’t know. The following is pure speculation. However, some online analysis suggests that what might have happened is this.

The code in the interpreter for parsing (reading and understanding) the rules did not properly check that each rule contained all of the required information. The dodgy rule in question did not contain all of the information it should have. When reading the data from the dodgy rule into memory, the Falcon rule interpreter therefore didn’t put as much data into memory as it should have (because some of the data was missing), leaving uninitialised (junk) values in memory. Those junk values were later used when Falcon went to apply the rule, i.e. make use of it to look for suspicious IPC named pipe behaviour. This led to an invalid memory access (presumably because those junk values were pointers), and caused the Windows kernel to crash and BSOD.
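The speculated failure mode can be illustrated with a toy parser. Suppose the rule file declares how many entries it contains, and the parser trusts that count instead of checking the actual data. In Python the mistake below raises an `IndexError`; in C kernel code, the equivalent mistake reads past the real data into junk memory, and if those junk values are used as pointers, the kernel crashes. The format and counts here are invented for illustration.

```python
def parse_rule(declared_count: int, entries: list) -> list:
    """BUGGY parser: trusts the declared entry count in the rule header."""
    # If declared_count exceeds len(entries), this reads past the data.
    return [entries[i] for i in range(declared_count)]

def parse_rule_safe(declared_count: int, entries: list) -> list:
    """Defensive parser: refuses rules whose header and body disagree."""
    if declared_count != len(entries):
        raise ValueError("entry count mismatch: refusing to load rule")
    return list(entries)
```

A dodgy rule claiming 21 entries while containing only 20 would trigger the buggy path; the safe parser rejects the same rule outright instead of reading junk.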

Summary

On February 28, CrowdStrike deployed new code to their kernel driver that interprets the IDS rules, enabling IDS rules to match on IPC (named pipe) observations.

On March 5 they performed an internal “stress test” of that code. They then deployed a new IDS rule that uses this new feature.

Three more rules were deployed in April that used this new feature.

The dodgy rule was deployed on July 19, alongside another one that was OK. The dodgy rule passed the rule validator, even though it should not have, and triggered a memory safety bug in the new interpreter code deployed on February 28.

CrowdStrike’s Plan is Totally Insufficient

CrowdStrike have laid out a plan for making sure this won’t happen again.

It looks totally inadequate to me. Too little, too late.

Their plan covers both changes in how they will deploy future updates, and how they will engineer them.

On the deployment side, they include staggered (aka gradual) deployment for IDS rules, guided by performance data collection including from first-stage internal “canary” staging, and allowing customers to control how IDS rule updates are applied (presumably to allow customers to opt-out of certain updates if they are concerned about stability impacts).

These deployment practices should be standard for critical updates (which all CrowdStrike IDS rule updates are because their interpreter runs in the kernel) and CrowdStrike ought to have been doing these already.
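Staggered deployment is not complicated in principle. Here is a sketch of the ring-based rollout logic the post describes: content goes to an internal canary ring first, then to successively larger rings, halting when telemetry shows problems. The ring names and threshold are assumptions for illustration.

```python
# Sketch of staggered (ring-based) content rollout with a canary stage.
# Ring names and the crash-rate threshold are invented for illustration.
RINGS = ["internal_canary", "early_adopters", "broad", "everyone"]
CRASH_RATE_THRESHOLD = 0.001  # halt the rollout above a 0.1% crash rate

def next_ring(current_ring: str, observed_crash_rate: float):
    """Return the next ring to deploy to, or None to halt the rollout."""
    if observed_crash_rate > CRASH_RATE_THRESHOLD:
        return None  # halt and roll back: the canary caught a problem
    i = RINGS.index(current_ring)
    return RINGS[i + 1] if i + 1 < len(RINGS) else None
```

Had even a scheme this crude been in place on July 19, the dodgy rule would have BSOD'd an internal canary fleet rather than 8.5 million customer machines.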

The software engineering practices they have said they will adopt are:

  - Local developer testing
  - Content update and rollback testing
  - Stress testing, fuzzing and fault injection
  - Stability testing
  - Content interface testing

These should all be mandatory for any in-kernel code and they should have been doing all of these already, too.

CrowdStrike have also said they will “add additional validation checks” to their rule validator, including one that “is in process to guard against this type of problematic content from being deployed in the future”. They have also said they will “enhance error handling” in the rule interpreter.

This is totally insufficient for in-kernel code. Adding additional checks will not guarantee that this kind of problem cannot re-occur in future. After all, (in the absence of the steps I outline below) how can CrowdStrike be sure that the additional checks they add are really sufficient to rule out future problems? Fuzzing new kernel code, or other forms of automated testing cannot guarantee this won’t happen again in future.

This is especially important because Falcon contains a lot of code that runs in the kernel. Patrick Gray noted on yesterday’s Risky Business podcast that the Falcon kernel driver is over 4MB in size.

What Can CrowdStrike Do Instead?

Instead, CrowdStrike should (in order of priority):

  1. Move as much of Falcon out of the Windows kernel as possible. I said the same thing on The Conversation days ago.

  2. Formally specify what valid rules look like.

  3. Formally verify that the resulting, minimal, driver is memory-safe when it interprets valid rules.

  4. Formally verify that the rule validator is sound: any rule that it says is OK is valid.

These things are all totally do-able using 21st century safety critical software engineering methods.
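To see what the soundness property in step 4 actually says, it helps to state it executably: for every rule, if the validator accepts it, the rule must be valid according to the formal specification. The toy specification below (required fields present, operation well-typed) is purely illustrative; formal verification would prove this implication for all possible rules, not merely check it on samples as this sketch does.

```python
# Soundness of a validator, stated as an executable property.
# "Valid" here is a toy formal spec, invented for illustration.
REQUIRED = {"type", "operation"}

def spec_valid(rule: dict) -> bool:
    """The formal specification of rule validity (toy version)."""
    return REQUIRED <= set(rule)

def validator_accepts(rule: dict) -> bool:
    """The deployed validator. It may be stricter than the spec, never laxer."""
    return REQUIRED <= set(rule) and isinstance(rule.get("operation"), str)

def is_sound_on(rules) -> bool:
    """Soundness over a sample: everything accepted is spec-valid."""
    return all(spec_valid(r) for r in rules if validator_accepts(r))
```

A proof of this property, combined with a proof that the interpreter is memory-safe on all spec-valid rules, is what would actually guarantee that a dodgy rule can never crash the kernel again.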

CrowdStrike: if you’re reading this and want to know more, you know where to reach me.