Updated: 2024-07-21 08:45 AM AEST: Clarifying that the Microsoft Azure outage in its Central US region began before the CrowdStrike update was pushed, and so, despite claims to the contrary in The Conversation from an Australian academic, that outage was seemingly not caused by the CrowdStrike update.
Updated: 2024-07-21 07:08 AM AEST: Revised in the wake of blog posts from Microsoft and CrowdStrike.
In the last 24 hours there has been much written about the massive outage currently affecting Windows computers worldwide. This post is a quick attempt to cut through much of the confusion, to clear up some common misconceptions, and offer some explanations (all too often absent elsewhere).
If you’re reading this post you’ll no doubt already be aware that the outage has been caused by an update to the widely-deployed CrowdStrike Falcon software.
As I explained in my piece for The Conversation yesterday,
CrowdStrike is a US cyber security company with a major global share in the tech market. Falcon is one of its software products that organisations install on their computers to keep them safe from cyber attacks and malware.
Falcon is what is known as “endpoint detection and response” (EDR) software. Its job is to monitor what is happening on the computers on which it is installed, looking for signs of nefarious activity (such as malware). When it detects something fishy, it helps to lock down the threat.
Who is to blame?
Almost certainly not Microsoft
We have seen some reporting suggesting this is a Microsoft problem. It is true that it is affecting Microsoft (Windows) computers. But at this stage this does not look like Microsoft’s fault. The waters have been muddied by a simultaneous Microsoft Azure outage.
Updated 2024-07-21 08:47 AM AEST: It seems clear that the cause of the Azure outage in the Central US region was not the CrowdStrike update. Microsoft reports that the Azure outage started at 21:56 UTC on July 18. CrowdStrike says it pushed the faulty update at 04:09 UTC on July 19, about 6 hours after the Azure outage began.
I am therefore mystified by reporting in The Conversation published yesterday from an Australian academic that claimed:
As it has now turned out, the entire Azure outage could also be traced back to the CrowdStrike update.
Given the timeline above, I don’t see how that can be the case.
How can an update cause so many computers to break?
Updated: 2024-07-21 07:08 AM AEST: Microsoft estimates that the outage has impacted 8.5 million Windows machines, “less than one percent of all Windows machines”. That might not sound like a lot. (Indeed, some have quipped that at less than 1%, the impact of this event is not even a tenth that of the 1988 Morris worm, which was estimated to have impacted 10% of Internet-connected machines at the time.) However, when you consider the kinds of places where EDR is deployed, the disruption makes sense. EDR is deployed by security-conscious organisations on their critical assets to help protect them from attack. As I explained to The Conversation, CrowdStrike is not typically deployed on home PCs, for example. That means this outage was always going to hit the more critical systems, which is why its impacts were so large even though the total proportion of affected Windows machines seems low.
Because CrowdStrike Falcon is so widely deployed
Firstly, as the scope of the outage tells us, Falcon is incredibly widely-deployed. Cyber security professionals have spent the past two decades convincing the IT industry that EDR technology is necessary to keep organisations safe from cyber attack. If you read the guidance offered to organisations about how to secure themselves, including government-issued guidance like the Australian Government’s Essential Eight or similar guidance offered by cyber security vendors, you’ll see that EDR is often recommended. Many companies have listened, and so have adopted this technology.
It must be stressed that overall, this is a good thing. A world in which EDR technology is widely-deployed is, on balance, more secure than one without it. However, this incident highlights how EDR vendors—like all software vendors—need to be careful when deploying automated updates.
Because the CrowdStrike Falcon update was deployed automatically
CrowdStrike auto-updates its software not only to ensure that the latest security fixes are applied (the same reason that Microsoft Windows auto-updates itself, as do so many other pieces of modern software), but also, being a security solution, for the same reason that anti-virus software does: to make sure that it can detect the latest threats. This is of course a good thing for security. Much like EDR itself, software that automatically updates itself keeps us all more secure.
However, software updates are not without risk.
Software going wrong when a bad update is automatically deployed is not especially uncommon. For this reason, software vendors carefully test updates before deploying them to customers.
Some vendors will deploy updates to small subsets of customers to make sure nothing goes wrong when they are deployed at scale, before deploying them to their entire customer base.
Clearly, something has gone wrong in this process for the latest CrowdStrike update. It should have been carefully tested. Ideally, it would have been pushed out to only a small fraction of users before being more widely-deployed, which would have allowed this issue to have been caught without causing so much damage.
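To make the idea of a staged rollout concrete, here is a minimal sketch in Python (purely illustrative—this is not CrowdStrike's actual deployment pipeline, and all names and thresholds are made up) of pushing an update to progressively larger cohorts and halting if crash telemetry from the already-updated hosts looks bad:

```python
# Hypothetical staged (canary) rollout: push an update to progressively larger
# fractions of a fleet, halting if crash telemetry from already-updated hosts
# exceeds a safety threshold. Purely illustrative; names are made up.

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.0]   # 1% canary, then 5%, 25%, then everyone
CRASH_RATE_THRESHOLD = 0.001               # abort if more than 0.1% of updated hosts crash

def crash_rate(hosts):
    """Fraction of updated hosts reporting a crash (stands in for real telemetry)."""
    return sum(1 for h in hosts if h["crashed"]) / max(len(hosts), 1)

def staged_rollout(fleet, apply_update):
    updated = []
    for stage in ROLLOUT_STAGES:
        target = int(len(fleet) * stage)
        for host in fleet[len(updated):target]:
            apply_update(host)
            updated.append(host)
        if crash_rate(updated) > CRASH_RATE_THRESHOLD:
            print(f"Halting rollout at {stage:.0%}: crash rate too high")
            return False
    print("Rollout completed to 100% of the fleet")
    return True

# A toy fleet and a deliberately faulty update that crashes every host:
# the rollout should stop at the first (1%) canary stage.
fleet = [{"id": i, "crashed": False} for i in range(10_000)]
staged_rollout(fleet, lambda host: host.update(crashed=True))
```

The point of the sketch is simply that a bad update caught at the 1% stage damages a hundred machines, not ten thousand.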
Because CrowdStrike Falcon is a highly privileged piece of software
All software vendors have a responsibility to make sure their updates won’t break things. This is especially true for highly-privileged software like CrowdStrike Falcon. As I tried to explain to The Conversation, in order for Falcon to do its job it has to be tightly integrated into the deepest layers of Microsoft Windows. In particular, Falcon runs inside the Windows kernel, which is the most privileged and critical part of the Windows operating system.
A bug in Falcon therefore has the potential to bring down Windows in its entirety. A really bad kernel bug will manifest itself again each time a computer is restarted. This is exactly what we’ve seen with this bug, and why it has bricked so many of the world’s computers, rendering them inoperable.
Updated 2024-07-21 07:25 AM AEST: CrowdStrike have provided some partially intelligible details consistent with this analysis.
Each time a computer that has this faulty update is restarted, the kernel again reads the faulty update file, causing the machine to crash before it has had a chance to properly boot up. This has made it very difficult for system administrators to fix the problem without manually intervening on each affected computer—an incredibly time-consuming process.
That is the reason that while this problem was caused by a remote, automated software update, it is not so easy to fix it automatically.
What are the likely impacts of this outage?
Very bad
This outage is unprecedented in its scale and severity. Many are calling it the worst outage of all time.
Because it cannot be easily fixed automatically, it is going to take time before it can be fully resolved.
In the meantime, as I said on Twitter earlier today, the consequences are likely to be severe.
It doesn’t take much disruption to an already over-stretched hospital or emergency response system to cause major harms. That is the reason that when automatic ambulance dispatch systems fail, people die. It is the reason that when ransomware infects hospitals, people die.
We should therefore expect similar outcomes here. Given the massive scale of this outage, which is already impacting hospitals and emergency dispatch systems, I fully expect the consequences to be catastrophic. I don’t think it is an exaggeration or hyperbole to say so. Time will tell, however.
In the meantime, every sysadmin or IT team that is scrambling to lessen the impact of this incident deserves our praise.
I’ve heard that rebooting 15 times might help. WTF is that about?
Because your computer is trying to win a race (really)
This advice has now been widely reported. It sounds ridiculous but in some cases it does make sense.
CrowdStrike has deployed a fix for the problem. In theory, if an affected computer can stay up and running long enough to receive the fix over the network, then the problem can be resolved.
What happens, however, is that when an affected computer starts up, it will often crash before it has had a chance to download the new update from CrowdStrike, preventing the update from being applied.
Each reboot gives your computer another chance to download the update before it crashes, so rebooting many times increases the odds that it will eventually succeed.
When an affected computer starts up, it is effectively in a race: either it manages to download and apply the new fix from CrowdStrike, or it crashes before it has a chance to do so.
This is why the advice is most effective for computers that can quickly establish an Internet connection when they start up, such as those on wired Ethernet, as opposed to WiFi connections that take longer to establish.
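For intuition, here is a toy simulation in Python of the race described above. The probabilities are invented purely for illustration (the real odds depend on network speed and timing); each reboot is treated as an independent chance to grab the fix before the crash.

```python
import random

# Toy model of the boot-time race (probabilities are invented for illustration).
# On each boot there is a short window before the faulty driver crashes the
# machine; if the fix is downloaded within that window, the host recovers.
# A faster network connection effectively raises the per-boot probability.

def reboots_until_fixed(p_fix_before_crash, max_reboots=100):
    """Number of reboots needed before the fix wins the race, or None if it never does."""
    for attempt in range(1, max_reboots + 1):
        if random.random() < p_fix_before_crash:
            return attempt
    return None

random.seed(1)
# With, say, a 10% chance per boot, roughly 80% of machines recover within 15 reboots
# (1 - 0.9**15 ≈ 0.79), which is why "reboot up to 15 times" is not crazy advice.
trials = [reboots_until_fixed(0.10) for _ in range(10_000)]
fixed_within_15 = sum(1 for t in trials if t is not None and t <= 15) / len(trials)
print(f"Fraction of simulated machines fixed within 15 reboots: {fixed_within_15:.1%}")
```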
Should organisations stop mandating EDR?
Probably not
Some (notably some computer science academics) have advocated or implied that organisations should not mandate the use of EDR. Some folks (perhaps tongue-in-cheek) equate EDR products like CrowdStrike to rootkits or to spyware, or are otherwise suspicious of them. This is because, as I explained earlier, these tools are highly-privileged.
EDR products by design need to be able to monitor what your computer is doing (much like spyware, or even traditional anti-virus).
We should demand the highest standards of safety and security from software vendors who write highly-privileged software. Security vendors are no exception—in fact they ought to be producing the most secure software (even if history tells us otherwise).
But, as I mentioned earlier, a world in which no organisation deploys EDR is one in which we are all much less safe. For just one example of why: EDR solutions are the foundation of much work in incident response. When a cyber intrusion occurs, investigators need to have high quality evidence of what has occurred. EDR solutions are one way that such evidence is gathered. Without them, cyber defenders would have much lower quality information about the threats their organisations are facing. High quality threat intelligence is crucial for strong cyber defence. EDR is a crucial piece of that puzzle—even if you discount its role in helping to directly address threats as they arise.
Updated 2024-07-21 07:19 AM AEST: Thank you to Adrian Herrera who pointed out CrowdStrike’s value as a threat intelligence service. Much of that intelligence comes from sensors like Falcon, deployed across organisations all over the world, which together provide a collective early warning signal about emerging cyber threats. (I have no idea, but I have always assumed that this is why the company is called CrowdStrike: a reference to the wisdom of the crowd and crowdsourcing, in this context for threat intelligence.)
How did CrowdStrike fail so badly here?
Updated 2024-07-21 07:31 AM AEST: CrowdStrike’s post and other reporting imply that they did not perform adequate quality assurance testing on the offending update because it was not a code change but rather a configuration change that triggered a software bug. It seems likely that CrowdStrike were operating under the assumption that configuration changes could not cause failures and therefore did not need to be subject to the same kinds of quality assurance processes as code changes. Clearly this assumption was wrong. (Indeed, computer science tells us that data can do as much damage as code, because it influences the behaviour of code.)
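As a generic illustration of that last point—this is not CrowdStrike's code, just a sketch of the general failure mode, with made-up rule names—here is how a "data-only" configuration update can crash code that was only ever tested against well-formed input:

```python
# Generic illustration (not CrowdStrike's code) of a "data-only" change crashing
# code that was only ever tested against well-formed configuration.
# The parser assumes every rule line has exactly three comma-separated fields.

def load_rules(config_lines):
    rules = []
    for line in config_lines:
        name, pattern, action = line.split(",")  # blows up if a rule is malformed
        rules.append({"name": name, "pattern": pattern, "action": action})
    return rules

good_config = ["r1,evil.exe,block", "r2,bad.dll,quarantine"]
print(load_rules(good_config))  # the code works fine with well-formed data

bad_config = ["r1,evil.exe,block", "r3;evil2.exe;block"]  # a malformed rule ships in an update
try:
    load_rules(bad_config)
except ValueError as exc:
    print(f"Crash triggered by a configuration change alone: {exc}")
```

No code changed between the two runs; only the data did, which is why configuration updates deserve the same testing rigour as code changes.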
We’ll have to wait for more details from CrowdStrike to be sure.
I’ll endeavour to keep this post updated as this story develops.
In the meantime, buy your CISO or IT team a coffee. They’re in for a rough weekend.