CrowdStrike debacle for Microsoft Windows Update

  • Thread starter Wrichik Basu
In summary, the recent CrowdStrike debacle involved a faulty software update that caused Windows systems to crash and display the "blue screen of death." This resulted in widespread frustration and criticism of both Microsoft and CrowdStrike. The issue was eventually resolved, but it drew attention to the risks and vulnerabilities of software updates and the importance of proper testing and quality control.
  • #2
CrowdStrike was the source of today's problem, crashing customers' Windows machines and the services running on them.

Who is daft enough to send out an update to all of its customers worldwide simultaneously without testing it? CrowdStrike should have sent the update to the customers in one (small) region first, then waited for news of any problems after it was installed.

Or, better still, just tested it on a few machines of their own held in the cloud to see what happens!
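To sketch what I mean (purely illustrative C++ with made-up names, not anything from CrowdStrike's actual pipeline): the update goes first to the vendor's own test fleet, then to a small slice of customers, and only then to everyone, with each promotion gated on telemetry from the previous stage.

```cpp
// Illustrative sketch of a staged (canary) rollout gate. All names are
// hypothetical; this is not CrowdStrike code.
#include <cstdint>
#include <functional>
#include <string>

// Deterministically map a customer ID to a bucket in [0, 100).
int rolloutBucket(const std::string& customerId) {
    return static_cast<int>(std::hash<std::string>{}(customerId) % 100);
}

// Stage 0: vendor's own test machines only; stage 1: 1% of customers;
// stage 2: 10%; stage 3: everyone. Promotion to the next stage would only
// happen after crash telemetry from the current stage looks clean.
bool shouldReceiveUpdate(const std::string& customerId, int stage) {
    static const int kStagePercent[] = {0, 1, 10, 100};
    if (stage <= 0) return false;  // internal test fleet only
    if (stage >= 3) return true;   // general availability
    return rolloutBucket(customerId) < kStagePercent[stage];
}
```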
 
  • #3
Who says they didn't test it?
 
  • #4
Vanadium 50 said:
Who says they didn't test it?
Could it have caused this big a problem if it had been tested reasonably well? Bugs can get through any testing, but it seems strange that a bug that took down so many computers wasn't detected in tests.
 
  • #5
I am willing to believe it wasn't tested adequately. The mess speaks for itself. I am questioning the claim that it wasn't tested at all. That some computers were affected and others were not seems to me to be suggestive and significant.
 
  • #6
Vanadium 50 said:
I am willing to believe it wasn't tested adequately. The mess speaks for itself. I am questioning the claim that it wasn't tested at all.
Yes, I agree. I can't believe that a software company wouldn't have organizational testing requirements.
Vanadium 50 said:
That some computers were affected and others were not
I didn't know that.
Vanadium 50 said:
seems to me to be suggestive and significant.
Right.
 
  • #7
FactChecker said:
I didn't know that.
As far as one can tell by asking around, most of the engineering/CAD PCs went down, but only maybe half of the "business desktop" PCs. (Most of the scientists use Linux and so were not impacted.)

Such machines are configured differently - CAD stations tend to be newer and better provisioned. The "business" machines tend to have less memory, integrated graphics, fewer cores, etc. So maybe there is something there. Maybe it's just coincidence.
 
  • #8
Vanadium 50 said:
As far as one can tell by asking around, most of the engineering/CAD PCs went down, but only maybe half of the "business desktop" PCs. (Most of the scientists use Linux and so were not impacted.)

Such machines are configured differently - CAD stations tend to be newer and better provisioned. The "business" machines tend to have less memory, integrated graphics, fewer cores, etc. So maybe there is something there. Maybe it's just coincidence.
I mean... maybe some use CrowdStrike and some don't? Do we actually know whether it did or didn't affect all updated CrowdStrike PCs? Google tells me half of the Fortune 500 use it.

Or...are CAD machines more likely than business machines to be powered on at 2am?
 
  • #9
I don't work in IT, so I can't tell you. I can say that all machines are centrally managed and are supposed to stay on at night. Whether engineers are more compliant than accountants, I can't really say.

It strikes me as unlikely that IT would decide that this machine needs protection and that one does not, but it is a logical possibility. All I know for sure is that not every machine was impacted, and there seems to be a correlation between which ones were and were not hit. (And we all know about correlation and causality)
 
  • #10
Vanadium 50 said:
Who says they didn't test it?
They uploaded three files full of zeros!

See the above for an explanation. Do you think it was tested properly?
 
  • #11
LONDON (Reuters) -A software bug in CrowdStrike's quality control system caused the software update that crashed computers globally last week, the U.S. firm said on Wednesday, as losses mount following the outage which disrupted services from aviation to banking.

The extent of the damage from the botched update is still being assessed. On Saturday, Microsoft said about 8.5 million Windows devices had been affected, and the U.S. House of Representatives Homeland Security Committee has sent a letter to CrowdStrike CEO George Kurtz asking him to testify.

The financial cost was also starting to come into focus on Wednesday. Insurer Parametrix said U.S. Fortune 500 companies, excluding Microsoft, will face $5.4 billion in losses as a result of the outage, and Malaysia's digital minister called on CrowdStrike and Microsoft to consider compensating affected companies.

The outage happened because CrowdStrike's Falcon Sensor, an advanced platform that protects systems from malicious software and hackers, contained a fault that forced computers running Microsoft's Windows operating system to crash and show the "Blue Screen of Death".

"Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data," CrowdStrike said in a statement, referring to the failure of an internal quality control mechanism that allowed the problematic data to slip through the company's own safety checks.
https://finance.yahoo.com/news/crowdstrike-says-bug-quality-control-104409302.html
https://news.yahoo.com/news/finance/news/crowdstrike-says-bug-quality-control-104409772.html

DrJohn said:
Do you think it was tested properly?
Clearly, it wasn't properly tested. WHQL? P-code? Boot-start driver?

https://en.wikipedia.org/wiki/P-code_machine#Microsoft_P-Code

https://learn.microsoft.com/en-us/windows-hardware/drivers/install/installing-a-boot-start-driver
 
  • #12
Astronuc said:
Clearly, it wasn't properly tested. WHQL? P-code? Boot-start driver?
It sounds like there is a conflict of requirements between rapid security patches and thorough testing, and that CrowdStrike bypassed the standard test process for speed.
That being said, it also sounds like CrowdStrike had done shockingly little testing, if any.
 
  • #13
FactChecker said:
It sounds like there is a conflict of requirements between rapid security patches and thorough testing, and that CrowdStrike bypassed the standard test process for speed.
That being said, it also sounds like CrowdStrike had done shockingly little testing, if any.
From https://www.cnn.com/2024/07/24/tech/crowdstrike-outage-cost-cause/index.html:
CrowdStrike said that the testing and validation system that approved the bad software update had appeared to function normally for other releases made earlier in the year. But it pledged Wednesday to keep software glitches like last week’s from happening again, and to publicly release a more detailed analysis when it becomes available. The company added that it is developing a new check for its validation system “to guard against this type of problematic content from being deployed in the future.”

Also, their proposed new "approach" sounds way overdue:
And CrowdStrike said it also plans to move to a staggered approach to releasing content updates so that not everyone receives the same update at once, and to give customers more fine-grained control over when the updates are installed.

As for the cost of the failure:
What’s been described as the largest IT outage in history will cost Fortune 500 companies alone more than $5 billion in direct losses, according to one insurer’s analysis of the incident published Wednesday.

Are CrowdStrike's pockets deep enough if they're held financially liable for this?
 
  • #14
renormalize said:
CrowdStrike said that the testing and validation system that approved the bad software update had appeared to function normally for other releases made earlier in the year. But it pledged Wednesday to keep software glitches like last week’s from happening again, and to publicly release a more detailed analysis when it becomes available. The company added that it is developing a new check for its validation system “to guard against this type of problematic content from being deployed in the future.”
It sounds like they need an independent partner, perhaps Microsoft, to do the testing, or to design and implement a more robust testing method. It seems whatever they did was somewhat unorthodox with respect to drivers in the Windows kernel, if I understand Dave Plummer's explanation in the video above. Maybe CS should hire Dave!

Engadget explanation - CrowdStrike blames bug that caused worldwide outage on faulty testing software
The faulty update caused an out-of-bounds memory read that triggered an 'unrecoverable exception.'
https://www.engadget.com/crowdstrik...age-on-faulty-testing-software-120057494.html
CrowdStrike has blamed faulty testing software for a buggy update that crashed 8.5 million Windows machines around the world, it wrote in a post-incident review (PIR). "Due to a bug in the Content Validator, one of the two [updates] passed validation despite containing problematic data," the company said. It promised a series of new measures to avoid a repeat of the problem.

The problem forced Windows machines into a boot loop, with technicians requiring local access to machines to recover.

To prevent DDoS and other types of attacks, CrowdStrike has a tool called the Falcon Sensor. It ships with content that functions at the kernel level (called Sensor Content) that uses a "Template Type" to define how it defends against threats. If something new comes along, it ships "Rapid Response Content" in the form of "Template Instances."

A Template Type for a new sensor was released on March 5, 2024 and performed as expected. However, on July 19, two new Template Instances were released and one (just 40KB in size) passed validation despite having "problematic data," CrowdStrike said. "When received by the sensor and loaded into the Content Interpreter, [this] resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)."

To prevent a repeat of the incident, CrowdStrike promised to take several measures. First is more thorough testing of Rapid Response Content, including local developer testing, content update and rollback testing, stress testing, stability testing and more. It's also adding validation checks and enhancing error handling.
. . . .
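For anyone wondering what "problematic content data" slipping past a validator means in practice, here is a minimal sketch (invented record layout, not CrowdStrike's actual Template Instance format) of the kind of bounds check a content validator is supposed to enforce before the sensor's interpreter ever touches the data:

```cpp
// Hypothetical pre-load validation of a content blob. The layout is invented
// for illustration only; it is not CrowdStrike's real format.
#include <cstdint>
#include <vector>

struct RecordHeader {
    uint32_t offset;  // where this record claims its payload starts
    uint32_t length;  // how many bytes it claims to occupy
};

// Reject content whose records point outside the blob, so the interpreter
// can never be driven into an out-of-bounds read by the data itself.
bool validateContent(const std::vector<uint8_t>& blob,
                     const std::vector<RecordHeader>& records) {
    for (const auto& r : records) {
        const uint64_t end = static_cast<uint64_t>(r.offset) + r.length;
        if (end > blob.size()) {
            return false;  // problematic data: refuse to ship or load it
        }
    }
    return true;
}
```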
 
  • #15
Delta has to manually reset each computer system affected by the mass IT outage
https://news.yahoo.com/news/delta-manually-reset-computer-system-105716336.html

As most airlines began to recover following the technology issues, Delta has struggled to restore operations to full capacity, now resorting to manually resetting each system affected by the meltdown.

In a statement on Monday, Delta said over half of its IT systems globally are Windows-based.

"The CrowdStrike error required Delta's IT teams to manually repair and reboot each of the affected systems, with additional time then needed for applications to synchronize and start communicating with each other," the statement said.

It explained that the airline's crew-tracking-related tool, which ensures flights are fully staffed, has needed manual support to repair.

On Sunday, Delta CEO Ed Bastian said the critical tool could not effectively process the number of changes prompted by the Microsoft Windows operating system's shutdown.
Ouch!
 
  • #16
My immediate reaction is that there need to be more players in this space than CrowdStrike.
 
  • #17
My immediate reaction: in gardening/botany it's common knowledge that too many clones in/from the greenhouse is just asking for trouble.
 
  • #18
Too many clones almost wiped out bananas. But the conclusion there was that the thing to do was to switch from one clone to another. The problem with "we need more products" is how you tell people they can't use their preferred product "for the common good".

The problem is that we want two incompatible things - we want instantaneous protection from new attacks, ideally before Day Zero, and we want every bit of code tested for days or weeks before rolling it out. Can't have both.

There is a design feature of Windows (and Unix) that makes this worse. Windows uses two protection rings: Ring 0, or kernel mode, and user mode, nominally Ring 1 but actually Ring 3. Ring 3 problems crash the app. Ring 0 problems crash the system. ClownStrike... er... CrowdStrike needs to examine everything that is running in Ring 3, so Windows needs to run it in Ring 0.

However, it only needs read access. You could, if Windows supported it, run it in Ring 2, giving it read-only access to everything that was running. Then if it crashed, it wouldn't take everything else down with it. You'd be running unprotected, which is bad, but you'd be running.
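To make the Ring 3 vs. Ring 0 point concrete, here is a minimal user-mode sketch (MSVC-specific, purely illustrative): in Ring 3, Windows can hand the access violation back to the faulting process, so at worst that one process dies; the identical fault inside a Ring 0 driver has nowhere to go and bugchecks the whole machine.

```cpp
// User-mode (Ring 3) demonstration: an access violation is delivered to the
// faulting process and, at worst, kills only that process.
#include <windows.h>
#include <cstdio>

int main() {
    __try {
        volatile int* p = nullptr;
        int x = *p;  // access violation, raised in Ring 3
        std::printf("%d\n", x);
    } __except (GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
                    ? EXCEPTION_EXECUTE_HANDLER
                    : EXCEPTION_CONTINUE_SEARCH) {
        std::printf("Contained: only this process was ever at risk.\n");
    }
    // The same dereference inside a Ring 0 driver cannot be contained this
    // way; the kernel bugchecks and the machine shows a BSOD.
    return 0;
}
```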
 
  • #19
CrowdStrike has now admitted that deploying their software worldwide in one go is a bad idea, and they will switch to deploying it in stages, so the whole world does not go down at once.

Gosh, what a clever idea - one that I suggested as soon as I read they had deployed it worldwide in one go. Perhaps they will even follow my other suggestion: rent a bunch of machines in the cloud for their own use and deploy only to that group for testing, before it goes out to everyone else. These machines in the cloud could be as simple as a few test databases and a few test webservers that then go through some tests to check they are all working the same way as last week.
 
  • #20
And if the machines not at the head of the line get infected while they are waiting their turn, is that also CrowdStrike's fault?
 
  • #21
One person on Twitter said that the issue was due to referencing a non-existent memory address in the underlying C++ code:



CrowdStrike also releases software for Linux and MacOS. However, thanks to CrowdStrike already adopting eBPF on Linux, such issues are much less likely to occur. This article throws some light on this aspect: https://brendangregg.com/blog/2024-07-22/no-more-blue-fridays.html

Microsoft's eBPF support for Windows is not yet production-ready.
 
  • #22
Wrichik Basu said:
Microsoft's eBPF support for Windows is not yet production-ready.
Production ready? Like CrowdStrike was?

Wrichik Basu said:
One person on Twitter said that the issue was due to referencing a non-existent memory address in the underlying C++ code
Normally I would comment that "some guy on Twitter" is a not-so-reliable source, but a) in this case he is right, and b) the vast majority of crashes come from uninitialized or otherwise bad pointers being dereferenced, so it's a no-brainer.

In this case, there is a pointer that is a base plus an offset, but the base was never initialized, so it points nowhere. When the code attempts to dereference that pointer, it goes boom.
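In concrete (and purely illustrative) terms, with made-up names rather than CrowdStrike's actual code, the failure mode looks like this:

```cpp
// Base-plus-offset pointer with an uninitialized (null) base. Invented names;
// not CrowdStrike's actual code.
#include <cstddef>
#include <cstdint>

struct TemplateEntry {
    uint32_t flags;
    uint32_t value;
};

uint32_t readEntry(const TemplateEntry* base, std::size_t index) {
    // If 'base' was never initialized and is still null, base + index is a
    // bogus address near zero, not a valid entry.
    const TemplateEntry* entry = base + index;
    // Dereferencing it is the access violation. In user mode that crashes the
    // process; in a kernel-mode driver it brings down the whole system (BSOD).
    return entry->value;
}
```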
 
  • #23
I'm wondering if this was a "parting gift" from a disgruntled employee leaving the company.
 
  • #24
They may wish so, but I really doubt that.
 
  • #25
rcgldr said:
I'm wondering if this was a "parting gift" from a disgruntled employee leaving the company.
Never attribute to malice what can be explained by incompetence; the reason for this is that there are limits to malice.
 
  • #26
Vanadium 50 said:
Production ready? Like CrowdStrike was?
Unfortunately, I am not a Microsoft employee, so I can't comment on the details.
 
  • #27
https://www.yahoo.com/news/crowdstrike-outage-latest-signal-united-160300130.html

CrowdStrike says it’s not to blame for the recent flight chaos in the U.S.

The cybersecurity firm said Sunday it had minimal legal liability over the disruption in mid-July.

Over the weekend, CrowdStrike reiterated its apology.

But in a letter from a lawyer, it also said it was disappointed by any allegation that it was negligent.

The firm says it reached out to Delta to offer assistance when the outage occurred, but never heard back.

Now it says the carrier should explain why it turned down free onsite help, and why rival airlines were able to get their systems back online much more quickly.
Don't blame the customer for one's negligence.
 
  • #28
Astronuc said:
The firm says it reached out to Delta to offer assistance when the outage occurred, but never heard back.
They sent an email, I guess?

CrowdStrike says it’s not to blame for ...
had minimal legal liability ...
Those statements are not even on the same axis.
 
  • #29
Astronuc said:
Don't blame the customer for one's negligence.

"It's your fault, customer. You should have picked a reputable vendor."
Not exactly a winning argument.
 
  • #30
Wrichik Basu said:
Unfortunately, I am not a Microsoft employee, so I can't comment on the details.
Why would that help? CrowdStrike is not a Microsoft product. It is a separate, publicly traded company.
 
  • #31
I'd add that it's about Darwinism as it applies to software and IT. You need a diverse, competitive industry pushing for a better product, which provides checks and pressure on monopolistic players. You need a village of smart people and companies competing in a free market; that redundancy and competition is what makes the product landscape robust. It's all about competition, a free market, and battling the complacency that comes with being comfortable and monopolistic in any industry. A fair, unbiased, ruthless landscape that is always checking itself is what produces compelling and exceptional outcomes.
 