Crowdstrike debacle for Microsoft Windows Update

Wrichik Basu · Jul 19, 2024

After today's major outage and bootloop in Windows, we have yet another reason to prefer Linux to Windows.

DrJohn · Jul 19, 2024

CrowdStrike was the source of today's problem, crashing customers' services in the Windows cloud system.

Who is daft enough to send out an update to all its customers worldwide simultaneously without testing it? CrowdStrike should have sent the update to one (small) region's customers, then waited for news of any problems when installed.

Or very very much better still, just tested it on a few machines of their own held in the cloud just to see what happens!

Vanadium 50 · Jul 19, 2024

Who says they didn't test it?

FactChecker · Jul 19, 2024

Vanadium 50 said:

Who says they didn't test it?

Could it be causing this big a problem if it had been tested reasonably well? Bugs can get through any testing, but it seems strange that this bug took so many computers are down and didn't get detected in tests.

Vanadium 50 · Jul 19, 2024

I am willing to believe it wasn't tested adequately. The mess speaks for itself. I am questioning the claim that it wasn't tested at all. That some computers were affected and others were not seems to me to be suggestive and significant.

FactChecker · Jul 19, 2024

Vanadium 50 said:

I am willing to believe it wasn't tested adequately. The mess speaks for itself. I am questioning the claim that it wasn't tested at all.

Yes, I agree. I can't believe that a software company wouldn't organizationally have testing requirements.

Vanadium 50 said:

That some computers were affected and others were not

I didn't know that.

Vanadium 50 said:

seems to me to be suggestive and significant.

Right.

Vanadium 50 · Jul 19, 2024

FactChecker said:

I didn't know that.

As far as one can tell by asking around, most of the engineering/CAD PCs went down,but only maybe half of the "business desktop" PCs. (Most of the scientisist use Linux and so were not impacted)

Such machines are configured differently - CAD stations tend to be newer and better provisioned. The "business" machines tend to have less memory, integrated graphics, fewer cores, etc. So maybe there is something there. Maybe its just coincidence.

russ_watters · Jul 19, 2024

Vanadium 50 said:

As far as one can tell by asking around, most of the engineering/CAD PCs went down,but only maybe half of the "business desktop" PCs. (Most of the scientisist use Linux and so were not impacted)

Such machines are configured differently - CAD stations tend to be newer and better provisioned. The "business" machines tend to have less memory, integrated graphics, fewer cores, etc. So maybe there is something there. Maybe its just coincidence.

I mean...maybe some use Crowdstrike and some don't? Do we actually know if it did or didn't affect all updated Crowdstrike pcs? Google tells me half of the Fortune 500 use it.

Or...are CAD machines more likely than business machines to be powered on at 2am?

Vanadium 50 · Jul 20, 2024

I don't work in IT so can't tell you. I can say that all machines are centrally managed, and are supposed to stay on at night. Whether engineers are more compliant than accountants, I can't really say.

It strikes me as unlikely that IT would decide that this machine needs protection and that one does not, but it is a logical possibility. All I know for sure is that not every machine was impacted, and there seems to be a correlation between which ones were and were not hit. (And we all know about correlation and causality)

DrJohn · Jul 23, 2024

Vanadium 50 said:

Who says they didn't test it?

They uploaded three files full of zeros!

See the above for an explanations. Do you think it was tested properly?

Astronuc · Jul 24, 2024

LONDON (Reuters) -A software bug in CrowdStrike's quality control system caused the software update that crashed computers globally last week, the U.S. firm said on Wednesday, as losses mount following the outage which disrupted services from aviation to banking.

The extent of the damage from the botched update is still being assessed. On Saturday, Microsoft said about 8.5 million Windows devices had been affected, and the U.S. House of Representatives Homeland Security Committee has sent a letter to CrowdStrike CEO George Kurtz asking him to testify.

The financial cost was also starting to come into focus on Wednesday. Insurer Parametrix said U.S. Fortune 500 companies, excluding Microsoft, will face $5.4 billion in losses as a result of the outage, and Malaysia's digital minister called on CrowdStrike and Microsoft to consider compensating affected companies.

The outage happened because CrowdStrike's Falcon Sensor, an advanced platform that protects systems from malicious software and hackers, contained a fault that forced computers running Microsoft's Windows operating system to crash and show the "Blue Screen of Death".

"Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data," CrowdStrike said in a statement, referring to the failure of an internal quality control mechanism that allowed the problematic data to slip through the company's own safety checks.

https://finance.yahoo.com/news/crowdstrike-says-bug-quality-control-104409302.html
https://news.yahoo.com/news/finance/news/crowdstrike-says-bug-quality-control-104409772.html

DrJohn said:

Do you think it was tested properly?

Clearly, it wasn't properly tested. WHQL? P-code? Boot-start driver?

https://en.wikipedia.org/wiki/P-code_machine#Microsoft_P-Code

https://learn.microsoft.com/en-us/windows-hardware/drivers/install/installing-a-boot-start-driver

FactChecker · Jul 24, 2024

Astronuc said:

Clearly, it wasn't properly tested. WHQL? P-code? Boot-start driver?

It sounds like there is a conflict of requirements between rapid security patches versus thorough testing and CrowdStrike had bypassed the standard test process for speed.
That being said, it also sounds like CrowdStrike had done shockingly little testing, if any.

renormalize · Jul 24, 2024

FactChecker said:

It sounds like there is a conflict of requirements between rapid security patches versus thorough testing and CrowdStrike had bypassed the standard test process for speed.
That being said, it also sounds like CrowdStrike had done shockingly little testing, if any.

From https://www.cnn.com/2024/07/24/tech/crowdstrike-outage-cost-cause/index.html:
CrowdStrike said that the testing and validation system that approved the bad software update had appeared to function normally for other releases made earlier in the year. But it pledged Wednesday to keep software glitches like last week’s from happening again, and to publicly release a more detailed analysis when it becomes available. The company added that it is developing a new check for its validation system “to guard against this type of problematic content from being deployed in the future.”

Also, their proposed new "approach" sounds way overdue:
And CrowdStrike said it also plans to move to a staggered approach to releasing content updates so that not everyone receives the same update at once, and to give customers more fine-grained control over when the updates are installed.

As for the cost of the failure:
What’s been described as the largest IT outage in history will cost Fortune 500 companies alone more than $5 billion in direct losses, according to one insurer’s analysis of the incident published Wednesday.

Are CrowdStrike's pockets deep enough if they're held financially liable for this?

Astronuc · Jul 24, 2024

renormalize said:

CrowdStrike said that the testing and validation system that approved the bad software update had appeared to function normally for other releases made earlier in the year. But it pledged Wednesday to keep software glitches like last week’s from happening again, and to publicly release a more detailed analysis when it becomes available. The company added that it is developing a new check for its validation system “to guard against this type of problematic content from being deployed in the future.”

It sounds like they need an independent partner, perhaps Microsoft, to do testing, or design and implement a more robust testing method. It seems whatever they did was somewhat unorthodox with respect to drivers in the Windows kernel if I understand Dave Plummer's explain in the video above. Maybe CS should hire Dave!

Engadget explanation - CrowdStrike blames bug that caused worldwide outage on faulty testing software
The faulty update caused an out-of-bounds memory read that triggered an 'unrecoverable exception.'
https://www.engadget.com/crowdstrik...age-on-faulty-testing-software-120057494.html

CrowdStrike has blamed faulty testing software for a buggy update that crashed 8.5 million Windows machines around the world, it wrote in an post incident review (PIR). "Due to a bug in the Content Validator, one of the two [updates] passed validation despite containing problematic data," the company said. It promised a series of new measures to avoid a repeat of the problem.

The problem forced Windows machines into a boot loop, with technicians requiring local access to machines to recover.

To prevent DDoS and other types of attacks, CrowdStrike has a tool called the Falcon Sensor. It ships with content that functions at the kernel level (called Sensor Content) that uses a "Template Type" to define how it defends against threats. If something new comes along, it ships "Rapid Response Content" in the form of "Template Instances."

A Template Type for a new sensor was released on March 5, 2024 and performed as expected. However, on July 19, two new Template Instances were released and one (just 40KB in size) passed validation despite having "problematic data," CrowdStrike said. "When received by the sensor and loaded into the Content Interpreter, [this] resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)."

To prevent a repeat of the incident, CrowdStrike promised to take several measures. First is more thorough testing of Rapid Response content, including local developer testing, content update and rollback testing, stress testing, stability testing and more. It's also adding validation checks and enhancing error handing.
. . . .

Astronuc · Jul 24, 2024

Delta has to manually reset each computer system affected by the mass IT outage
https://news.yahoo.com/news/delta-manually-reset-computer-system-105716336.html

As most airlines began to recover following the technology issues, Delta has struggled to restore operations to full capacity, now resorting to manually resetting each system affected by the meltdown.

In a statement on Monday, Delta said over half of its IT systems globally are Windows-based.

"The CrowdStrike error required Delta's IT teams to manually repair and reboot each of the affected systems, with additional time then needed for applications to synchronize and start communicating with each other," the statement said.

It explained that the airline's crew-tracking-related tool, which ensures flights are fully staffed, has needed manual support to repair.

On Sunday, Delta CEO Ed Bastian said the critical tool could not effectively process the number of changes prompted by the Microsoft Windows operating system's shutdown.

Ouch!

cyboman · Jul 25, 2024

My immediate reaction is there needs to be more players in the space than crowdstrike.

Rive · Jul 25, 2024

My immediate reaction: in gardening/botany it's common knowledge that too many clones in/from the greenhouse is just asking for trouble.

Vanadium 50 · Jul 25, 2024

Too many clones almost wiped out bananas. But the conclusion was that the thing to do was to switch from one clone to another. The problem with the "we need more products" is how you tell people they can't use their preferred product "for the common good".

The problem is that we want two incompatible things - we want instantaneous protection from new attacks, ideally before Day Zero, and we want every bit of code tested for days or weeks before rolling it out. Can't have both.

There is a design feature of Windows (and unix) that makes this worse. Windows has two protection rings: Ring 0, or kernel mode, and Ring 1 (which is actually Ring 3) which is user mode. Ring 3 problems crash the app. Ring 0 problems crash the system. ClownSrtike...er...CloudStrike needs to examine everything that is running in Ring 3, so Windows needs to run it in Ring 0.

However, it only needs read-access. You could, if Windows supported it, run it in Ring 2, giving it read-only access to everything that was running. Then if it crashed, it wouldn't take everything with it. You'd be running unprotected, which is bad, but you'd be running.

DrJohn · Jul 25, 2024

CrowdStrike has now admitted that deploying their software worldwide in one go is a bad idea, and they will now switch to deploying it in stages, so the whole world will not go down in one go.

Gosh what a clever idea. One that I suggested as soon as I read they had deployed it worldwide in one go. Perhaps they will even follow my other suggestion - rent a bunch of machines in the cloud for their own use, and deploy only to that group for testing, before it goes out to everyone else. These machines in the cloud could be as simple as a few test databases and a few test webservers that then go through some tests to check they are all working the same way as last week.

Vanadium 50 · Jul 25, 2024

And if the machines not at the head of the line get infected while they are waiting their turn is that also CloudStrike's fault?

Wrichik Basu · Jul 26, 2024

One person on Twitter said that the issue was due to referencing a non-existent memory address in the underlying C++ code:

CrowdStrike also releases software for Linux and MacOS. However, thanks to CrowdStrike already adopting eBPF on Linux, such issues are much less likely to occur. This article throws some light on this aspect: https://brendangregg.com/blog/2024-07-22/no-more-blue-fridays.html

Microsoft's eBPF support for Windows is not yet production-ready.

Vanadium 50 · Jul 26, 2024

Wrichik Basu said:

Microsoft's eBPF support for Windows is not yet production-ready.

Production ready? Like CrowdStrike was?

Wrichik Basu said:

One person on Twitter said that the issue was due to referencing a non-existent memory address in the underlying C++ code

Normally I comment on "some guy on Twitter" as being a not-so-reliable source, but a) in this case he is right, and b) the vast majority of crashes are from uninitialized or otherwise bad pointers being dereferrenced, so it's a no-brainer.

In this case, there is a pointer that is a base plus an offset, but the base was never initialized, so it points nowhere. When the pointer is attempted to be dereferenecd, it goes boom.

rcgldr · Jul 26, 2024

I'm wondering if this was a "parting gift" from a disgruntled employee leaving the company.

Rive · Jul 26, 2024

They may wish so, but I really doubt that.

Vanadium 50 · Jul 26, 2024

rcgldr said:

I'm wondering if this was a "parting gift" from a disgruntled employee leaving the company.

Never attribute to malice what can be explained by incompetence; the reason for this is that there are limits to malice.

Wrichik Basu · Jul 26, 2024

Vanadium 50 said:

Production ready? Like CrowdStrike was?

Unfortunately, I am not an employee in Microsoft, so can't comment on details.

Astronuc · Aug 5, 2024

https://www.yahoo.com/news/crowdstrike-outage-latest-signal-united-160300130.html

CrowdStrike says it’s not to blame for the recent flight chaos in the U.S.

The cybersecurity firm said Sunday it had minimal legal liability over the disruption in mid-July.

Over the weekend, CrowdStrike reiterated its apology.

But in a letter from a lawyer, it also said it was disappointed by any allegation that it was negligent.

The firm says it reached out to Delta to offer assistance when the outage occurred, but never heard back.

Now it says the carrier should explain why it turned down free onsite help, and why rival airlines were able to get their systems back online much more quickly.

Don't blame the customer for one's negligence.

Rive · Aug 5, 2024

Astronuc said:

The firm says it reached out to Delta to offer assistance when the outage occurred, but never heard back.

They sent an email, I guess?

CrowdStrike says it’s not to blame for ...
had minimal legal liability ...

Those statements are not even on the same axis o0)

Vanadium 50 · Aug 5, 2024

Astronuc said:

Don't blame the customer for one's negligence.

"It's your fault, customer. You should have picked a reputable vendor."
Not exactly a winning argument.

glappkaeft · Aug 5, 2024

Wrichik Basu said:

Unfortunately, I am not an employee in Microsoft, so can't comment on details.

Why would that help? Crowdstrike is a not a Microsoft product. It is a separate publicly traded company.

cyboman · Sep 1, 2024

I'd add it's about Darwinism as it applies to software and IT. You need a diverse competitive industry that is pushing for a better product. And thus, providing checks and pressures on monopolistic players. You need a village of smart people/companies engaged, competing in a free market, which will create the redundancy and competition that ends up creating a robust and competitive front, to that product landscape. It's all about competition, a free market, and battling the compliancy that comes with being comfortable and monopolistic in any industry. Competition and a fair, unbiased, ruthless, landscape that is always checking itself which predicates compelling and exceptional outcomes.

Crowdstrike debacle for Microsoft Windows Update

Similar threads

Hot Threads

Is AI Overhyped?

How to disable AI responses in Google Searches?

More on Distributing High Quality Audio

Looking For Ideas for a Hackathon: 'AI-Driven Diagnostic Efficiency & Solution'

On Progress Toward AGI

Recent Insights

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers