After today's major outage and bootloop in Windows, we have yet another reason to prefer Linux to Windows.
Vanadium 50 said: "Who says they didn't test it?"

Could it be causing this big a problem if it had been tested reasonably well? Bugs can get through any testing, but it seems strange that this bug took so many computers down and didn't get detected in tests.
Vanadium 50 said: "I am willing to believe it wasn't tested adequately. The mess speaks for itself. I am questioning the claim that it wasn't tested at all."

Yes, I agree. I can't believe that a software company wouldn't organizationally have testing requirements.
Vanadium 50 said: "That some computers were affected and others were not"

I didn't know that.
Vanadium 50 said: "... seems to me to be suggestive and significant."

Right.
FactChecker said: "I didn't know that."

As far as one can tell by asking around, most of the engineering/CAD PCs went down, but only maybe half of the "business desktop" PCs. (Most of the scientists use Linux and so were not impacted.)
Vanadium 50 said: "As far as one can tell by asking around, most of the engineering/CAD PCs went down, but only maybe half of the 'business desktop' PCs. (Most of the scientists use Linux and so were not impacted.)"

I mean... maybe some use CrowdStrike and some don't? Do we actually know whether it did or didn't affect all updated CrowdStrike PCs? Google tells me half of the Fortune 500 use it.
Such machines are configured differently - CAD stations tend to be newer and better provisioned. The "business" machines tend to have less memory, integrated graphics, fewer cores, etc. So maybe there is something there. Maybe it's just coincidence.
Vanadium 50 said: "Who says they didn't test it?"

They uploaded three files full of zeros!
https://finance.yahoo.com/news/crowdstrike-says-bug-quality-control-104409302.html

LONDON (Reuters) - A software bug in CrowdStrike's quality control system caused the software update that crashed computers globally last week, the U.S. firm said on Wednesday, as losses mount following the outage which disrupted services from aviation to banking.
The extent of the damage from the botched update is still being assessed. On Saturday, Microsoft said about 8.5 million Windows devices had been affected, and the U.S. House of Representatives Homeland Security Committee has sent a letter to CrowdStrike CEO George Kurtz asking him to testify.
The financial cost was also starting to come into focus on Wednesday. Insurer Parametrix said U.S. Fortune 500 companies, excluding Microsoft, will face $5.4 billion in losses as a result of the outage, and Malaysia's digital minister called on CrowdStrike and Microsoft to consider compensating affected companies.
The outage happened because CrowdStrike's Falcon Sensor, an advanced platform that protects systems from malicious software and hackers, contained a fault that forced computers running Microsoft's Windows operating system to crash and show the "Blue Screen of Death".
"Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data," CrowdStrike said in a statement, referring to the failure of an internal quality control mechanism that allowed the problematic data to slip through the company's own safety checks.
DrJohn said: "Do you think it was tested properly?"

Clearly, it wasn't properly tested. WHQL? P-code? Boot-start driver?
Astronuc said: "Clearly, it wasn't properly tested. WHQL? P-code? Boot-start driver?"

It sounds like there is a conflict of requirements between rapid security patches and thorough testing, and that CrowdStrike bypassed the standard test process for speed.
FactChecker said: "It sounds like there is a conflict of requirements between rapid security patches and thorough testing, and that CrowdStrike bypassed the standard test process for speed."

From https://www.cnn.com/2024/07/24/tech/crowdstrike-outage-cost-cause/index.html:

"CrowdStrike said that the testing and validation system that approved the bad software update had appeared to function normally for other releases made earlier in the year. But it pledged Wednesday to keep software glitches like last week's from happening again, and to publicly release a more detailed analysis when it becomes available. The company added that it is developing a new check for its validation system 'to guard against this type of problematic content from being deployed in the future.'"
That being said, it also sounds like CrowdStrike had done shockingly little testing, if any.
renormalize said (quoting CNN): "CrowdStrike said that the testing and validation system that approved the bad software update had appeared to function normally for other releases made earlier in the year. But it pledged Wednesday to keep software glitches like last week's from happening again, and to publicly release a more detailed analysis when it becomes available. The company added that it is developing a new check for its validation system 'to guard against this type of problematic content from being deployed in the future.'"

It sounds like they need an independent partner, perhaps Microsoft, to do the testing, or to design and implement a more robust testing method. It seems whatever they did was somewhat unorthodox with respect to drivers in the Windows kernel, if I understand Dave Plummer's explanation in the video above. Maybe CS should hire Dave!
CrowdStrike has blamed faulty testing software for a buggy update that crashed 8.5 million Windows machines around the world, it wrote in a post-incident review (PIR). "Due to a bug in the Content Validator, one of the two [updates] passed validation despite containing problematic data," the company said. It promised a series of new measures to avoid a repeat of the problem.
The problem forced Windows machines into a boot loop, with technicians requiring local access to machines to recover.
To prevent DDoS and other types of attacks, CrowdStrike has a tool called the Falcon Sensor. It ships with content that functions at the kernel level (called Sensor Content) that uses a "Template Type" to define how it defends against threats. If something new comes along, it ships "Rapid Response Content" in the form of "Template Instances."
A Template Type for a new sensor was released on March 5, 2024 and performed as expected. However, on July 19, two new Template Instances were released and one (just 40KB in size) passed validation despite having "problematic data," CrowdStrike said. "When received by the sensor and loaded into the Content Interpreter, [this] resulted in an out-of-bounds memory read triggering an exception. This unexpected exception could not be gracefully handled, resulting in a Windows operating system crash (BSOD)."
To prevent a repeat of the incident, CrowdStrike promised to take several measures. First is more thorough testing of Rapid Response Content, including local developer testing, content update and rollback testing, stress testing, stability testing and more. It's also adding validation checks and enhancing error handling.
. . . .
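For readers wondering how a bad content file can take down an entire operating system, here is a minimal, hypothetical sketch of the failure class described above. The struct and function names are invented for illustration and are not CrowdStrike's actual code; the point is a parser that trusts a count field in its input. If the file claims more entries than it actually contains, the loop reads past the end of the buffer, the out-of-bounds read the PIR describes. In a user-mode program that is undefined behaviour and typically crashes just that process; the same access pattern inside a boot-start kernel driver brings down Windows with a BSOD.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical channel-file layout: a small header followed by fixed-size
// entries. These names are illustrative only, not CrowdStrike's real format.
struct ContentHeader {
    uint32_t entry_count;   // number of 4-byte entries the file claims to hold
};

// Buggy parser: it trusts entry_count and never checks it against the
// actual file size, so a short (or all-zero) file causes reads past the
// end of the buffer.
uint32_t checksum_entries(const std::vector<uint8_t>& file) {
    const auto* hdr = reinterpret_cast<const ContentHeader*>(file.data());
    const auto* entries =
        reinterpret_cast<const uint32_t*>(file.data() + sizeof(ContentHeader));

    // Missing safety check, e.g.:
    //   if (sizeof(ContentHeader) + hdr->entry_count * 4u > file.size()) reject;
    uint32_t sum = 0;
    for (uint32_t i = 0; i < hdr->entry_count; ++i) {
        sum ^= entries[i];   // out-of-bounds read once i exceeds the real data
    }
    return sum;
}

int main() {
    // A file whose header claims 100,000 entries but whose body is only a
    // few zero bytes, loosely analogous to the reported "file full of zeros".
    std::vector<uint8_t> bad_file(sizeof(ContentHeader) + 4, 0);
    reinterpret_cast<ContentHeader*>(bad_file.data())->entry_count = 100000;

    // Undefined behaviour: in user space this may crash with an access
    // violation; in a kernel driver the same fault is fatal (BSOD).
    std::cout << checksum_entries(bad_file) << "\n";
    return 0;
}
```

The measures quoted above (more testing, extra validation checks, better error handling) amount to catching exactly this kind of mismatch between what the data claims and what was actually received, before the loop ever runs.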
Ouch!

As most airlines began to recover following the technology issues, Delta has struggled to restore operations to full capacity, now resorting to manually resetting each system affected by the meltdown.
In a statement on Monday, Delta said over half of its IT systems globally are Windows-based.
"The CrowdStrike error required Delta's IT teams to manually repair and reboot each of the affected systems, with additional time then needed for applications to synchronize and start communicating with each other," the statement said.
It explained that the airline's crew-tracking-related tool, which ensures flights are fully staffed, has needed manual support to repair.
On Sunday, Delta CEO Ed Bastian said the critical tool could not effectively process the number of changes prompted by the Microsoft Windows operating system's shutdown.
Wrichik Basu said: "Microsoft's eBPF support for Windows is not yet production-ready."

Production ready? Like CrowdStrike was?
Wrichik Basu said: "One person on Twitter said that the issue was due to referencing a non-existent memory address in the underlying C++ code"

Normally I comment on "some guy on Twitter" as being a not-so-reliable source, but a) in this case he is right, and b) the vast majority of crashes are from uninitialized or otherwise bad pointers being dereferenced, so it's a no-brainer.
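As a minimal illustration of that failure class (again hypothetical code, not the Falcon driver itself), dereferencing a pointer that was never set to valid memory is the simplest way to touch a non-existent address:

```cpp
#include <iostream>

struct Rule {
    int severity;
};

// Hypothetical lookup that can fail and return a null pointer.
Rule* find_rule(int id) {
    return nullptr;   // pretend the rule with this id does not exist
}

int main() {
    Rule* rule = find_rule(42);
    // Missing null check: this dereferences a non-existent address.
    // A user-mode process gets an access violation or segfault here;
    // a kernel-mode driver doing the same thing blue-screens the machine.
    std::cout << rule->severity << "\n";
    return 0;
}
```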
rcgldr said: "I'm wondering if this was a 'parting gift' from a disgruntled employee leaving the company."

Never attribute to malice what can be explained by incompetence; the reason for this is that there are limits to malice.
Vanadium 50 said: "Production ready? Like CrowdStrike was?"

Unfortunately, I am not an employee of Microsoft, so I can't comment on details.
CrowdStrike says it's not to blame for the recent flight chaos in the U.S.

The cybersecurity firm said Sunday it had minimal legal liability over the disruption in mid-July.

Over the weekend, CrowdStrike reiterated its apology.

But in a letter from a lawyer, it also said it was disappointed by any allegation that it was negligent.

The firm says it reached out to Delta to offer assistance when the outage occurred, but never heard back.

Now it says the carrier should explain why it turned down free onsite help, and why rival airlines were able to get their systems back online much more quickly.

Don't blame the customer for one's negligence.
Astronuc said: "The firm says it reached out to Delta to offer assistance when the outage occurred, but never heard back."

They sent an email, I guess?
Astronuc said: "Don't blame the customer for one's negligence."

"CrowdStrike says it's not to blame for ..."
"had minimal legal liability ..."

Those statements are not even on the same axis.
Wrichik Basu said: "Unfortunately, I am not an employee of Microsoft, so I can't comment on details."

Why would that help? CrowdStrike is not a Microsoft product. It is a separate publicly traded company.