Troubleshooting Linux Kernel Issues with Older CPUs

In summary: The older kernel (4.18.0-425.3.1.el8) works fine, while 4.18.0-425.10.1.el8_7 works on only two of the five machines. On two others, the newer kernel (4.18.0-425.10.1.el8_7) reports "CPU Stall", with nothing obvious in the logs. One machine lost its LOM after booting 10.1 and then 3.1; booting into the BIOS and back out fixed it. Fedora 37 is being tried on an old Ivy Bridge machine, which is doing well.
  • #1
Vanadium 50
I am not sure what the next step is: Linux kernel 4.18.0-425.3.1.el8.x86_64 works fine. 4.18.0-425.10.1.el8_7.x86_64 works on two of my five machines. On the other two, I get the screen message "CPU Stall" but nothing obvious in the log files.
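A quick way to confirm that the logs really hold nothing is to filter the kernel messages from the failed boot for the usual stall and lockup wording. This is only a sketch: the patterns are assumptions about how such messages are typically worded, and it assumes the journal is persistent so that the previous boot is still available.

```python
import re

# Patterns are assumptions based on common kernel stall/lockup wording;
# adjust them to match whatever the console actually showed.
STALL_PATTERNS = [
    re.compile(r"soft lockup", re.IGNORECASE),
    re.compile(r"hard lockup", re.IGNORECASE),
    re.compile(r"rcu.*stall", re.IGNORECASE),
    re.compile(r"CPU#?\d+ stuck", re.IGNORECASE),
]


def find_stall_lines(log_text):
    """Return (line_number, line) pairs that look like stall or lockup reports."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in STALL_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

To use it, save the output of `journalctl -k -b -1` (kernel messages from the previous boot) to a file and pass the file's text to `find_stall_lines`. An empty result is consistent with the stall happening before the journal could write anything.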

The troubled machines have older CPUs (Haswell) and do not run dkms.

My thinking was that maybe the newer kernel was built with incompatible options, and I should just whip out a recompiled one. Let's just say that this is not a fast plan.

Thoughts? I don't really want to go down the path of trying a hundred different things, each with a 1% chance of success.
 
  • #2
Vanadium 50 said:
Thoughts?
  • You are running a distribution kernel, probably best to follow up through the distro's support channel.
  • 4.18 is not a long term support mainline kernel version (4.19 is but it is quite new), but may be for your distro (what distro is it?)
  • I don't have any Haswell hardware live, but as it happens I have an old box sitting right next to me I found at the bottom of a pile last week. I'll fire it up and see how it likes Fedora.
  • [Update] My old box is already running 5.4.0-60-generic under Mint 20. It's actually older than Haswell (i5-3470). Untouched since January 2021 it seems. It looks mighty fine on my new 2560x1440 screen, I might try plugging in the other one :)
 
  • #3
I'm not sure there is a support channel for Rocky. Rocky is a replacement for Centos, which was a community supported version of RHEL. RedHat decided they didn't want to do that any more, so they turned Centos into something more Fedora-like.

My theory is that they changed the gcc flags for 10.1, possibly unintentionally. Most likely from x86_64 to broadwell or later. Then the CPU "stall" is actually a hang from an unrecognized instruction.
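If the theory is right, the affected CPUs should lack at least one instruction-set flag that a Broadwell-targeted build would assume. A minimal sketch of that check follows, assuming the relevant difference is the ADX, RDSEED and PREFETCHW support that gcc's broadwell target adds over haswell. The flag names are the ones /proc/cpuinfo uses; the list is illustrative, not exhaustive.

```python
# Flags that a Broadwell target adds over Haswell, as named in /proc/cpuinfo.
# Illustrative assumption only; not a complete list.
BROADWELL_ADDITIONS = {"adx", "rdseed", "3dnowprefetch"}


def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, wanted=BROADWELL_ADDITIONS):
    """Return the wanted flags that this CPU does not advertise, sorted."""
    return sorted(set(wanted) - cpu_flags(cpuinfo_text))
```

Feeding it the contents of /proc/cpuinfo on a stalling machine and on a working one would show whether the two groups actually differ in these flags. If they do not, the compiler-target theory loses some of its appeal.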

I think dkms is unlikely to be the culprit, but since it needs kernel-source, it's not impossible that it forces a rebuild of something originally compiled with the wrong flags. But I list it for completeness.

The other odd thing is that if I boot 10.1, and then 3.1 (the previous version), one machine loses its LOM. I think it's some version of an RT 8111. It shows up in lspci, but nmcli won't turn it on. Booting to the BIOS and back out fixed it. I don't know if this happens every time.

I am wondering if the right move is to wait for the next kernel update and hope it undoes whatever went wonky with this one.
 
  • #4
Vanadium 50 said:
I'm not sure there is a support channel for Rocky.
https://forums.rockylinux.org/

Looks like a connected issue: https://forums.rockylinux.org/t/latest-update-8-7-4-18-0-425-10-1-e18-7-x86-64-soft-lockup/8570

Vanadium 50 said:
I am wondering if the right move is to wait for the next kernel update and hope it undoes whatever went wonky with this one.
Perhaps, although flagging the issue might help prevent them overlooking the regression, if that's what it is.

You could also try switching to Fedora: I can't see much point in using a downstream derivative (Rocky) of a downstream derivative (RHEL) of Fedora if the end result is less stability rather than more!

Now about to try Fedora 37 on this 10-year-old Ivy Bridge hardware.
 
  • #5
pbuk said:
It looks mighty fine on my new 2560x1440 screen
Are you in an Emergency Operations Center maybe, or maybe "someplace" in Colorado?

220px-NORADCommandCenter.jpg

https://en.wikipedia.org/wiki/Cheyenne_Mountain_Complex
 
  • #6
berkeman said:
Are you in an Emergency Operations Center maybe
Yes, our spare room has fulfilled that function for nearly three years now!

Still can't get a screen to work on the DVI port, but otherwise
1673915847121.png
 
  • #7
Thanks!

This sounds plausible. I'm kind of leaning to cuda (via kmod) as the culprit, as the VGA is also different. The errors look a little different, but that could just be sequencing.

Why not Fedora? My home systems look like my work systems. I did do Fedora for a while, but stuff kept breaking when I moved from one to the other. Same reason I am still on the 8.x branch and not the 9.x branch.
 
  • #8
Vanadium 50 said:
This sounds plausible. I'm kind of leaning to cuda (via kmod) as the culprit, as the VGA is also different. The errors look a little different, but that could just be sequencing.
Sounds likely, I have no experience of cuda without dkms (and it may well be that this is not in Rocky's regression tests). Leads to the obvious question: why kmod over dkms? Probably because:
Vanadium 50 said:
Why not Fedora? My home systems look like my work systems.
How about registering for a free RHEL Developer license? https://developers.redhat.com/articles/faqs-no-cost-red-hat-enterprise-linux

Edit: both screens now working (the graphics card didn't like DVI -> Display Port but DVI -> HDMI is fine).

Screenshot from 2023-01-17 09-51-50.png
 
  • #9
If you remember, for a long time dkms was flaky. Before that, nvidia wanted us building the drivers ourselves every time there was a new kernel (which was even flakier).

Really, I don't want anything complicated. Run work stuff at home, and for any non-work stuff, it really doesn't matter what distro I use. At work they use RedHat. Fine. So I run Rocky. Before that, Centos. Before that, Scientific Linux. Before that, White Box. Obviously IBM hates this model. I don't understand why: it's not like there is money to be squeezed out from people who are running a few computers with obsolete hardware.

Heck, I am still running one AMD Bulldozer on a system that Just Won't Die. Hotter than heck, and slow by today's standards, but keeps chugging along. I'm not going to pay a support contract for this, and IBM doesn't want to support it either. Just let it run until the CPU fan goes, at which time a whole new CPU is cheaper than a replacement fan.
 
  • #10
If you want a cheap server-class Linux machine upgrade, the used HP machines on the market are great with a GT 1030 as the video card. A little noisy for a home office without fan hacks, but solid as a tank with full redundancy.

HP ProLiant DL360p Gen8 1U RackMount 64-bit Server with 2×6-Core E5-2640 Xeon 2.5GHz CPUs + 64GB PC3-10600R RAM + 8×300GB 10K SAS SFF HDD, P420i RAID, 4×GigaBit NIC, 2×Power Supplies, NO OS

Under $300.

I run https://www.devuan.org/ which is a port of Debian without systemd. Very compatible.
 
  • #11
You are right, there are some great buys out there if you don't mind noise and age. A while back a number of 28C56T servers were on the market - great if you could use them, but keeping this many threads busy is not trivial.

My problem is the opposite - I don't want to throw anything out that still works. Hence the Bulldozer. Twelve years and still going.
 
  • #12
Vanadium 50 said:
You are right, there are some great buys out there if you don't mind noise and age. A while back a number of 28C56T servers were on the market - great if you could use them, but keeping this many threads busy is not trivial.

My problem is the opposite - I don't want to throw anything out that still works. Hence the Bulldozer. Twelve years and still going.
I don't throw anything out, I just keep building more storage sheds on the property. The fan noise is a problem solved by remoting the actual server from the workstation space.
[Image]

Last summer during the local heat wave it sounded like a 747 takeoff when the room temps were in the 90s, but the HP server's iLO monitor never came close to the sensor warning limits.

Yes, that's an old DEC disk rack. Another one is somewhere in storage. o0)
 
  • #13
Yeah...I'm not that hard core. (No pun intended)
 
  • #14
The problem is back.

4.18.0-425.13.1.el8_7.x86_64 works.
4.18.0-425.19.2.el8_7.x86_64 fails with a watchdog/CPU stuck error (slightly different from the original error)
4.18.0-425.10.1.el8_7.x86_64 is also problematic.
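Putting the reported kernels in release order makes the pattern easier to see. The sketch below sorts them with a simplified key that uses only the numeric fields before ".el"; this is not a full RPM version comparison. The statuses are the ones reported in this thread.

```python
import re


def release_key(kernel):
    """Turn '4.18.0-425.13.1.el8_7.x86_64' into a sortable tuple of integers.

    Only the numeric fields before '.el' are used. This is a simplification,
    not a full RPM version comparison.
    """
    numeric_part = kernel.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", numeric_part))


# Statuses as reported in this thread for the affected machines.
observed = {
    "4.18.0-425.3.1.el8.x86_64": "works",
    "4.18.0-425.10.1.el8_7.x86_64": "stalls",
    "4.18.0-425.13.1.el8_7.x86_64": "works",
    "4.18.0-425.19.2.el8_7.x86_64": "stalls",
}

for kernel in sorted(observed, key=release_key):
    print(f"{kernel:32} {observed[kernel]}")
```

The alternation between working and stalling releases is what makes a single, persistent regression look less likely than something that varies from build to build.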

What does the Rocky forum say? No answer. A few people have had similar issues, one with USB, two with ZFS and one with nvidia/cuda. The one common factor seems to be dkms, so I tried to boot 19 without it: I dropped everything that uses dkms and tried again, but it didn't change anything.

Also, the problem now affects Bulldozers as well as the Haswells.

The interwebs suggests changing the timeout from 20 to 30 seconds, which seems to me to be an improbable fix.
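For reference, and assuming the 20-second figure is the soft-lockup threshold: the kernel documentation describes that threshold as twice the `kernel.watchdog_thresh` sysctl, whose default is 10. Going from 20 to 30 seconds would then mean raising `watchdog_thresh` from 10 to 15. A small sketch for reading the current value:

```python
from pathlib import Path

WATCHDOG_THRESH = Path("/proc/sys/kernel/watchdog_thresh")


def soft_lockup_seconds(thresh_text):
    """Soft-lockup reporting threshold in seconds: twice watchdog_thresh."""
    return 2 * int(thresh_text.strip())


def current_soft_lockup_seconds(path=WATCHDOG_THRESH):
    """Read the live value; return None if the file is missing or unreadable."""
    try:
        return soft_lockup_seconds(path.read_text())
    except (OSError, ValueError):
        return None
```

Raising the threshold only changes when the watchdog complains. It would not help if a CPU is genuinely stuck, which fits the suspicion that this is an improbable fix.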

If I get a CPU stuck message and do a warm boot into a known good kernel, the network card is unrecognized. I am not sure if this is significant. If I power cycle, then it's OK.

What I think I'd like to do is to somehow tell what flags the kernels were compiled under. While building a kernel from source is a little adventurous, I'd like to at least see if there is something obvious in the flags.
 
  • #15
How about a bad driver for the network card? It sounds like the network card locked up, requiring a hard reboot.

Or it could be a crash of something else that happens to trash-talk the network card.

Much work, little utility other than satisfying your curiosity, but you could try:

1. booting to a working distro,
2. setting up a virtual machine,
3. and booting the problem distro into the virtual while tracing and logging the boot process.
 
  • #16
Tom.G said:
How about a bad driver for the network card?
Three different ones. Two from Qualcomm, one from Intel.
Tom.G said:
satisfying your curiosity
This is low on my priority list.

What I would like is for yum (OK, dnf) to just work. Failing that, I'd like a procedure that would get me a working kernel. Failing that, I'd like a way of testing a kernel to see if it would work without risking a stall.

Seeing if the kernels were compiled with bad flags, particularly CPU flags, seems like a good first step.
 
  • #17
So 8.8 was released and all the machines are happy with their kernels. And while my curiosity is falling now that I no longer have a problem, I still suspect the kernel updates were compiled with the wrong CPU type (-march). I wish there were some way to check this.
 
  • #18
After some more spelunking, I found some things:
  • Flags for building the kernel are in /boot. Why boot and not the kernel source? No idea.
  • There are about 8000 of them.
  • Maybe 50 (!) are different between kernel minor (!) revisions
  • They look more like cmake or similar flags than gcc flags.
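The files in /boot are the kernel build configuration (on RHEL-like systems they are named config-<kernel release>), which holds Kconfig options rather than raw gcc flags. That would explain why they do not look like compiler switches. A minimal sketch for listing the options that differ between two such files:

```python
def parse_config(text):
    """Map option name -> value for the text of a kernel config file.

    Lines like '# CONFIG_FOO is not set' are recorded with the value 'n'.
    """
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            options[name] = value
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for options that differ."""
    old, new = parse_config(old_text), parse_config(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }
```

To use it, read a working and a failing kernel's config file from /boot and pass both texts to `diff_configs`. With roughly 50 differences out of about 8000 options, the result should be short enough to read by eye, starting with anything in the processor-family group.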
 
  • #19
New kernel released today. Zero problems yet.

I'm hoping it was a) a build issue that b) is fixed now.
 
  • #20
Vanadium 50 said:
New kernel released today. Zero problems yet.

I'm hoping it was a) a build issue that b) is fixed now.
Ah yes, Hope! That non-performing thing that springs eternal! :wink:
 
  • #21
What's the spring constant on that?
 
  • #22
Don't know, but it's gotta be either 0 or ∞.

(I wonder why that ∞ symbol is so small for such a large concept. :oldconfused:)
 
  • #23
Vanadium 50 said:
Maybe 50 (!) are different between kernel minor (!) revisions
In the newest working kernel, the flags are identical to the previous (and also working) kernel.
 
  • #25
Um....yeah....we'll go with that.

Linux has a few fundamental flaws:
  1. A base of coders who don't subscribe to "leave well enough alone". Why are they dumping Paul Vixie's cron? Will the new cron cron any better?
  2. A kernel that does everything is fragile. Modules were supposed to help, but it's not very modular.
  3. There is no strong central designer to say "that functionality goes over there, and not in some other place".
 

FAQ: Troubleshooting Linux Kernel Issues with Older CPUs

Can older CPUs cause issues with the Linux kernel?

Yes, older CPUs can sometimes cause compatibility issues with the Linux kernel. This can result in performance degradation, system instability, or even kernel panics.

How can I determine if my older CPU is causing kernel issues?

You can troubleshoot kernel issues related to older CPUs by checking system logs for error messages, monitoring system performance, and running diagnostic tools like stress tests or CPU benchmarks.

Are there specific kernel parameters I can adjust for older CPUs?

Yes, you can adjust kernel parameters such as CPU frequency scaling, power management settings, and CPU-specific optimizations to improve compatibility and performance on older CPUs.

Can updating the Linux kernel resolve issues with older CPUs?

Updating the Linux kernel can sometimes resolve compatibility issues with older CPUs by introducing new features, bug fixes, and performance improvements that address specific CPU-related issues.

Is it possible to run a customized kernel optimized for older CPUs?

Yes, you can create a customized kernel configuration tailored to older CPUs by enabling specific optimizations, disabling unnecessary features, and fine-tuning performance settings to improve compatibility and performance.
