Unusual Bug: 40 Years of Programming and a Read() Failure

Chris Miller · Dec 18, 2018

Been tracking down the most unusual bug I've seen in 40 years of programming (mostly in assembler and C). It manifests in the test suite for an object-oriented XBase language implementation that's been ported everywhere from Palm Pilot to OS/2 to the Blackberry tablet to Windows to, now, Qnx7 (from Qnx4, from Qnx2).

I've tracked it down to a read() failure. A read(int fd, char *buffer, int bytes) call returns bytes with errno=0, but the wrong bytes are in the buffer (with some not overwritten at all). A second read() into a different buffer from the same file offset returns the correct ones. It sometimes only manifests after several passes through the test suite. A hundred passes is always enough. Running concurrently with another process pounding the file system seems to reduce, if not eliminate, its occurrence.
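For readers who want to reproduce that kind of check, here is a minimal sketch of the "read it again into a different buffer and compare" test. The name read_and_verify and its return convention are made up for illustration; the thread does not show the actual test code, and this assumes a seekable file and a POSIX-style read()/lseek().

```c
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical diagnostic helper (not from the original code base).
 * Reads nbytes at `offset` into `buffer`, then re-reads the same range
 * into a scratch block and compares the two.
 * Returns 1 if both reads agree, 0 if they differ,
 * and -1 on error or short read.
 */
int read_and_verify(int fd, void *buffer, size_t nbytes, off_t offset)
{
    int result = -1;
    unsigned char *scratch = malloc(nbytes > 0 ? nbytes : 1);

    if (scratch == NULL)
        return -1;

    if (lseek(fd, offset, SEEK_SET) == offset &&
        read(fd, buffer, nbytes) == (ssize_t)nbytes &&
        lseek(fd, offset, SEEK_SET) == offset &&
        read(fd, scratch, nbytes) == (ssize_t)nbytes)
        result = (memcmp(buffer, scratch, nbytes) == 0);

    free(scratch);
    return result;
}
```

A return of 0 would correspond to the symptom described above: the call reports success, yet the two buffers disagree. The file position afterwards is offset + nbytes, the same as after a single successful read.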

But the weirdest thing about it for me is what "fixes" it. Instead of reading n bytes directly into the buffer passed to my calling/cover function, I malloc a pointer to n bytes, read into that, memcpy these n bytes into the buffer passed to my cover function, and free the malloced pointer. Then all tests work perfectly. To me, this is a complete no-op, and seems to rule out memory corruption and compiler errors (which I've tested for to a fair extent anyway). I'd be curious to hear anyone's thoughts on this.
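The workaround described above could look roughly like the sketch below. The function name bounce_read is hypothetical, and the real cover function is not shown in the thread, so treat this as an illustration of the malloc / read / memcpy / free sequence rather than the actual code.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical cover function with the same contract as read().
 * The OS writes into a temporary heap block; the bytes are then
 * copied into the caller's buffer with memcpy().
 */
ssize_t bounce_read(int fd, void *buffer, size_t nbytes)
{
    void *temp = malloc(nbytes > 0 ? nbytes : 1);
    if (temp == NULL) {
        errno = ENOMEM;
        return -1;
    }

    ssize_t got = read(fd, temp, nbytes);
    int saved_errno = errno;      /* keep read()'s errno across free() */

    if (got > 0)
        memcpy(buffer, temp, (size_t)got);

    free(temp);
    errno = saved_errno;
    return got;
}
```

Logically this is equivalent to calling read(fd, buffer, nbytes) directly. The only differences are which memory the OS itself writes into, and that the final store into the caller's buffer is done by the CPU in user space.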

jedishrfu · Dec 18, 2018

Interesting bug. It's as if the buffer location gets corrupted before the read. Perhaps it's an optimizing-compiler thing, where your cover function hasn't referenced the buffer until the read.

You could try writing a zero byte into the buffer prior to the read and see if that hack "fixes" the issue.

C:

buffer[0] = 0;   // or whatever value you want, in case you want to verify that it got overwritten
read(ifd, buffer, nbytes);

One other thought is whether the number of bytes has been corrupted or zeroed.

Perhaps you could use a debugger to see what the argument values are at the read() call.

I had an odd case like this once with FORTRAN where the file descriptor in the read couldn't be of type array of int. It had to be a literal integer like 05 or something. It was tracked down to being a compiler issue at the time where it used a CPU addressing register to hold the value.

Filip Larsen · Dec 18, 2018

Without your fix, where is the buffer then allocated? Is there anyone else possibly tampering with that buffer or the memory block it's in?

jedishrfu · Dec 18, 2018

Good point, if this is a multi-threaded program.

Chris Miller · Dec 18, 2018

Thanks for your thoughts all. It's not a multi-threaded program.
@jedishrfu : I've tried the zero-out-buffer first, to no effect. Also played with various optimizations, e.g., size, speed, none. No effect. Only malloc, read, memcpy "fixes." I've also checked the file pos, and dumped the reads. In the bad read, buffer bytes begin at the wrong file pos, contain unwritten patches (still all zeros) but end correctly. The return from read is the bytes requested.

@Filip Larsen: the buffer's allocated as part of an internal cache, which I clobber (memcpy) into anyway, exactly as if I'd read the bytes into it. But only when I read the bytes directly into it do the wheels come off. Qnx is pretty good about protecting memory, and stepping into others' space causes a SIGSEGV abort. The test program is single threaded and no one else could touch this buffer.

I've been coding long enough to still be pretty sure it's my bad, am still looking, I just don't see how it's possible.

jedishrfu · Dec 18, 2018

My suggestion then is to use a debugger.

I once had an issue with a C program where the library I was using was not compatible with the settings I compiled with, specifically the placement of doubles and floats on multiples-of-8 boundaries. I would pass in the data spaced at 4-byte or even 1-byte spacing (my settings), but the function I called (previously compiled by someone else) expected it at 8-byte spacing, and so it got things totally wrong.

It took using a debugger like gdb to find the gotcha.
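To make the alignment mismatch concrete, here is a small sketch comparing the default layout of a struct with a 1-byte-packed layout of the same members. The struct names are invented for illustration, and the exact default offsets depend on the compiler and target. #pragma pack(push, 1) is accepted by GCC, Clang and MSVC, but it is a compiler extension rather than standard C.

```c
#include <stddef.h>
#include <stdio.h>

/* Default layout: the compiler pads `value` out to the alignment
   it uses for double on this target. */
struct natural_rec {
    char   tag;
    double value;
};

/* Same members, packed on 1-byte boundaries. */
#pragma pack(push, 1)
struct packed_rec {
    char   tag;
    double value;
};
#pragma pack(pop)

/* Print where `value` sits in each layout. Code built for one layout
   that is handed the other would look for `value` at the wrong offset. */
void print_layouts(void)
{
    printf("natural: offset %zu, size %zu\n",
           offsetof(struct natural_rec, value), sizeof(struct natural_rec));
    printf("packed:  offset %zu, size %zu\n",
           offsetof(struct packed_rec, value), sizeof(struct packed_rec));
}
```

If a caller and a precompiled library disagree about which of these two layouts is in use, every field after the first lands at a different address than the callee expects.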

Memory alignment:

https://en.wikipedia.org/wiki/Data_structure_alignment

Here's a related although different example with structs in C:

https://www.geeksforgeeks.org/structure-member-alignment-padding-and-data-packing/

And here's a NASM tutorial I found that while not completely relevant is pretty cool:

http://cs.lmu.edu/~ray/notes/nasmtutorial/

Klystron · Dec 18, 2018

Curious what resources are assigned to the pointer. Your fix appears to zero the pointer then 'force' the correct address of the I/O buffer. The malloc section seems like smart coding given the cross-platform nature of your system. Is brevity a primary requirement?

Had a similar problem porting FORTRAN coded sub-routines into a mixed language and platform environment. Back then the big issue was cross-platform "big-endian, little-endian" problems along with problematic array declarations and dimensions. Well-behaved pointers seemed to shift or offset from their expected values, munging array values. IIRC, initializing the pointers and buffer arrays mitigated the problem (at the expense of additional cycles).

[The number of read function calls affecting the fault tweaks my memory -- how the OS stores i/o interrupts. Clue: OP states problem disappears during heavy file access. ]

Chris Miller · Dec 18, 2018

@jedishrfu : I pack all internal structures on 1-byte boundaries using #pragmas. Keep in mind, the tests can run dozens, even hundreds, of times successfully before the fail. And my "fix" (which does indeed fix the bug) would not address any of these issues. Debuggers typically aren't useful for these types of things, where the problem manifests indirectly long after the actual corruption. I'll check out the links though, thanks.

@Klystron: Resources? Only calloc()ed memory. There've been a few 32-bit to 64-bit porting issues, mostly all addressed under Windows. My no-op "fix" wouldn't have helped those Fortran bugs as I understand them.

Klystron · Dec 18, 2018

Chris Miller said:

@jedishrfu : I pack all internal structures on 1-byte boundaries using #pragmas. Keep in mind, the tests can run dozens, even hundreds, of times successfully before the fail. ...
@Klystron: Resources? Only calloc()ed memory. There've been a few 32-bit to 64-bit porting issues, mostly all addressed under Windows. ...

Run-time reads. Character data. Problem manifests during 'light' file I/O but not during heavy I/O. ...library interface suspect?

Blackberry QNX well documented https://www.qnx.com/developers/docs/7.0.0/#com.qnx.doc.ide.userguide/topic/leak_selecting_tool.html

jedishrfu · Dec 18, 2018

Okay just don’t discount what we’ve suggested because clearly the bug is where you aren’t looking. Using a debugger helps to give you confidence that that’s not the problem.

I’ve had programs which work fine in development with intermediate print scaffolding but once they are removed things go south fast.

Tom.G · Dec 19, 2018

If it is running under an operating system on a multi-core processor, try forcing it to a particular core.

I once ran across a situation of a disk utility running under Windows on a dual-core processor that would go absolutely crazy if it wasn't restricted to one core. I never did find out the details, but it appeared that the OS would occasionally switch which core was running the program, while the program assumed it was on a single-core CPU.

Klystron · Dec 19, 2018

Additional background info: the current QNX designs support Blackberries as embedded systems in automobiles and RPVs (remote pilot vehicles). Consider the increased probability of run-time mismatches in this environment. Difficult for OP to prove that test runs in debug mode reliably test run-time conditions. As debugger increases I/O it could hide or mask the original problem which sporadically reoccurs after rebuild.

As mentor stated, this should be a common programming problem with this type of system suggesting IMO careful attention to the build.

Chris Miller · Dec 20, 2018

I really appreciate all the feedback here. This is the first bug in over 40 years of coding in numerous languages on at least a dozen OSes I think I'm going to walk away from, especially since my no-op fix, which makes no sense, appears to correct it. Clearly, something is making read() lose its mind. For testing, I've placed all my mallocs into a linked list that I can, and do, check for any corruption, and to ensure read() is reading into a valid buffer space, which it is. I've even logged all mallocs, frees and I/O into an array of structures that I've replayed in a small test program, but that does not reproduce the problem.
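A rough sketch of what such an allocation-tracking list might look like follows. The names (tracked_calloc, tracked_count_corrupt, tracked_contains), the guard pattern and the guard length are all assumptions made for illustration, and freeing is omitted for brevity; the actual instrumentation is not shown in the thread.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_LEN  8
#define GUARD_BYTE 0xA5

/* One list node per live allocation. */
struct tracked {
    struct tracked *next;
    unsigned char  *block;  /* start of the caller-visible bytes */
    size_t          size;   /* caller-visible size, guard excluded */
};

static struct tracked *tracked_head = NULL;

/* Allocate `size` zeroed bytes followed by a guard pattern. */
void *tracked_calloc(size_t size)
{
    struct tracked *node = malloc(sizeof *node);
    unsigned char *block = calloc(1, size + GUARD_LEN);

    if (node == NULL || block == NULL) {
        free(node);
        free(block);
        return NULL;
    }
    memset(block + size, GUARD_BYTE, GUARD_LEN);
    node->block = block;
    node->size = size;
    node->next = tracked_head;
    tracked_head = node;
    return block;
}

/* Count the blocks whose trailing guard has been overwritten. */
int tracked_count_corrupt(void)
{
    int bad = 0;
    for (const struct tracked *n = tracked_head; n != NULL; n = n->next) {
        for (size_t i = 0; i < GUARD_LEN; i++) {
            if (n->block[n->size + i] != GUARD_BYTE) {
                bad++;
                break;   /* one hit is enough for this block */
            }
        }
    }
    return bad;
}

/* Return 1 if [ptr, ptr + len) lies entirely inside one tracked block. */
int tracked_contains(const void *ptr, size_t len)
{
    uintptr_t p = (uintptr_t)ptr;
    for (const struct tracked *n = tracked_head; n != NULL; n = n->next) {
        uintptr_t start = (uintptr_t)n->block;
        if (p >= start && len <= n->size && p - start <= n->size - len)
            return 1;
    }
    return 0;
}
```

Calling tracked_contains(buffer, nbytes) just before each read(), and tracked_count_corrupt() just after, is one way to check both of the properties mentioned above: that the destination is a valid buffer, and that nothing has run past the end of any allocation.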

Because my fix (instead of reading into buffer: malloc temp, read into temp, copy temp to buffer, free temp) wouldn't address memory corruption, and because a 1 second delay between test loop iterations (i.e., closing and reopening the table) also makes the problem go away, I'm going to call it something other than a bug in the program. Either in the Qnx7 OS, or, more likely, our configuration and installation of it.

jedishrfu · Dec 20, 2018

Remember: Do NOT put any comments around your hack because there's no need to worry the youngsters who will come after you to marvel and wonder why you did the hack before they remove it to their chagrin.

I used to humorously tell my corporate students to always remember to leave a bug after they've fixed a bug for future generations to find. A kind of computer ecology proverb.

Another recommendation that I'd mention is to always remove your name from any comments in your code after a promotion because that code will follow you around forever if you don't.

I had one case in GE where I obsoleted a program I wrote to find Tape drive errors based on the batch runs with tape failures. The goal was to find the failing drive. It didn't work well because many jobs used multiple drives enough to ruin the statistical search (maybe deep learning could have helped - not really circa 1980's and not enough data to train on).

I destroyed all the backup decks I had. Except there was one I didn't know about, made by a computer floor supervisor, who came to me to get it working again. I had to help him because he was a great mentor who always helped me get jobs run when I was a teenager in the local Boy Scout Explorer Post.

Chris Miller · Dec 21, 2018

@jedishrfu: There's always one more bug, so no need to worry. Comments schmomments! My first gig out of college was at a Honeywell-Bull shop documenting HVAC apps coded in (ick) COBOL. Boss wrote a program to strip all comments from goto-riddled company source code. Said they only made it harder to understand. My hero.

Found another fix to my weird Qnx7 bug. Adding an fflush(stdout) after an fwrite to stdout (part of screen/terminal handling) fixed the subsequent bizarre read() failure, which makes me think it was caused by a stdout buffer overflow. But this still doesn't explain why the kludge worked, and probably isn't the root of the problem.
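For reference, the second workaround amounts to something like the sketch below. The helper name put_screen is hypothetical and the real terminal-handling code is not shown in the thread. It illustrates only the fwrite-then-fflush pattern and says nothing about why that pattern affects the later read().

```c
#include <stdio.h>

/*
 * Hypothetical terminal-output helper: write the bytes, then flush
 * at once so nothing lingers in the stdio buffer across later raw
 * read() calls. Returns 0 on success, -1 on failure.
 */
int put_screen(const void *data, size_t len)
{
    if (fwrite(data, 1, len, stdout) != len)
        return -1;
    return fflush(stdout) == 0 ? 0 : -1;
}
```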

PeterDonis · Dec 21, 2018

Chris Miller said:

There's always one more bug

Reminds me of an old programming joke:

(1) Every program has at least one bug.

(2) Every program can be reduced by at least one line of code.

(3) Prove by induction: every program can be reduced to a single line that doesn't work.

jedishrfu · Dec 21, 2018

Eventually you’ll figure it out, just not in polynomial time. It’s a lesser version of P vs NP as you saunter through your code trying this hack and that until the DUH moment when you discover you shot yourself in the foot and were the actual root cause of the problem. Hopefully, you or the code will be retired before that time comes.

Chris Miller · Dec 21, 2018

Not so sure, on both points, jedishrfu. I'm not sure I'll ever track this one down, and if I do, it'll be more of a "Seriously!?" than a "Duh" moment. I've shot myself in the foot enough times to know the feeling, and it's not usually hard to ascertain the cause. Got bit in the ass similarly when porting to the Blackberry tablet. Very hard to track down a SIGSEGV (memory violation) occurring in the read(). Turned out something had screwed with the signal tables, mapping a common SIGINT onto the SIGSEGV. And the culprit turned out to be a startup call to one of the 4GL library initialization functions. Sometimes I wish they'd stuck with Qnx 2.

Chris Miller · Dec 21, 2018

@PeterDonis : The two assertions seem mutually exclusive. Unless one assumes a program of zero lines could still contain a bug.

PeterDonis · Dec 21, 2018

Chris Miller said:

The two assertions seem mutually exclusive. Unless one assumes a program of zero lines could still contain a bug.

Yes, that's why it's a joke instead of a serious theorem.

Chris Miller · Dec 23, 2018

Actually, I thought about it some more, and removing the final line of code does leave you with a non-working program. Proof complete.

harborsparrow · Jan 21, 2019

There are so many possibilities; many good suggestions above. When you document your hack, you can put a smile on the face of some future programmer. My predecessor would sometimes conclude such in-line code comments with "And BTW, did I mention that (language being used) sucks?" It made me smile more than once.

Chris Miller · Jan 21, 2019

I tend to go with defines, like
#define STUPID_KLUDGE_TO_FIX_PROBABLE_HDD_BUG

Nugatory · Jan 21, 2019

Chris Miller said:

But the weirdest thing about it for me is what "fixes" it. Instead of reading n bytes directly into the buffer passed to my calling/cover function, I malloc a pointer to n bytes, read into that, memcpy these n bytes into the buffer passed to my cover function, and free the malloced pointer. Then all tests work perfectly. To me, this is a complete no-op, and seems to rule out memory corruption and compiler errors (which I've tested for to a fair extent anyway). I'd be curious to hear anyone's thoughts on this.

Many years ago (the processor was an Intel486 if you want a quantitative sense of what "many" means here) I met a bug that had the same footprint: read into a buffer wouldn't work, reading into a different memory buffer and copying into the desired buffer would.

The BIOS was improperly initializing the memory controller so that DMA transfers into the region of memory containing the buffer were not being snooped by the processor L2 cache. Thus if something from that region was in the cache before the read the processor wouldn't see the effect of the read - and if it was dirty in the cache the processor would even helpfully copy the old data back into memory when it snooped a read by another processor. Allocate a buffer outside of the poisoned region and DMA worked fine; and of course a memcpy into the poisoned region worked because processor access was properly snooped. And obviously the behavior would be exquisitely random because it depended on the cache footprint and pattern of past memory accesses.

We actually found it pretty quickly (about three days) because it was a multi-processor system and we were already predisposed to blame weirdnesses on cache coherency problems.

Not a helpful answer, but you've given me a chance to tell a war story...

Chris Miller · Jan 22, 2019

Actually, that's both interesting and helpful. My first PC (not counting C64) sported the 486 (w/ built-in floating pt!). No hard drive, but dual floppies! Cost about 4G.

Tom.G · Jan 22, 2019

Chris Miller said:

Cost about 4G.

I hope you mean 4 Grand, not 4 Giga, dollars. :))

If the latter, I want you as a new Best-Friend-Forever!

Cheers,
Tom

Chris Miller · Jan 23, 2019

You like your bffs to be deep in the red? But yeah, only 4 thou for a dual floppy 16 MHz PC running DOS (no Windows), and this was back when you could almost buy a new Honda Civic for that. So, say about 30K in today's dollars.
