Hello Howard,
OK, within 12 hours of starting up the new machine, which runs the same
software with the exact same data feed as the other, slower machines,
I get:
kernel: journal commit I/O error
I can log in, but I can't run any commands. A manual power-down
(shutdown -r now won't work) and reboot clears it.
At first I suspected a hard drive error on both machines. Then the
replacement hard drives came in, and they seemed to stop the problem for
a few days, so I closed the Bugzilla report I had filed. Nope: this
weekend, it went back to crashing every 4-18 hours.
I also tried cutting the reads/writes in half by reducing the amount of
data/files coming in, to no effect.
I have:
- Replaced the hard drive 3 times with new ones (to no avail)
- Reduced the reads/writes by around half
- Turned off legacy USB support, which also caused my keyboard and mouse
  to stop working, with errors (that has been cleared and is OK now)
- Filed a Bugzilla report: https://bugzilla.redhat.com/show_bug.cgi?id=318661
Tonight, I tried booting the original kernel that came with F7
(2.6.21-1.3194.fc7) instead of the latest (2.6.22.9-91.fc-7).
Two hours in, so far so good, but I'm not confident.
Two other machines, 3 GHz Pentium 4s with ASUS motherboards, purr like
kittens.
Has anyone seen anything like this, or know what could be the problem?
As always, grateful for any help, and thanks for reading this!
Gilbert
*******************************************************************************
Gilbert Sebenste
********
(My opinions only!) ******
*******************************************************************************
> I would suspect a hardware issue with the motherboards as my first port
> of call. I had a similar problem with a new Pentium 4 board recently,
> where the ATA disc interface went offline every 18 hours or so, but
> after I replaced the drive with a SATA drive the system has purred
> along for weeks.
On two new PCs? Showing identical symptoms? I find that hard to believe.
But on the other hand...
> Secondly, the kernel version may be important: Core 2 Quad processors
> are newish, so a later kernel SHOULD have better support. Maybe try a
> development kernel on one of the machines, e.g. 2.6.23.-----
This is what I am wondering: whether it *is* the kernel, udev, or
something like that. This thing has 2 gb/sec throughput; it shouldn't be
doing this.
> Thirdly, have you run a full fsck on the drives after they fail? Reboot
> into single-user mode and run fsck -f. You may find that the problem is
> disc structure corruption... then you have to find out why.
I need to do that... thanks for the reminder.
> You do not say which journalling file system you are using: is it ext3,
> jfs, reiserfs, ...?
ext3.
> Finally, have you run memtest86+ on these machines? A memory dropout
> could be going unnoticed (especially if they do not have ECC memory).
Not yet. But I can tell you "top" shows the full 4 GB I'm supposed to
have. Of course, that doesn't mean much. Again, I find it very difficult
to believe that two machines would both have this problem. That said,
I'm not ruling anything out.
> Not sure if this will help, but I hope it is not just noise...
No, it helped, thanks. If you have any other suggestions, I'll take them.