On Mon, 8 Oct 2007, Gilbert Sebenste wrote:
Hello all,
I am having an absolutely vexing problem that maybe somebody might shed some
light on.
I just got 2 new computers, both running F7. Each has one Seagate 750 GB
SATA 3 Gb/s, 7200 RPM drive with a 16 MB cache, 4 GB of RAM, and a Core 2
Quad 6700 on an ASUS motherboard.
OK. I run the computers pretty hard. But I have two Pentium 4's which work
just as hard, all getting a 20 MB/sec peak (1 MB/sec avg) weather feed from
the National Weather Service, flawlessly for months until I install new
kernels on them and reboot.
The P4 has been around for years, so that type of system has been pretty
well tested.
OK, within 12 hours after startup of the new machine, running the same
software as the other, slower machines with the exact same data feed, I
get
kernel: journal commit I/O error
I can log in, but can't run commands. A manual power-down (shutdown -r now
won't work) and reboot clears it fine.
First I suspected a hard drive error on both machines. But then
replacement hard drives came in. It seemed to stop the problem for a few
days, so I closed a bugzilla I had. Nope, this weekend, it went back to
crashing every 4-18 hours.
I tried to cut the read-writes in half, to no effect, by reducing the
amount of data/files coming in.
I have:
Replaced the hard drive 3 times with new ones (to no avail)
Reduced the read/writes by around half
Turned off legacy USB support, which also caused my keyboard and mouse to
stop working with errors (that's been cleared and is OK)
Filed a bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=318661
Tonight, I tried using the original kernel that came with F7
(2.6.21-1.3194.fc7) instead of the latest (2.6.22.9-91.fc7).
As of two hours into this, so far so good, but I'm not confident.
Two other machines, Pentium 4's at 3 GHz with ASUS motherboards, purr like a
kitten.
Has anyone seen anything like this, or know what could be the problem?
As always, grateful for any help, and thanks for reading this!
Don't assume the problem is related to your heavy disk I/O. Try some
other workloads. I like to run a suite of benchmarks on new hardware.
They often reveal problems with the initial setup, and are helpful
later on when something seems broken, e.g., why did the last kernel
update cause disk I/O to slow by 50%?
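As a rough illustration of the kind of baseline worth recording, here is a
minimal sketch of a sequential-write timing test. It assumes Python is
available on the machine; the function name write_throughput_mb_s and its
default sizes are made up for this example, and it is no substitute for a
real benchmark suite. Run it before and after a kernel update and compare
the numbers.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, total_mb=32, block_kb=1024):
    """Write about total_mb of zeros in block_kb chunks and return MB/s.

    Hypothetical helper for illustration only; the sizes are arbitrary.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                f.write(block)
            f.flush()
            # Force the data out to the device rather than timing only
            # the page cache.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    written_mb = blocks * block_kb / 1024
    return written_mb / elapsed


if __name__ == "__main__":
    rate = write_throughput_mb_s(tempfile.gettempdir())
    print(f"sequential write: {rate:.1f} MB/s")
```

Point the directory argument at the filesystem that receives the data feed
so the test exercises the same disk and journal.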
Are you using x86_64 kernels? I suspect most people with similar
workloads will be using x86_64, so you may be encountering problems
specific to code that hasn't been thoroughly exercised on i386 kernels. In
the past, there have been problems with RH's 4k stack size, particularly
during error handling, that can mask the real source of the problem.
If you are really stuck with 32-bit kernels, you might try the 16k
versions from linuxant.
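If you are not sure which kernel flavour a box is actually running, a
quick check along these lines will tell you. This is only a sketch,
assuming Python is installed; kernel_is_64bit is a name invented for the
example, and it simply looks at the machine type that uname reports.

```python
import platform


def kernel_is_64bit(machine=None):
    """Return True if the reported machine type is 64-bit x86.

    Hypothetical helper: with no argument it checks the running kernel.
    """
    machine = machine or platform.machine()
    return machine in ("x86_64", "amd64")


if __name__ == "__main__":
    print("kernel release:", platform.release())
    print("machine type:  ", platform.machine())
    print("x86_64 kernel: ", kernel_is_64bit())
```

A 32-bit kernel on a 64-bit CPU normally reports something like i686 here,
which is the case the stack-size remarks above apply to.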
--
George N. White III <aa056@xxxxxxxxxxxxxx>