Fedora Users: Bizarro system lockup/hang, possible kernel issue.

Hi all,

We have a number of FC5 boxes (about 4 show the problem), and for
roughly the last three to four weeks we've been seeing intermittent
lockups/hangs, which often end in a power-cycle. The boxes are currently
running 2.6.16-1.2122_FC5 or 2.6.16-1.2122_FC5smp, and despite slightly
different hardware they all display the same issue. The filesystem is
ReiserFS.

Disk operations stall: commands like 'free' usually return quickly, but
'uptime' can take 5 minutes to complete. The system is essentially
unusable.

The fork rate and number of processes also seem to skyrocket, but I'm
not sure whether that is real or just the monitoring going strange
because of the underlying problem.

I've finally managed to capture the issue in more detail and was hoping
somebody might have a clue. Here is the output of "strace -tt ls /var":

...
12:01:28.034014 read(4, "root:x:0:root\nbin:x:1:root,bin,d"..., 131072)
= 679
12:01:28.034163 close(4)                = 0
12:01:28.034267 munmap(0xb7cda000, 131072) = 0
12:01:28.034383 lstat64("/var/spool", {st_mode=S_IFDIR|0755,
st_size=328, ...}) = 0
12:01:28.034543 getxattr("/var/spool", "system.posix_acl_access", 0x0,
0) = -1 EOPNOTSUPP (Operation not supported)
12:01:28.034679 lstat64("/var/tomcat4", {st_mode=S_IFDIR|0755,
st_size=72, ...}) = 0
12:01:27.542319 getxattr("/var/tomcat4", "system.posix_acl_access", 0x0,
0) = -1 EOPNOTSUPP (Operation not supported)
12:01:27.542577 lstat64("/var/net-snmp", {st_mode=S_IFDIR|0700,
st_size=80, ...}) = 0
12:01:27.542847 getxattr("/var/net-snmp", "system.posix_acl_access", 
...

You can see that after the lstat64 on /var/tomcat4 the timestamp jumps
back; in actuality that system call took about 40 seconds to complete.
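
For anyone wanting to look for the same pattern in a longer trace, here
is a rough sketch (not something I ran at the time) that scans
"strace -tt" output and flags any place where the timestamp goes
backwards or where the gap between two calls exceeds a threshold. The
function name find_time_anomalies and the 1 second default for max_gap
are my own assumptions, and it does not handle a trace that crosses
midnight.

```python
import re
from datetime import datetime, timedelta

# Leading HH:MM:SS.microseconds stamp, as printed by `strace -tt`.
STAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def find_time_anomalies(lines, max_gap=timedelta(seconds=1)):
    """Return (kind, delta_seconds, previous_call, current_call) tuples.

    kind is "backwards" when the clock stepped back between two calls,
    or "slow" when the gap between them is larger than max_gap.
    Hypothetical helper; assumes the trace does not cross midnight.
    """
    anomalies = []
    prev_time = None
    prev_call = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            # Wrapped continuation line or "..." elision: no timestamp.
            continue
        now = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        call = match.group(2)
        if prev_time is not None:
            delta = now - prev_time
            if delta < timedelta(0):
                anomalies.append(
                    ("backwards", delta.total_seconds(), prev_call, call))
            elif delta > max_gap:
                anomalies.append(
                    ("slow", delta.total_seconds(), prev_call, call))
        prev_time, prev_call = now, call
    return anomalies
```

It can be fed the lines of a saved trace, for example one written with
strace's -o option. Adding strace's -T option, which prints the time
spent inside each call, may be a simpler cross-check when the wall clock
itself cannot be trusted.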

I have done this a couple of times on different areas of the disk and
the results are the same, lstat64 is hanging for extremely long periods.

Another side effect is that the system time becomes skewed during the
hang on an lstat (other calls probably do this too, but I've not been
able to trace enough of them).
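
One way to measure both the stall and the skew without strace might be
to time each lstat against two clocks at once. The sketch below is only
an illustration under assumptions: it needs a Python 3 interpreter
(which is not what these boxes ship with), the names timed_lstat and
scan and the 1 second threshold are made up for the example, and I don't
know whether the monotonic clock would stay sane during one of these
hangs.

```python
import os
import time


def timed_lstat(path):
    """lstat `path`; return (monotonic_elapsed, wall_elapsed) in seconds.

    Hypothetical helper. If the two numbers disagree noticeably, the
    wall clock was stepped or skewed while the call was in progress.
    """
    wall_start = time.time()
    mono_start = time.monotonic()
    os.lstat(path)
    mono_elapsed = time.monotonic() - mono_start
    wall_elapsed = time.time() - wall_start
    return mono_elapsed, wall_elapsed


def scan(directory, threshold=1.0):
    """Return (path, mono, wall) for entries whose lstat looks suspicious.

    An entry is reported when the call took longer than `threshold`
    seconds, or when the two clocks differ by more than `threshold`.
    Note that os.listdir itself may stall on an affected box.
    """
    suspicious = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        mono, wall = timed_lstat(path)
        if mono > threshold or abs(wall - mono) > threshold:
            suspicious.append((path, mono, wall))
    return suspicious
```

Calling scan("/var") would mirror the "ls /var" case above; an empty
list means no lstat in that directory crossed the threshold on that run.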

On one box I've installed 2.6.16-1.2129_FC5 from FC5 testing to see if
that helps; on another I've reverted all the way back to
kernel-2.6.16-1.2108_FC4.i686.rpm.

I've run the smartctl utility to check that the disks are OK, and they
passed on all servers. I'll wait and see the results of my kernel
updates/regressions.