On Fri, 2006-06-09 at 14:40 +0900, Naoki wrote:
> Hi all,
>
> We have a range of FC5 boxes (~4 with the problem) and from roughly
> three/four weeks ago we've been seeing intermittent lockups/hangs which
> often result in having to power-cycle. The boxes are currently running
> 2.6.16-1.2122_FC5 or 2.6.16-1.2122_FC5smp but despite slightly different
> hardware all display the same issue. The filesystem is ReiserFS.
>
> Disk operations will stall, commands like 'free' will (might) return
> quickly, but 'uptime' could take 5 minutes to complete. The system is
> essentially unusable.
>
> Fork rate and # of procs seem to skyrocket, however I'm not sure if this
> is just monitoring going strange because of the underlying problem.
>
> I've finally managed to capture the issue in more detail and was hoping
> somebody had a clue. Here I perform an "strace -tt ls /var":
>
> ...
> 12:01:28.034014 read(4, "root:x:0:root\nbin:x:1:root,bin,d"..., 131072) = 679
> 12:01:28.034163 close(4) = 0
> 12:01:28.034267 munmap(0xb7cda000, 131072) = 0
> 12:01:28.034383 lstat64("/var/spool", {st_mode=S_IFDIR|0755, st_size=328, ...}) = 0
> 12:01:28.034543 getxattr("/var/spool", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
> 12:01:28.034679 lstat64("/var/tomcat4", {st_mode=S_IFDIR|0755, st_size=72, ...}) = 0
> 12:01:27.542319 getxattr("/var/tomcat4", "system.posix_acl_access", 0x0, 0) = -1 EOPNOTSUPP (Operation not supported)
> 12:01:27.542577 lstat64("/var/net-snmp", {st_mode=S_IFDIR|0700, st_size=80, ...}) = 0
> 12:01:27.542847 getxattr("/var/net-snmp", "system.posix_acl_access",
> ...
>
> You can see that on the lstat64 of /var/tomcat4 the timestamp jumps
> back; in actuality that sys call took about 40 seconds to complete.
>
> I have done this a couple of times on different areas of the disk and
> the results are the same, lstat64 is hanging for extremely long periods.
>
> Another side effect of this is the system time becomes skewed during the
> hang on an lstat (probably other calls do this but I've not been able to
> trace enough).
>
> On one box I've installed 2.6.16-1.2129_FC5 from FC5 testing to see if
> that helps, on another I've reverted all the way back to
> kernel-2.6.16-1.2108_FC4.i686.rpm.
>
> I've run the smartctl utility to check the disk is ok and that has
> passed on all servers. I'll wait and see the results of my kernel
> updates/regressions.

It has happened on another server. Not one of the above-mentioned boxes
with replaced kernels, but this one is also running 2.6.16-1.2122_FC5.

# date; ls -l /var ; date
Fri Jun 9 18:27:36 JST 2006
total 3
drwxr-xr-x 10 root root 264 May 17 11:08 cache
drwxr-xr-x 3 root root 72 Feb 12 02:16 db
drwxr-xr-x 3 root root 72 Feb 12 02:16 empty
drwxr-xr-x 7 vcp vcp 200 Dec 15 12:12 jsp
drwxr-xr-x 17 root root 480 May 17 11:20 lib
drwxr-xr-x 2 root root 48 Feb 12 02:16 local
drwxrwxr-x 6 root lock 144 Jun 9 05:12 lock
drwxr-xr-x 9 root root 1896 Jun 4 05:24 log
lrwxrwxrwx 1 root root 10 May 17 10:55 mail -> spool/mail
drwxr-x--- 4 root named 96 Apr 19 23:12 named
drwx------ 2 root root 80 May 27 10:10 net-snmp
drwxr-xr-x 2 root root 48 Feb 12 02:16 nis
drwxr-xr-x 2 root root 48 Feb 12 02:16 opt
drwxr-xr-x 2 root root 48 Feb 12 02:16 preserve
drwxr-xr-x 15 root root 696 Jun 9 05:12 run
drwxr-xr-x 13 root root 328 Feb 12 02:16 spool
drwxrwxrwt 2 root root 48 Jun 9 05:12 tmp
drwxr-xr-x 3 root root 72 Nov 18 2003 tomcat4
drwxr-xr-x 6 root root 144 Feb 12 08:12 www
drwxr-xr-x 3 root root 128 May 17 11:11 yp
Fri Jun 9 18:27:36 JST 2006

Notice the time didn't change: immediately after printing the first
date/time, the command hung for 30 seconds before the 'ls' output appeared.
Then I kept running the 'date' command and you can see what's happening:

[root@banner8 ~]# date
Fri Jun 9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:38 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:39 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:39 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun 9 18:27:38 JST 2006

Anybody seen anything like _that_ before? It is running ntpd.
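
For anyone who wants to catch these backward jumps without retyping 'date' by hand, here is a minimal sketch of how it could be scripted. The function names find_backward_steps and sample_clock are made up for illustration, and it assumes GNU date (for the %s epoch format). It only shows that the wall clock stepped backwards, not why.

```shell
#!/bin/sh
# Hypothetical helper: reads one integer timestamp per line on stdin and
# reports every sample that is earlier than the sample before it.
find_backward_steps() {
    prev=""
    while read -r now; do
        if [ -n "$prev" ] && [ "$now" -lt "$prev" ]; then
            echo "backward step: $prev -> $now ($((prev - now))s)"
        fi
        prev=$now
    done
}

# Hypothetical helper: prints the wall clock (seconds since the epoch,
# GNU date assumed) once a second, $1 times (default 60).
sample_clock() {
    n=${1:-60}
    i=0
    while [ "$i" -lt "$n" ]; do
        date +%s
        sleep 1
        i=$((i + 1))
    done
}

# Usage (not run automatically):
#   sample_clock 300 | find_backward_steps
```

Feeding it the seconds values from the 'date' output above (37 38 39 36 36 37 36 37 39 36 37 38) reports three backward steps: 39 -> 36, 37 -> 36 and 39 -> 36.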