On Tue, Jan 20, 2004 at 08:07:31AM -0500, Dave Goldblatt wrote:
> Gregory Gulik wrote:
> >
> >F S   UID   PID  PPID  C PRI  NI ADDR    SZ WCHAN  TTY          TIME CMD
> >1 D     0 11952     1  0  80   2    -  1341 wait_o ?        00:00:00 gtar
>
> That's what I suspected - the app is in an I/O wait (technically,
> "uninterruptible sleep").  Usually this is due to an NFS hang, although
> it can be caused by other media which has wedged but not timed out.

What does "lsof" tell you?  What IO is the process waiting on?

For NFS the mount flag "intr" can help (next time).
Umount -k may also kill it this time.

You may want to look at a back trace for the process and see what the
last IO request was.  There are two stacks of interest: the user space
side and the kernel space side of the system call.

Depending on library support and the precise action inside the system
call, a read() of ten characters will not return until ten characters
are present.  A read of a line will not return until a newline marker
arrives.  And for some things the end of file is important.

Make sure you understand what "gtar" is being asked to do.  Its
standard input file descriptor may be hung (no newline/EOF).  It may be
hung on IO for a file read, a file write or a scratch file.  Has anyone
attempted to back up /dev/random, /dev/zero or a named pipe?  /dev/zero
took weeks on my old infinity computer and checksums on /dev/random
never matched.

Once a system call dives down deep into the OS, the handling of user
space signals (kill -N) gets murky.  As I noted above, NFS has a flag
"intr" to permit an interrupt from user space to end the file IO.  For
NFS this may be possible at multiple levels because timers for network
IO trigger and wake NFS up.  For other IO devices there may not be a
failsafe timer to wake up the driver and return an error.

Because "stuck in IO" often translates into "stuck in a hardware
driver", it is important to report the specifics of the device involved
and the specifics of the driver.  Note that RAID devices are layered.
The pseudo hardware that is the RAID depends on real hardware.  Thus
there are multiple places where a "hang" could be generated.

For some IO, make sure things have not gotten so slow that no useful
progress is being made.  Error recovery, RAID recovery or swap IO can
make things look broken.  The machine is working, but no visible
progress is being made, and each time you "look" the process is stuck
in IO.  You may be missing baby-step IO (but in that case you should be
able to kill it).

--
T o m  M i t c h e l l
mitch48-at-sbcglobal-dot-net
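
P.S.  The ps listing quoted above truncates WCHAN to "wait_o".  If you
want the state and the full wait channel without fighting ps column
widths, something like the sketch below can read /proc directly.  It is
only an illustration, not a tested tool: it assumes a Linux /proc layout
where /proc/<pid>/stat starts with "pid (comm) state", and it assumes the
kernel provides /proc/<pid>/wchan (older kernels may not, in which case
"?" is reported).  The function names are made up for this example.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def read_wchan(pid, proc_root="/proc"):
    """Return the wait channel name, "-" if empty, "?" if unreadable."""
    try:
        with open(os.path.join(proc_root, str(pid), "wchan")) as f:
            return f.read().strip() or "-"
    except OSError:
        return "?"


def find_uninterruptible(proc_root="/proc"):
    """List (pid, comm, wchan) for every process whose state is 'D'."""
    hits = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was malformed
        if state == "D":
            hits.append((pid, comm, read_wchan(pid, proc_root)))
    return hits


if __name__ == "__main__":
    for pid, comm, wchan in find_uninterruptible():
        print(pid, comm, wchan)
```

Run it a few times in a row.  A process that shows up in "D" on every
pass with the same wait channel is a better candidate for a real hang
than one that only appears now and then.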