Re: df hangs on down nfs server mounted with hard,intr, can't kill

Ron Herardian wrote:

"On a hard-mounted file system, NFS operations are retried until they are acknowledged by the server. A side effect of hard-mounting NFS file systems is that processes block (or "hang") in a high-priority disk wait state until their NFS RPC calls complete. If an NFS server goes down, the clients using its file systems hang if they reference these file systems before the server recovers. Using -intr in conjunction with the -hard mount option allows users to interrupt system calls that are blocked waiting on a crashed server. The system call is interrupted when the process making the call receives a signal, usually sent by the user typing Ctrl-C or using the kill command.

Yep, it's in the man page too. That would imply that the mount commands listed below,
which include "hard,intr", would allow one to send a signal (Ctrl-C, killall, or kill -9)
and terminate the process. However, with Fedora and the kernel listed below,
I could not kill the task.


On a soft-mounted file system, an NFS RPC call returns a timeout error if it fails the number of times specified by the retrans option. You should not use the -soft option on any file system that is writeable, nor on any file system from which you load executables. NFS only guarantees the consistency of data after a server crash if the NFS file system was hard-mounted by the client."


This is a very good point....  Thanks.

[http://www.brandonhutchinson.com/nfs_timeouts.html]



Wade Hampton wrote:


I have a Fedora server with kernel 2.4.22-1-2163 SMP mounting a
remote Solaris server (hence the choice of options):

  rsize=32768,ro,hard,intr,tcp,nfsvers=3

When the remote is down or disconnected, a "df" hangs (as expected),
but I can't kill it, even as root or with kill -9.  The mount docs
indicate that the intr option should allow processes blocked on a
hard-mounted file system to be killed.

I also coded a test program that calls statvfs(2), and it hangs on
the statvfs(2) call when run against a down NFS server.  It too
can't be interrupted or killed.
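
For reference, a minimal sketch of such a test (not the original
program; the /mnt/nfs path is just a placeholder) looks something
like this:

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    struct statvfs vfs;
    const char *path = argc > 1 ? argv[1] : "/mnt/nfs";  /* placeholder mount point */

    /* Blocks indefinitely here if the NFS server behind 'path' is down. */
    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }
    printf("%s: %lu of %lu blocks free\n",
           path, (unsigned long)vfs.f_bfree, (unsigned long)vfs.f_blocks);
    return 0;
}

With the export hard-mounted, the statvfs() call sits in an
uninterruptible disk wait (D state in ps), which matches the
behaviour described above.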

My questions are:

1)  Is there a safe and reliable means to check for a down NFS server?
    (E.g., is showmount -e <server> safe enough?  It is interruptible,
    so one could wrap it with a timer and, if it times out, conclude
    that the server is down; see the sketch after this list.)

2)  Is the non-interruptible operation (even with the intr option)
    a bug or a feature?

3)  Is there a simple kernel call, /proc entry, or similar that can
    be used for this purpose?

4)  Is there a perl module to accomplish this?
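
For what it's worth, here is a rough sketch (in C, untested on the
setup above) of the timer-wrapped check floated in question 1: fork
a child that runs showmount -e against the server and give up after
a timeout.  Because showmount only does ordinary network I/O from
user space, the child can still be killed even when the NFS mount
itself is wedged.  The function name nfs_server_up and the 5-second
timeout are just illustrative choices.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; /* just interrupt waitpid() */ }

/* Returns 1 if the server answered within 'timeout' seconds, 0 otherwise. */
static int nfs_server_up(const char *server, unsigned timeout)
{
    pid_t child = fork();
    if (child < 0)
        return 0;                        /* fork failed; treat as down */

    if (child == 0) {                    /* child: run the probe */
        /* Discard showmount's normal output; only the exit status matters. */
        freopen("/dev/null", "w", stdout);
        execlp("showmount", "showmount", "-e", server, (char *)NULL);
        _exit(127);                      /* exec failed */
    }

    /* Parent: arrange for SIGALRM to interrupt waitpid() after 'timeout'. */
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;            /* no SA_RESTART, so waitpid returns EINTR */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);
    alarm(timeout);

    int status;
    pid_t r = waitpid(child, &status, 0);
    alarm(0);

    if (r == child)
        return WIFEXITED(status) && WEXITSTATUS(status) == 0;

    /* Timed out: kill the probe (it is not stuck in an NFS D-state wait). */
    kill(child, SIGKILL);
    waitpid(child, &status, 0);
    return 0;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <server>\n", argv[0]);
        return 2;
    }
    printf("%s is %s\n", argv[1], nfs_server_up(argv[1], 5) ? "up" : "DOWN");
    return 0;
}

A monitoring script could run a check like this once a minute and
raise the SNMP trap after a few consecutive failures.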

This would be very useful for network monitoring, e.g., when the
server goes down and stays down for more than a minute, generate an
SNMP trap and write to a log file.  That matters when you can't put
an SNMP agent on the server and can only run one on the client.  It
would also be useful for writing a highly reliable client application.

As I have no control over the remote system, when it went down I had
to do a hard reboot of my Linux box to stop the hung apps.  That is
a Windows solution, not a Linux solution.

Note: I found this while writing some scripts for MRTG to check the
disk utilization of partitions.  My df calls hung, so I didn't even get
the proper values for my local partitions.  After a few days, I had
LOTS of hung MRTG apps.

Thanks
--
Wade Hampton

--
fedora-list mailing list
fedora-list@xxxxxxxxxx
To unsubscribe: http://www.redhat.com/mailman/listinfo/fedora-list








