Ron Herardian wrote:
"On a hard-mounted file system, NFS operations are retried until they are acknowledged by the server. A side effect of hard-mounting NFS file systems is that processes block (or "hang") in a high-priority disk wait state until their NFS RPC calls complete. If an NFS server goes down, the clients using its file systems hang if they reference these file systems before the server recovers. Using -intr in conjunction with the -hard mount option allows users to interrupt system calls that are blocked waiting on a crashed server. The system call is interrupted when the process making the call receives a signal, usually sent by the user typing Ctrl-C or using the kill command.Yep, in the man page too. That would imply that the mount commands listed below
which include "hard,intr" would allow one to send a signal (ctrl-C or killall or kill -9)
and terminate the process. However, with Fedora and the below listed kernel,
I could not kill the task.
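For what it's worth, the behaviour the man page promises would look roughly like this in a small test program. This is only a sketch of the expectation, not of what actually happens on this kernel, and the path is a placeholder for the real mount:

/* Sketch: with hard,intr, a signal should make a blocked NFS read() fail
 * with EINTR instead of leaving the process stuck in disk wait.
 * The path below is a placeholder, and open() itself may already block
 * if the server is down. */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void on_sigint(int sig) { (void)sig; /* just interrupt the syscall */ }

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigint;      /* no SA_RESTART, so syscalls return EINTR */
    sigaction(SIGINT, &sa, NULL);

    int fd = open("/mnt/solaris/somefile", O_RDONLY);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0 && errno == EINTR)
        fprintf(stderr, "read() was interrupted by a signal, as documented\n");
    else if (n < 0)
        perror("read");
    else
        printf("read %zd bytes\n", n);

    close(fd);
    return 0;
}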
"On a soft-mounted file system, an NFS RPC call returns a timeout error if it fails the number of times specified by the retrans option. You should not use the -soft option on any file system that is writeable, nor on any file system from which you load executables. NFS only guarantees the consistency of data after a server crash if the NFS file system was hard-mounted by the client."
This is a very good point. Thanks.
[http://www.brandonhutchinson.com/nfs_timeouts.html]
Wade Hampton wrote:
I have a Fedora server with kernel 2.4.22-1-2163 SMP mounting a file system from a remote Solaris server (hence the choice of options):
rsize=32768,ro,hard,intr,tcp,nfsvers=3
When the remote server is down or disconnected, a "df" hangs (as expected), but I can't kill it, even as root or with kill -9. The docs for mount indicate that the intr option should allow killing of processes blocked on file systems mounted with hard.
I also coded a test program that calls statvfs(2), and it hangs on the statvfs(2) call when run against a down NFS server. It too can't be interrupted or killed.
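A stripped-down sketch of that kind of test looks something like this (the mount point is a placeholder for the actual Solaris mount):

/* Sketch of the hanging test: against a dead server, statvfs() never
 * returns, and Ctrl-C / kill -9 do not stop the process despite the
 * hard,intr mount options.  Pass the mount point as argv[1]. */
#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/solaris";  /* placeholder */
    struct statvfs vfs;

    if (statvfs(path, &vfs) != 0) {      /* hangs here if the server is down */
        perror("statvfs");
        return 1;
    }
    printf("blocks: total=%lu free=%lu (block size %lu)\n",
           (unsigned long)vfs.f_blocks, (unsigned long)vfs.f_bfree,
           (unsigned long)vfs.f_frsize);
    return 0;
}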
My questions are:
1) Is there a safe and reliable means to check for a down NFS server? For example, is showmount -e <server> safe enough? It is interruptible, so one could wrap it with a timer and, if it times out, conclude that the server is down (see the sketch after this list).
2) Is the non-interruptible operation (even with the intr option) a bug or a feature?
3) Is there a simple kernel call, /proc entry, or similar that can be used for this purpose?
4) Is there a perl module to accomplish this?
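Regarding question 1, the best workaround I can think of (an assumption on my part, not a documented recipe) is to do the blocking call in a child process and keep the timer in the parent, so the caller itself can never get stuck in disk wait. The mount point and the 10-second timeout below are placeholders:

/* Sketch of a liveness check that cannot hang the caller: statvfs() runs in
 * a child process; the parent waits with a deadline. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/statvfs.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MOUNT_POINT "/mnt/solaris"   /* placeholder */
#define TIMEOUT_SEC 10               /* placeholder */

static void on_alarm(int sig) { (void)sig; /* just interrupt waitpid() */ }

/* Returns 1 if the server answered within the deadline, 0 otherwise. */
static int nfs_alive(const char *path)
{
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); return 0; }

    if (pid == 0) {                      /* child: may block indefinitely */
        struct statvfs vfs;
        _exit(statvfs(path, &vfs) == 0 ? 0 : 1);
    }

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;            /* no SA_RESTART: waitpid gets EINTR */
    sigaction(SIGALRM, &sa, NULL);
    alarm(TIMEOUT_SEC);

    int status;
    pid_t got = waitpid(pid, &status, 0);
    alarm(0);

    if (got == pid)
        return WIFEXITED(status) && WEXITSTATUS(status) == 0;

    /* Timed out: try to kill and reap the child; if it is stuck in
     * uninterruptible disk wait, this will not take effect and the child
     * lingers until the server recovers. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, WNOHANG);
    return 0;
}

int main(void)
{
    if (nfs_alive(MOUNT_POINT))
        printf("NFS server answered\n");
    else
        printf("NFS server did not answer within %d seconds\n", TIMEOUT_SEC);
    return 0;
}

The drawback is that when the server really is down, the child may be left behind in disk wait until the server recovers; the parent, at least, keeps running and can raise the SNMP trap or log the failure.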
This would be very useful for network monitoring, e.g., when the server goes down and stays down for more than a minute, generate an SNMP trap and write to a log file. It would be especially useful when you can't put an SNMP agent on the server, only on the client. It is also useful for writing a highly reliable client application.
As I have no control over the remote system, when it went down I had to hard-reboot my Linux box to stop the hung apps. That is a Windows solution, not a Linux solution.
Note: I found this when writing some MRTG scripts to check the disk utilization of partitions. My df calls hung, so I didn't even get the proper values for my local partitions. After a few days, I had LOTS of hung MRTG apps.
Thanks -- Wade Hampton
--
fedora-list mailing list
fedora-list@xxxxxxxxxx
To unsubscribe: http://www.redhat.com/mailman/listinfo/fedora-list