On Wed, 2008-03-26 at 11:19 -0400, Tom Horsley wrote:
> On Wed, 26 Mar 2008 09:58:50 -0500
> Les Mikesell <lesmikesell@xxxxxxxxx> wrote:
>
> > What kind of problems do you see? It can be hard to get firewall
> > openings right and it depends on uid's matching at the client and server
> > for file ownership and permissions, but those things either work right
> > or not at all. You shouldn't see reliability or performance problems
> > unless you have hundreds of busy clients.
>
> What I mostly see is every imaginable problem on different machines
> at different times :-).
>
> I think the root cause is related to having vast numbers of different
> versions of unix/linux on different machines all of which claim
> to "support" NFS, but which together are highly unreliable (especially
> the ones too old to support tcp connections).
>
> The worst problem is data corruption on writes, especially writing
> large files across NFS, they will often wind up with large chunks of
> zero bytes in place of the actual data.

It sounds like you may be running some rather old NFS implementations. Most recent implementations should be able to use TCP, which detects lost or damaged segments and retransmits them.

Many years ago I saw problems like the ones you describe with writes over NFS/UDP when customers had overloaded ethernets with very high collision rates, a machine with a broken ethernet card, and/or a broken TCP/IP stack that was transmitting in an unfriendly way. It's been a long time since I've done this, but I used to run tcpdump and write awk or perl scripts to identify the offending systems, and/or disconnect systems from the network until the problem went away.

Today, I would likely look at upgrading the systems to use NFS over TCP, but if your network is a mess, it needs to be cleaned up. Also of value might be the retransmission statistics reported by netstat -s. I think there are also other statistics available that indicate the error rates on an ethernet interface, such as the error and collision counters shown by ifconfig.
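As a rough illustration of using the netstat -s counters, the Python sketch below compares two snapshots of that output and computes the fraction of TCP segments retransmitted between them. It is a minimal sketch, assuming Linux net-tools wording ("segments sent out" and "segments retransmitted", with the older spellings "send out" and "retransmited" also accepted); other systems format these statistics differently, and the function names are hypothetical. It only reads the TCP counters, so it says nothing about NFS traffic carried over UDP.

```python
import re
from typing import Optional, Tuple

# Assumed Linux net-tools wording; older builds spell these
# "segments send out" and "segments retransmited".
SENT_RE = re.compile(r"(\d+) segments sen[dt] out")
RETRANS_RE = re.compile(r"(\d+) segments retransmit+ed")


def tcp_counters(netstat_output: str) -> Optional[Tuple[int, int]]:
    """Extract (segments sent, segments retransmitted) from `netstat -s` text."""
    sent = SENT_RE.search(netstat_output)
    retrans = RETRANS_RE.search(netstat_output)
    if sent is None or retrans is None:
        return None  # counters not found in this output format
    return int(sent.group(1)), int(retrans.group(1))


def retransmit_ratio(before: str, after: str) -> Optional[float]:
    """Fraction of TCP segments retransmitted between two `netstat -s` snapshots.

    The counters are cumulative since boot, so the difference between two
    snapshots reflects current behaviour rather than the lifetime average.
    """
    a, b = tcp_counters(before), tcp_counters(after)
    if a is None or b is None:
        return None
    sent = b[0] - a[0]
    if sent <= 0:
        return None  # no traffic between snapshots (or the counters wrapped)
    return (b[1] - a[1]) / sent
```

On a Linux host the snapshots could be captured with subprocess.run(["netstat", "-s"], capture_output=True, text=True).stdout, taken a few minutes apart while NFS traffic is flowing. Comparing two snapshots rather than reading one avoids being misled by counts accumulated since the last reboot.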
You could look at deploying various types of network switches, or at splitting the network into subnets and multihoming the major NFS servers.

Nataraj

>
> There is one particular machine (in theory running the same dadgum
> version of linux as several others) where some sort of nonsense
> persists in always getting stale NFS filehandle messages any time
> I try to read specific individual files. I always have to unmount
> and remount the filesystem when it gets like this. (Neither system
> was down or not talking at any point, just some fiddling of the
> files in question, replacing them with symlinks, then suddenly the
> stale filehandle messages start).
>
> The protocols are in theory supposed to support negotiation of the
> correct NFS version when connecting to older machines, but that
> almost never works, we have to manually fiddle fstab entries to
> explicitly give the proper nfsver option or we get things like
> the filesystem is "mounted" but all attempts to access files get
> errors.
>
> Herding cats has got to have fewer irritations than using NFS :-).
>
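The fstab workaround in the quoted message (pinning the NFS version instead of relying on negotiation) can at least be audited. The Python sketch below lists plain nfs entries whose mount options set neither nfsvers= nor vers=, the two version options documented in the Linux nfs(5) man page. It is a minimal sketch assuming the usual whitespace-separated fstab layout; the function name is hypothetical, nfs4 entries are skipped because that filesystem type already implies version 4, and other Unix variants may use different option names.

```python
from typing import List

# Options the Linux nfs(5) man page documents for selecting the NFS version.
VERSION_KEYS = ("nfsvers=", "vers=")


def nfs_mounts_without_version(fstab_text: str) -> List[str]:
    """Return mount points of 'nfs' entries whose options do not pin a version.

    Hypothetical helper; assumes the usual fstab layout of
    device, mount point, filesystem type, options, dump, pass.
    """
    missing = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 3 or fields[2] != "nfs":
            continue  # not an NFS entry (nfs4 entries already imply version 4)
        options = fields[3].split(",") if len(fields) > 3 else []
        if not any(opt.startswith(VERSION_KEYS) for opt in options):
            missing.append(fields[1])  # this mount still relies on negotiation
    return missing
```

For example, passing the contents of /etc/fstab to nfs_mounts_without_version returns the mount points that still depend on version negotiation and may need an explicit nfsvers= option.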