So, I have a relatively new system on which I am seeing strange NFS
behavior.
In short, I am getting seemingly random errors in files written via NFS.
* I do not get the errors if I write files locally.
* I see no errors on the NIC; I even tried a second NIC in a PCI
slot instead of the onboard one. No errors are recorded on the
NIC or on the switch, where it is on a 1Gb port.
* I see no memory errors; memtest ran clean for 3 days.
* To test, I am using dd if=/dev/zero to write files of various
(large) sizes.
* Since I know that the file should be all zeros, I wrote a C program
to read it back and tell me where it finds non-zero bytes. The
program's results are confirmed with od.
* The files read back always have the errors in the same place, so
it is not a problem with reading the files.
* There are no errors in any logs.
* The problem occurs on both the RAID1 (ext3) and RAID10 (xfs)
filesystems.
* I've tried two clients, both FC5, one 64-bit and the other 32-bit,
with the same results. The error was first uncovered by users
attempting to write files from other systems and other Fedora
releases, so it is repeatable regardless of the client.
* The server is not running anything else and spends a large portion
of the time idle. Load averages are quite low. Swap is mostly
unused. A large portion of RAM is allocated to file cache, but I
expect that this would be normal for this amount of file IO.
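
For illustration, a zero check along the lines of the one described
above could look like the following Python sketch. This is not the
original C program; the function name find_nonzero_runs and the 1 MiB
chunk size are assumptions made for the example. It reports each run of
non-zero bytes as an (offset, length) pair, so two read-backs of the
same file can be compared directly.

```python
CHUNK = 1 << 20  # read 1 MiB at a time (arbitrary choice for this sketch)


def find_nonzero_runs(path, chunk_size=CHUNK):
    """Return a list of (offset, length) runs of non-zero bytes in a file."""
    runs = []
    run_start = None  # file offset where the current non-zero run began
    offset = 0        # file offset of the start of the current chunk
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_size)
            if not buf:
                break
            if buf.count(0) != len(buf):
                # Chunk contains at least one non-zero byte: scan it.
                for i, b in enumerate(buf):
                    if b:
                        if run_start is None:
                            run_start = offset + i
                    elif run_start is not None:
                        runs.append((run_start, offset + i - run_start))
                        run_start = None
            elif run_start is not None:
                # All-zero chunk ends a run carried over from the previous chunk.
                runs.append((run_start, offset - run_start))
                run_start = None
            offset += len(buf)
    if run_start is not None:
        # File ended in the middle of a non-zero run.
        runs.append((run_start, offset - run_start))
    return runs
```

Calling find_nonzero_runs on a file written with dd if=/dev/zero should
return an empty list if the file is intact; any pairs returned give the
byte offsets to cross-check with od.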
The server is running an up-to-date FC6, although this also occurred
with FC5. I am about to try F7.
Hardware is an AMD 1220 dual-core 64-bit, on a Tyan K8SSA S3950
with an Adaptec RAID 2230SLP and 7 Fujitsu MAU3147NC.
The RAID config is that 2 disks (on different channels) are in a mirror
for the OS, 4 are in a RAID 10 config, and 1 is a hot spare.
Anyone ever seen anything like this before?
Suggest where I might look next?
Additional tests?