Re: If not readdir() then what?

On Wed, Apr 11, 2007 at 07:15:57AM +1000, Neil Brown wrote:
> > Neil, is this correct?
> 
> It's just NFS.  nfsd does open/getdents/close on every readdir
> request.

That's... unfortunate.

Ext3 with htree enabled creates an in-memory red-black tree of a
portion of the directory; this is typically one filesystem block's
worth of entries, but if a hash collision spills across a directory
block boundary, we keep filling the red-black tree until we have read
in all of the directory entries that collide on that particular hash
value.
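
(For a rough feel of the shape of this, here's a toy userspace sketch,
emphatically not the actual fs/ext3 code: one block's worth of names
stuffed into a balanced tree keyed by hash and then walked in hash
order, the way ext3_dx_readdir walks its red-black tree.  It uses
POSIX tsearch() in place of the kernel rbtree and a made-up hash in
place of ext3's half-MD4/TEA hash:)

/* toy-htree.c: illustrative only, not fs/ext3/dir.c */
#include <search.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct fname {			/* stand-in for the kernel's struct fname */
	uint32_t hash;
	char name[64];
};

static int cmp_fname(const void *a, const void *b)
{
	const struct fname *fa = a, *fb = b;

	if (fa->hash != fb->hash)
		return fa->hash < fb->hash ? -1 : 1;
	/* tie-break same-hash names; the kernel chains them instead */
	return strcmp(fa->name, fb->name);
}

static uint32_t toy_hash(const char *s)	/* NOT ext3's real hash */
{
	uint32_t h = 0;

	while (*s)
		h = h * 31 + (unsigned char)*s++;
	return h;
}

static void print_fname(const void *nodep, VISIT which, int depth)
{
	(void)depth;
	if (which == postorder || which == leaf) {
		const struct fname *f = *(struct fname * const *)nodep;
		printf("%08x  %s\n", f->hash, f->name);
	}
}

int main(void)
{
	/* pretend these came from one directory block's worth of dirents */
	static const char *block[] = { "Makefile", "README", "main.c", "util.c" };
	void *root = NULL;
	size_t i;

	for (i = 0; i < sizeof(block) / sizeof(block[0]); i++) {
		struct fname *f = malloc(sizeof(*f));

		f->hash = toy_hash(block[i]);
		snprintf(f->name, sizeof(f->name), "%s", block[i]);
		tsearch(f, &root, cmp_fname);
	}

	/* walk in hash order, the way ext3_dx_readdir walks its tree */
	twalk(root, print_fname);
	return 0;
}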

We then maintain, hanging off the file pointer, an internal cursor
which tracks exactly which directory entries we have and haven't
returned via getdents/readdir.  So long as the user never changes
filp->f_pos, we can meet the POSIX guarantees about returning each
file (that hasn't been deleted or added since the last opendir or
rewinddir) once and exactly once.  If the user calls telldir/seekdir,
we can no longer make this guarantee, since the telldir cookie can't
encapsulate the location inside our red-black tree.  (We would need at
least 96 bits in order to make that guarantee; 64 bits of hash, plus a
32-bit count value in the hash collision chain.)  The relevant code
which does this is in ext3_dx_readdir:

#define hash2pos(major, minor)	(major >> 1)
#define pos2maj_hash(pos)	((pos << 1) & 0xffffffff)
#define pos2min_hash(pos)	(0)
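/* i.e. only the top 31 bits of the major hash survive in f_pos;
 * the minor hash is discarded entirely */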

	/* Some one has messed with f_pos; reset the world */
	if (info->last_pos != filp->f_pos) {
		free_rb_tree_fname(&info->root);
		info->curr_node = NULL;
		info->extra_fname = NULL;
		info->curr_hash = pos2maj_hash(filp->f_pos);
		info->curr_minor_hash = pos2min_hash(filp->f_pos);
	}

(Note, we're only using a 31-bit telldir cookie because of NFSv2
compatibility.  If we could know we're talking to an NFSv3/v4 client,
for example by checking O_LARGEFILE, we could easily supply a 63-bit
telldir cookie.  As I recall there were some signed/unsigned issues,
which is why we had to limit ourselves to a 31-bit/63-bit value.)
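
(For concreteness, a 63-bit cookie would just mean packing the minor
hash into the low word of f_pos instead of throwing it away.
Something like the following sketch, which mirrors the existing
macros but isn't a real patch against fs/ext3/dir.c:)

#include <stdint.h>
#include <stdio.h>

/* Hypothetical 63-bit cookie: 31 bits of major hash in the high word,
 * the full 32-bit minor hash in the low word.  The top bit stays zero
 * so the cookie is still a positive off_t. */
static uint64_t hash2pos64(uint32_t major, uint32_t minor)
{
	return ((uint64_t)(major >> 1) << 32) | minor;
}

static uint32_t pos2maj_hash64(uint64_t pos)
{
	return (uint32_t)((pos >> 32) << 1);
}

static uint32_t pos2min_hash64(uint64_t pos)
{
	return (uint32_t)pos;
}

int main(void)
{
	uint32_t major = 0xdeadbeee, minor = 0x12345678;
	uint64_t pos = hash2pos64(major, minor);

	/* round trip: only the low bit of the major hash is lost */
	printf("pos=%016llx major=%08x minor=%08x\n",
	       (unsigned long long)pos,
	       pos2maj_hash64(pos), pos2min_hash64(pos));
	return 0;
}

That's 31 bits of major hash plus the full 32-bit minor hash, i.e. 63
bits; still short of the 96 bits we'd need to also encode a position
within a collision chain, but it makes cookie collisions far less
likely.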


So the problem with nfsd doing an open/getdents/close on every readdir
request is two-fold.  First of all, it means you are incurring extra
overhead by forcing ext3 to allocate, populate, and free the red-black
tree on every single NFS readdir request.  Secondly, it increases the
chances of problems with the hash collision logic.

Yes, right now we're probably courting danger if we try to operate on
a directory with hash collisions, because it means that multiple
dirents returned by the NFS server to the NFS client will have the
same NFS readdir cookie (currently supplied by filp->f_pos/llseek).
But if most NFS clients typically use the cookie associated with the
last dirent returned by the readdir RPC, we might be getting away with
potential hash collisions --- and if nfsd were to cache open directory
fd's for some short period of time, we might get away with it even
more, since if the NFS client does a linear read-through of the
dirents, we might end up winning even though the results might be a
little cheesy.
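
(If anyone wants to watch the cookie stream a client would see, it's
easy enough to drive getdents64 by hand from userspace and then seek
back to one of the returned d_off cookies, which is roughly what nfsd
does on the client's behalf when the next READDIR request arrives
carrying that cookie.  A rough sketch, with the usual hand-rolled
linux_dirent64 since glibc doesn't expose it:)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>

struct linux_dirent64 {
	uint64_t	d_ino;
	int64_t		d_off;		/* cookie for resuming *after* this entry */
	unsigned short	d_reclen;
	unsigned char	d_type;
	char		d_name[];
};

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : ".";
	char buf[4096];
	int64_t last_cookie = 0;
	long n, off;
	int fd;

	fd = open(path, O_RDONLY | O_DIRECTORY);
	if (fd < 0) { perror("open"); return 1; }

	/* first batch: print each entry's cookie, remember the last one */
	n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
	if (n < 0) { perror("getdents64"); return 1; }
	for (off = 0; off < n; ) {
		struct linux_dirent64 *d = (struct linux_dirent64 *)(buf + off);

		printf("%-20s cookie=%lld\n", d->d_name, (long long)d->d_off);
		last_cookie = d->d_off;
		off += d->d_reclen;
	}

	/* resume from the last cookie, as a client's next READDIR would */
	if (lseek(fd, last_cookie, SEEK_SET) < 0) { perror("lseek"); return 1; }
	n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
	printf("next batch after cookie %lld: %ld bytes\n",
	       (long long)last_cookie, n);
	close(fd);
	return 0;
}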

It is true that if the NFS client tries to use a telldir cookie
associated with some directory entry not at the end of the returned
block of dirents, and the directory entry is part of a set of
filenames that result in a hash collision, we could end up losing.
But I'm guessing this happens very rarely in practice.

One of the things I did when I was debugging this code way back when
was to add a debugging hash type which always returned the value 42.
This made sure that all of the edge conditions involving hash
collisions actually got exercised, and modulo telldir/seekdir simply
not working, I was able to verify that the filesystem worked
correctly, albeit not very efficiently.  I didn't test to see what
would happen when exporting such a filesystem via NFS.  It might be
interesting to try that and see how many NFS clients are able to deal
with such a filesystem export.

Given that we haven't gotten any complaints of clients stuck in an
infinite readdir loop, I have to conclude that either NFS clients are
a lot more robust/predictable than we are giving them credit for, or
nobody is exporting via NFS directories big enough for hash collisions
on a 31-bit cookie value to have happened yet --- which seems rather
hard to believe.

Regards,

						- Ted
