Re: How to manage shared persistent local caching (FS-Cache) with NFS?

Hi David-

On Dec 5, 2007, at 8:22 PM, David Howells wrote:

Chuck Lever <[email protected]> wrote:
I don't see how persistent local caching means we can no longer ignore (a)
and (b) above.  Can you amplify this a bit?

How about I put it like this.  There are two principal problems to be dealt
with:

 (1) Reconnection.
     Imagine that the administrator requests a mount that uses part of a
     cache.  The client machine is at some time later rebooted and the
     administrator requests the same mount again.

     Since the cache is meant to be persistent, the administrator is at
     liberty to expect that the second mount immediately begins to use the
     data that the first mount left in the cache.

     For this to occur, the second mount has to be able to determine which
     part of the cache the first mount was using and request to use the
     same piece of cache.

     To aid with this, FS-Cache has the concept of a 'key'.  Each object in
     the cache is addressed by a unique key.  NFS currently builds a key to
     the cache object for a file from: "NFS", the server IP address, port
     and NFS version and the file handle for that file.

Why not use the fsid as well?  The NFS client already uses the fsid to
detect when it is crossing a server-side mount point.  Fsids are supposed
to be stable over server reboots (although sometimes they aren't, it could
be made a condition of supporting FS-cache on clients).

I also note the inclusion of server IP address in the key.  For multi-homed
servers, you have the same unavoidable cache aliasing issues if the client
mounts the same server and export via different server network interfaces.
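
To make the aliasing concern concrete, here is a minimal Python sketch of a key built from the components listed above. The function name build_cache_key, the tuple layout, the optional fsid field, and the sample addresses are all hypothetical; the key format actually used by the FS-Cache patches may well differ.

```python
def build_cache_key(server_ip, port, nfs_version, file_handle, fsid=None):
    """Hypothetical key layout: the components named above, plus an optional fsid."""
    key = ("NFS", server_ip, port, nfs_version, bytes(file_handle))
    if fsid is not None:
        key += (fsid,)
    return key


# The same file handle reached through two interfaces of a multi-homed server
# (addresses are documentation examples, not real hosts).
fh = b"\x01\x02\x03\x04"
key_a = build_cache_key("192.0.2.10", 2049, 3, fh)
key_b = build_cache_key("198.51.100.10", 2049, 3, fh)

# The keys differ, so the cache would hold two objects for one server file.
aliased = key_a != key_b
```

Adding the fsid narrows the key but does not remove this aliasing, because the server address is still part of it.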

 (2) Cache coherency.
     Imagine that the administrator requests a mount that uses part of a
     cache.  The administrator then makes a second mount that overlaps the
     first, maybe because it's a different part of the same server export
     or maybe it uses the same part, but with different parameters.

     Imagine further that a particular server file is accessible through
     both mountpoints.  This means that the kernel, and therefore the user,
     has two views of the one file.

     If the kernel maintains these two views of the files as totally
     separate copies, then coherency is mostly not a kernel problem, it's
     an application problem - as it is now.

     However, if these two views are shared at any level - such as if they
     share an FS-Cache cache object - then coherency can be a problem.

Is it a problem because, if there are multiple copies of the same remote
file in its cache, then FS-cache doesn't know, upon reconnection, which
item to match against a particular remote file?

I think that's actually going to be a fairly typical situation -- you'll
have conditions where some cache items will become orphaned, for example,
so you're going to have to deal with that ambiguity as a part of normal
operation.

For example, if the FS-caching client is disconnected or powered off when a
remote rename occurs that replaces a file it has cached, the client will
have an orphaned item left over.  Maybe this use case is only a garbage
collection problem.

     The two simplest solutions to the coherency problem are (a) to enforce
     sharing at all levels (superblocks, inodes, cache objects), (b) to
     enforce non-sharing.  In-between states are possible, but are much
     trickier and more complex.

     Note that cache coherency management can't be entirely avoided: upon
     reconnection a cache object has to be checked against the server to
     see whether it's still valid.


How do you propose to do that?

First, clearly, FS-cache has to know that it's the same object, so fsid and
filehandle have to be the same (you refer to that as the "reconnection
problem", but it may generally be a "cache aliasing problem").

I assume FS-cache has a record of the state of the remote file when it was
last connected -- mtime, ctime, size, change attribute (I'll refer to this
as the "reconciliation problem")?  Does it, for instance, checksum both the
cache item and the remote file to detect data differences?

You have the same problem here as we have with file system search tools
such as Beagle.  Reconciling file contents after a reconnection event may
be too expensive to consider for NFS, especially if a file is terabytes in
size.

Note that both these problems only really exist because the cache is
persistent between mounts.  If it were volatile between mounts, then (1)
would not exist, and (2) can be ignored as it is now.

Do you allow administrators to select whether the FS-cache is persistent?
Or is it always unconditionally persistent?

An adequate first pass at FS-cache can be done without guaranteeing
persistence.  There are a host of other issues that need exposure --
steady-state performance; cache garbage collection and reclamation; cache
item aliasing; whether all files on a mount point should be cached on disk,
or some in memory and some on disk; and so on -- that can be examined
without even beginning to worry about reboot recovery.

And what would it harm if FS-cache decides that certain items in its cache
have become ambiguous or otherwise unusable after a reconnection event,
thus it reclaims them instead of re-using them?
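
As a rough illustration of reclaiming instead of re-using, the following Python sketch compares the attributes recorded when an item was cached (the mtime, ctime, size and change attribute mentioned earlier) against what the server reports now, and drops the item on any mismatch. It deliberately avoids checksumming file contents. The Attrs type, the dict-based cache, and the reconnect function are hypothetical names invented for this sketch, not FS-Cache interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attrs:
    """Attributes recorded alongside a cache item (hypothetical layout)."""
    mtime: float
    ctime: float
    size: int
    change_attr: int


def reconnect(cache, key, current):
    """Return cached data if its recorded attributes still match the server's.

    On any mismatch the item is reclaimed (deleted) rather than reused.
    """
    entry = cache.get(key)
    if entry is None:
        return None
    recorded, data = entry
    if recorded == current:
        return data
    del cache[key]  # ambiguous or stale: reclaim instead of re-using
    return None
```

The cost of this check is one attribute fetch per object, independent of file size, at the price of occasionally discarding data that was in fact still good.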

There are three obvious ways of dealing with the problems (ignoring the
fact that all cases have on-reconnection coherency to deal with whatever):

 (a) No sharing at all.
     Cache coherency is what it is now with NFS, but reconnection must be
     managed.  A key must be generated for each mount to distinguish that
     mount from an overlapping mount that might contain the same files.

     These keys must be unique (and uniqueness must be enforced) unless two
     superblocks are guaranteed disjoint (eg: on different servers), or are
     guaranteed to share anyway (eg: exact same parameter sets and
     nosharecache not specified).

 (b) Fully shared.
     Cache coherency is a non-issue.  Reconnection is a non-issue.  Any
     particular server inode is guaranteed to be represented by a single
     inode on the client, both in the superblock and the pagecache, and by
     a single FS-Cache cache object.

     The downside of this is that sharing must take priority over different
     connection parameters.  R/O vs R/W can be dealt with relatively easily
     as I believe it's a local phenomenon, and is dealt with before the
     filesystem is consulted.  There are patches to do this.

 (c) Implicit full sharing between cached mountpoints; uncached mountpoints
     need not be shared.

     Cached mountpoints have the properties of (b), uncached mountpoints
     are left to themselves.

Note that redundant disk usage is undesirable, but unlikely to cause a real
problem, such as an oops.  Non-unique keys, on the other hand, are a
problem.  Having non-shared local inodes sharing cache objects causes even
more problems, and I don't want to go there.

Nothing you say in the rest of your proposal convinces me that having
multiple caches for the same export is really more than a theoretical
issue.

Okay.  So how do you do reconnection?

The simplest way from what I see is to require that the administrator
specify everything, but this is probably not what you want if you're
distributing NFS mounts by NIS, say.

Automatic configuration is preferred.  For example, NFS with Kerberos has
an administrative scaling problem because some local administration
(creating a keytab and registering the client with KDC) is required for
every client that joins a realm.

The next simplest way is to bind all the differentiation parameters (see
nfs_compare_mount_options()) into a key and use that, plus a uniquifier
from the administrator if NFS_MOUNT_UNSHARED is set.

It gives us the proper legacy behavior, but as soon as the administrator
changes a mount option, all previously cached items for that mount point
become orphans.
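
The orphaning effect can be seen in a small Python sketch that folds a set of mount options into a key. This is only an illustration: mount_key, its parameters, and the choice of SHA-256 are hypothetical, and the option strings stand in for whatever parameters nfs_compare_mount_options() actually compares.

```python
import hashlib


def mount_key(server, export, options, uniquifier=None):
    """Fold the differentiating mount parameters into one cache key.

    Options are sorted so that their order on the command line does not
    change the key. The uniquifier models an administrator-supplied tag
    for unshared mounts.
    """
    parts = [server, export, ",".join(sorted(options))]
    if uniquifier is not None:
        parts.append(uniquifier)
    return hashlib.sha256("\0".join(parts).encode()).hexdigest()


k1 = mount_key("server", "/export", ["rw", "tcp", "noac"])
k2 = mount_key("server", "/export", ["tcp", "noac", "rw"])  # same options, reordered
k3 = mount_key("server", "/export", ["rw", "udp", "noac"])  # one option changed
```

Here k1 and k2 are equal, while k3 differs from both, so everything cached under k1 is unreachable once the administrator switches from tcp to udp.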

As useful as the feature is, one can also argue that mounting the same
export multiple times is infrequent in most normal use cases.  Practically
speaking, why do we really need to worry about it?

Because it's possible.  Because it has to be considered.  Because, as you
said, people do it.  Because if I don't deal with it, the kernel will oops
when NFS asks FS-Cache to do something it doesn't support.

I can't just say: "Well, it'll oops if you configure your NFS shares like
that, so don't.  It's not worth me implementing round it.".

What causes that instability?  Why can't you insulate against the
instability but allow cache incoherence and aliased cache items?

Local file systems are fraught with cases where they protect their internal
metadata aggressively at the cost of not keeping the disk up to date with
the memory version of the file system.

Similar compromises might benefit FS-cache.  In other words, FS-cache for
NFS file systems may be less functional than for, say, AFS, to allow the
cache to operate reliably.

The real problem here is that the NFS protocol itself does not support
strong cache coherence.  I don't see why the Linux kernel must fix that
problem.

So you're arguing there shouldn't be local caching for NFS?  Or that there
shouldn't be persistent local caching for NFS?

I'm arguing that cache coherence isn't supported by the NFS protocol, so
how can FS-cache *require* a facility to support persistent local caching
that the protocol doesn't have in the first place?

NFS client implementations do the best they can; there are always scenarios
where coherence issues cause behavior no-one expects.  Usually NFS clients
handle ambiguous cases by invalidating their caches.  Invalidating is cheap
for in-memory caches.  Frequent invalidation is going to be expensive for
FS-cache, since it requires some disk I/O (and perhaps even file
truncation).  One reason why chunk caching is better than whole-file
caching is that it bounds the time and effort to recycle a cache item.

AFS assigns universally unique identities to servers, volumes, and files.
NFS doesn't guarantee unique identities to servers or exports, and file
handles are supposed to be unique only on a given server [*].  And
unfortunately file handles can be re-used by the server without any
indication to the client that the file handle it has cached is no longer
the same file (see the "out_fileid" label in
fs/nfs/inode.c:nfs_update_inode).  AFS provides client-visible generation
IDs in its inode numbers for this case.

Thus NFS itself does not provide any good way to help you sort FS-cache
cache items outside of a single export.  A proper FS-cache implementation
thus cannot depend on server/export identity to guarantee the singularity
of cache items.

So FS-cache will have a hard time guaranteeing that there is only one item
in its cache that maps to a given NFS server file.  It may also be
difficult to guarantee that multiple NFS server files do not map onto the
same local cache item (file handle re-use).

This suggests to me that the cache aliasing problem is unsolvable for NFS,
so you should find a way to make FS-cache work in a world where cache
aliasing is a fact of life.

Lastly, there's already a mount option that allows admins to control
whether the page and attribute caches are shared -- "sharecache".  Is this
mount option not adequate for persistent caching?

Adequate in what way?  It doesn't currently automatically guarantee sharing
of overlapping superblocks.  It merely disables nosharecache, which
explicitly disables cache sharing.

The current problem with "sharecache" is that the mount options on
subsequent mounts of the same export are silently ignored.  You are
proposing the same behavior for FS-cache-managed mount points, which means
we're spreading bad UI behavior further.

At least there should be a warning that explains why a file system that was
mounted with "rw,noac,tcp" is behaving like it's "ro,ac,udp".

Ideally, if we must have cache sharing, the behavior should be: if the
mount options, the server, and the fsid are the same, then the cache should
be shared.  If any element of that tuple differs, then a unique cache is
used for that mount point (within the limits of being able to determine the
unique identity of a server and export).
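
A minimal Python sketch of that rule, assuming the server and fsid can be identified reliably (which, as noted above, NFS does not guarantee). CacheRegistry and cache_for are hypothetical names, and the integer ids stand in for real cache objects.

```python
class CacheRegistry:
    """Hand out one cache per distinct (server, fsid, mount options) tuple."""

    def __init__(self):
        self._caches = {}
        self._next_id = 0

    def cache_for(self, server, fsid, options):
        # frozenset makes the option set hashable and order-independent.
        ident = (server, fsid, frozenset(options))
        if ident not in self._caches:
            # Any difference in the tuple yields a fresh, unshared cache.
            self._caches[ident] = self._next_id
            self._next_id += 1
        return self._caches[ident]


reg = CacheRegistry()
first = reg.cache_for("server", (1, 2), ["rw", "noac", "tcp"])
same = reg.cache_for("server", (1, 2), ["tcp", "rw", "noac"])
other = reg.cache_for("server", (1, 2), ["ro", "ac", "udp"])
```

With this rule the second mount shares the first mount's cache, and the third gets its own instead of silently inheriting "rw,noac,tcp" behavior.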


--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

[*] Section 4 of RFC 3530 states:

The filehandle in the NFS protocol is a per server unique identifier for a
filesystem object.  The contents of the filehandle are opaque to the
client.  Therefore, the server is responsible for translating the
filehandle to an internal representation of the filesystem object.
