Re: Thinking outside the box on file systems

On Aug 16, 2007, at 11:09:16, Phillip Susi wrote:
Kyle Moffett wrote:
Let me repeat myself here: Algorithmically you fundamentally CANNOT
implement inheritance-based ACLs without one of the following
(although if you have some other algorithm in mind, I'm listening):
(A)  Some kind of recursive operation *every* time you change an
inheritable permission
(B)  A unified "starting point" from which you begin *every*
access-control lookup (or one "starting point" per useful semantic
grouping, like a namespace).
The "(A)" is presently done in userspace and that's what you want to
avoid.  As to (B), I will attempt to prove below that you cannot
implement "(B)" without breaking existing assumptions and restricting
a very nice VFS model.
No recursion is needed because only one acl exists, so that is the
only one you need to update, at least on disk.  Any cached acls in
memory of descendant objects would need to be updated, but the number
of those should be relatively small.  The starting point would be the
directory you start the lookup from.  That may be the root, or it
may be some other directory that you have a handle to, and thus
already has its effective acl computed.
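(Just so we are arguing about the same thing, here is roughly the
scheme as I understand it, in sketch form.  The struct names and the
merge rule are my own illustration, not anything you have specified;
the point is only that a directory's effective ACL is computed by
walking down from some starting point whose effective ACL is already
known and folding in inheritable entries along the way.)

  /* Hedged sketch of lookup-time effective-ACL computation.
   * "struct ace", "struct acl" and effective_acl() are hypothetical
   * illustrations, not real kernel API. */
  struct ace { unsigned uid; unsigned mask; int inheritable; };
  struct acl { int count; struct ace entries[32]; };

  static struct acl effective_acl(const struct acl *parent_eff,
                                  const struct acl *own)
  {
      struct acl eff = { 0 };
      /* inheritable entries flow down from the parent... */
      for (int i = 0; i < parent_eff->count && eff.count < 32; i++)
          if (parent_eff->entries[i].inheritable)
              eff.entries[eff.count++] = parent_eff->entries[i];
      /* ...and are combined with the directory's own entries. */
      for (int i = 0; i < own->count && eff.count < 32; i++)
          eff.entries[eff.count++] = own->entries[i];
      return eff;
  }

Everything below is about what happens to cached results of that walk
when the tree above them changes.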
Problem 1: "updating cached acls of descendant objects":  How do you  
find out what a 'descendant object' is?  Answer:  You can't without  
recursing through the entire in-memory dentry tree.  Such recursion  
is lock-intensive and has poor performance.  Furthermore, you have to  
do the entire recursion as an atomic operation; other cross-directory  
renames or ACL changes would invalidate your results halfway through  
and cause race conditions.
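To make the cost concrete, here is a hedged sketch of what that
"update the cached acls of descendant objects" step has to look like.
This is conceptual C over a made-up tree structure, not real VFS
code; in the kernel the equivalent walk would be over the dentry tree
(d_subdirs) under per-dentry locks:

  /* Conceptual sketch only -- NOT real VFS code.  The whole walk has
   * to appear atomic against concurrent renames and ACL changes. */
  struct dnode {
      int acl_stale;              /* "cached effective ACL is invalid" */
      int nchildren;
      struct dnode **children;
  };

  static void invalidate_subtree(struct dnode *dir)
  {
      /* lock(dir); -- one lock per node, held across the recursion */
      dir->acl_stale = 1;
      for (int i = 0; i < dir->nchildren; i++)
          invalidate_subtree(dir->children[i]);
      /* unlock(dir); */
  }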
Oh, and by the way, the kernel has no real way to go from a dentry to  
a (process, fd) pair.  That data simply is not maintained because it  
is unnecessary and inefficient to do so.  Without that data you  
*can't* determine what is "dependent".  Furthermore, even if you  
could it still wouldn't work because you can't even tell which path  
the file was originally opened via.  Say you run:
  mount --bind /mnt/cdrom /cdrom
  umount /mnt/cdrom

Now any process which had a cwd or open directory handle in "/cdrom" is STILL USING THE ACLs from when it was mounted as "/mnt/cdrom". If you have the same volume bind-mounted in two places you can't easily distinguish between them. Caching permission data at the vfsmount won't even help you because you can move around vfsmounts as long as they are in subdirectories:
  mkdir -p /a/b/foo
  mount -t tmpfs tmpfs /a/b/foo
  mv /a/b /quux
  umount /quux/foo

At this point you would also have to look at vfsmounts during your recursive traversal and update their cached ACLs too.
Problem 2:  "Some other directory that you have a handle to":  When  
you are given this relative path and this cwd ACL, how do you  
determine the total ACL of the parent directory:
path: ../foo/bar
cached cwd total-ACL:
  root rwx (inheritable)
  bob rwx (inheritable)
  somegroup rwx (inheritable)
  jane rwx
".." partial-ACL
  root +rwx (inheritable)
  somegroup +rx (inheritable)

Answer: you can't. For example, if "/" had the permission 'root +rwx (inheritable)', and nothing else had subtractive permissions, then the "root +rwx (inheritable)" in the parent dir would be a no-op, but you can't tell that without storing a complete parent directory history.
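The underlying reason is that the merge which produced the cwd's
total-ACL is not invertible.  A two-minute illustration with plain
permission bitmasks (the bit encoding is mine, purely for the
example):

  /* Two different parents become indistinguishable after the merge. */
  #include <stdio.h>

  int main(void)
  {
      unsigned parent_a = 0x0, parent_b = 0x7; /* two possible parents */
      unsigned inherit  = 0x7;                 /* "root +rwx (inheritable)" */
      /* Both parents produce the same cwd ACL once merged: */
      printf("%x %x\n", parent_a | inherit, parent_b | inherit);
      /* Given only the merged value and 'inherit', parent_a and
       * parent_b cannot be told apart, so "../foo/bar" cannot be
       * resolved from the cwd ACL alone. */
      return 0;
  }

The same holds for any merge rule that only adds entries: once
merged, the parent's individual contribution is gone.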
Now assume that I "mkdir /foo && set-some-inheritable-acl-on /foo &&  
mv /home /foo/home".  Say I'm running all sorts of X apps and GIT and  
a number of other programs and have some conservative 5k FDs open on
/home.  This is actually something I've done before (without the  
ACLs), albeit accidentally.  With your proposal, the kernel would  
first have to identify all of the thousands of FDs with cached ACL  
data across a very large cache-hot /home directory.  For each FD, it  
would have to store an updated copy of the partial-ACL states down  
its entire path.  Oh, and you can't do any other ACL or rename  
operations in the entire subtree while this is going on, because that  
would lead to the first update reporting incorrect results and racing  
with the second.  You are also extremely slow, deadlock-prone, and  
memory hungry, since you have to take an enormous pile of dentry  
locks while doing the recursion.  Nobody can even open files with  
relative paths while this is going on because the cached ACLs are in  
an intermediate and inconsistent state: they're updated but the  
directory isn't in its new position yet.
Unsolvable problems with each option:
(1.a.I)
You just broke all sorts of chrooted daemons. When I start bind in its chroot jail, it does the following:
  chdir("/private/bind9");
  chroot(".");
  setgid(...);
  setuid(...);
The "/private" directory is readable only by root, since root is the only one who will be navigating you into these chroots for any reason. You only switch UID/GID after the chroot() call, at which point you are inside of a sub-context and your cwd is fully accessible. If you stick an inheritable ACL on "/private", then the "cwd" ACL will not allow access by anybody but root and my bind won't be able to read any config files.
If you want the directory to be root accessible but the files  
inside to have wider access then you set the acl on the directory  
to have one ace granting root access to the directory, and one ace  
that is inheritable granting access to bind.  This latter ace does  
not apply to the directory itself, only to its children.
This is completely opposite the way that permissions currently  
operate in Linux.  When I am chrooted, I don't care about the  
permissions of *anything* outside of the chroot, because it simply  
doesn't exist.  Furthermore you still don't answer the "computing ACL  
of parent directory requires lots of space" problem.

You also break relative paths and directory-moving. Say a process does chdir("/foo/bar"). Now the ACL data in "cwd" is appropriate for /foo/bar. If you later chdir("../quux"), how do you unapply the changes made when you switched into that directory? For inheritable ACLs, you can't "unapply" such an ACL state change unless you save state for all the parent directories, except... What happens when you are in "/foo/bar" and another process does "mv /foo/bar /foobar/quux"? Suddenly any "cwd" ACL data you have is completely invalid and you have to rebuild your ACLs from scratch. Moreover, if the directory you are in was moved to a portion of the filesystem not accessible from your current namespace then how do you deal with it?
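You can watch the "cwd follows the directory, not the path" behaviour
from plain userspace; the directory names below are made up for the
example:

  /* Demonstrates that cwd follows a rename: any per-path ACL state
   * computed at chdir() time is stale immediately afterwards.
   * Error handling omitted; paths are arbitrary examples. */
  #include <stdio.h>
  #include <sys/stat.h>
  #include <sys/types.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[4096];
      mkdir("/tmp/acl-demo", 0755);
      mkdir("/tmp/acl-demo/bar", 0755);
      chdir("/tmp/acl-demo/bar");
      /* Another process could equally well do this rename. */
      rename("/tmp/acl-demo", "/tmp/acl-demo-moved");
      getcwd(buf, sizeof(buf));
      printf("%s\n", buf);        /* prints /tmp/acl-demo-moved/bar */
      return 0;
  }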
Yes, if /foo/quux is not already cached in memory, you would have  
to walk the tree to build its acl.  /foo should already be cached  
in memory so this work is minimal.  Is this so horrible of a problem?
As for moving, it is handled the same way as any other event that  
makes cwd go away, such as deleting it or revoking your access; cwd  
is now invalid.
No, you aren't getting it:  YOUR CWD DOES NOT GO AWAY WHEN YOU MOVE  
IT OR UMOUNT -L IT.  NEITHER DO OPEN DIRECTORY HANDLES.  Sorry for  
yelling but this is the crux of the point I am trying to make.  Any  
permissions system which cannot handle a *completely* discontiguous  
filesystem space cannot work on Linux; end of story.  The primary  
reason behind that is all sorts of filesystem operations are  
internally discontiguous because it makes them much more efficient.   
By attempting to "force" the VFS to pretend like everything is  
contiguous you are going to break horribly in a thousand different  
corner cases that simply don't exist at the moment.

For example:
NS1 has the / root dir of /dev/sdb1 mounted on /mnt
NS2 has the /bar subdir of /dev/sdb1 mounted on /mnt
Your process is in NS2 and does chdir("/mnt/quux"). A user in NS1 does: "mv /mnt/bar/quux /mnt/quux". Now your "cwd" is in a directory on a filesystem you have mounted, but it does not correspond *AT ALL* to any path available from your namespace.
Which would be no different than if they just deleted the entire  
thing.  Your cwd no longer exists.
No, your cwd still exists and is full of files.  You can still  
navigate around in it (same with any open directory handle).  You can  
still open files, chdir, move files, etc.  There isn't even a way for  
the process in NS1 to tell the processes in NS2 that its directories  
were rearranged, so even a simple "NS1# mv /mnt/bar/a/somedir
/mnt/bar/b/somedir" is not going to work.

Another example:
Your process has done dirfd=open("/media/cdrom/somestuff") when the admin does "umount -l /media/cdrom". You still have the CD-ROM open and accessible but IT HAS NO PATH. It isn't even mounted in *any* namespace, it's just kind of dangling waiting for its last users to go away. You can still do fchdir(dirfd), openat(dirfd, "foo/bar", ...), open("./foo"), etc.
What's this got to do with acls?  If you are asking what effect the  
umount has on the acls of the cdrom, the answer is none.  The acls  
are on the disc and nothing on the disc has changed.
But you said above  "Yes, if /foo/quux is not already cached in  
memory, then you would have to walk the tree to build its ACL".  Now  
assume that instead of "/foo/quux", you are one directory deep in the  
now-unmounted CDROM and you try to open "../baz/quux".  In order to  
get at the ACL of the parent directory it has to have an absolute  
path somewhere, but at that point it doesn't.
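This is trivially observable from userspace.  The mount point and
file names below are examples only, and you have to run the
"umount -l" from another shell while the program waits:

  /* Keep a directory fd across "umount -l" and keep using it. */
  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int dirfd = open("/media/cdrom/somestuff", O_RDONLY | O_DIRECTORY);
      if (dirfd < 0)
          return 1;
      puts("now run 'umount -l /media/cdrom' elsewhere, then hit enter");
      getchar();
      /* Still works: the fd pins the detached mount.  But there is no
       * absolute path left to hand to any ACL recomputation. */
      int fd = openat(dirfd, "../baz/quux", O_RDONLY);
      printf("openat after lazy umount: %d\n", fd);
      return 0;
  }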

No, this is correct because in the root directory "/", the ".." entry is just another link to the root directory. So the absolute path "/../../../../../.." is just a fancy name for the root directory. The above jail-escape-as-root exploit is possible because it is impossible to determine whether a directory is or is not a subentry of another directory without an exhaustive search. So when your "cwd" points to a path outside of the chroot, the one special case in the code for the "root" directory does not ever match and you can "chdir" all the way up to the real root. You can even do an fstat() after every iteration to figure out whether you're there or not!
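For the record, the escape described above looks roughly like this in
code.  It assumes the process is root (or can otherwise call
chroot()) inside the jail; the stat(".") per iteration is the same
"are we there yet" check as the fstat() mentioned above:

  /* Classic chroot() escape sketch: move cwd outside the root with a
   * second chroot(), then chdir("..") until "." stops moving, i.e. we
   * have reached the real root.  Error handling omitted. */
  #define _GNU_SOURCE
  #include <sys/stat.h>
  #include <sys/types.h>
  #include <unistd.h>

  static void escape_chroot(void)
  {
      struct stat prev, cur;
      mkdir("x", 0700);
      chroot("x");                /* cwd is now outside the new root */
      stat(".", &prev);
      for (;;) {
          chdir("..");
          stat(".", &cur);
          if (cur.st_dev == prev.st_dev && cur.st_ino == prev.st_ino)
              break;              /* ".." stopped going anywhere */
          prev = cur;
      }
      chroot(".");                /* now rooted at the real "/" */
  }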
Ohh, I see... yes... that is a very clever way for root to misuse  
chroot().  What does it have to do with this discussion?
What it "has to do" is it is part of the Linux ABI and as such you  
can't just break it because it's "inconvenient" for inheritable  
ACLs.  You also can't make a previously O(1) operation take lots of  
time, as that's also considered "major breakage".

With this you just got into the big-ugly-nasty-recursive-behavior again. Say I untar 20 kernel source trees and then have my program open all 1000 available FDs to various directories in the kernel source tree. Now I run 20 copies of this program, one for each tree, still well within my ulimits even on a conservative box. Now run "mv dir_full_of_kernel_sources some/new/dir". The only thing you can do to find all of the FDs is to iterate down the entire subdirectory tree looking for open files and updating their contexts one-by-one. Except you have 20,000 directory FDs to update. Ouch.
Ok, so you found a pedantic corner case that is slow.  So?  And it  
is still going to be faster than chmod -R.
"Pedantic corner case"?  You could do the same thing even *WITHOUT*  
all the processes holding open FDs, you would still have to iterate  
over the entire in-cache portion of the subtree in order to verify  
that there are no open FDs on it.  Yet again you would also run into  
the problem that we don't have *ANY* dentry-to-filehandle mapping in  
the kernel.

To sum up, when doing access control the only values you can safely and efficiently get at are:
(A)  The dentry/inode
(B)  The superblock
(C)  *Maybe* the vfsmount if those patches get accepted
Any access control model which tries to poke other values is just going to have a shitload of corner cases where it just falls over.
If by falls over you mean takes some time, then yes.... so what?
Converting a previously O(1) operation into an O(number-of-subdirs)  
operation is also known as "a major regression which we don't do a  
release till we get it fixed".  On boxes where number-of-subdirs runs  
into the millions, that would slow it to a painful crawl.
By the way, I'm done with this discussion since you don't seem to be  
paying attention at all.  Don't bother replying unless you've  
actually written testable code you want people on the list to look  
at.  I'll eat my own words if you actually come up with an algorithm  
which works efficiently without introducing regressions.
Cheers,
Kyle Moffett
