Re: Futexes and network filesystems.

On Nov 20, 2007, at 17:53:52, Eric W. Biederman wrote:

> I had a chance to think about this a bit more, and realized that the problem is that futexes don't appear to work on network filesystems, even if the network filesystems provide coherent shared memory.
> It seems to me that we need to have a call that gets a unique token for a process, per filesystem, for use in futexes (especially robust futexes). Say get_fs_task_id(const char *path);
> On local filesystems this could just be the pid as we use today, but for filesystems that can be accessed from contexts with potentially overlapping pid values this could be something else. It is an extra syscall in the preparation path, but it should be hardly more expensive than the current getpid().
> Once we have fixed the futex infrastructure to be able to handle futexes on network filesystems, the pid namespace case will be trivial to implement.

Actually, I would think that get_vm_task_id(void *addr) would be a more useful interface. The call would still be a relatively simple lookup to find the struct file associated with the particular virtual mapping, but it would be race-free from the perspective of userspace and would not require that we somehow figure out the file descriptor associated with a particular mmap() (which may be closed by this point in time). Useful extensions would be get_fd_task_id(int fd) and get_fs_task_id(const char *path), but those are less important.
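
To make the intended use concrete, here is a rough userspace sketch of how a libc might take a lock word that lives in a shared mapping. None of this exists today: get_vm_task_id() is only the call proposed above, and the stub below simply returns getpid(), which is what I would expect for a mapping of a local file. The helper names shm_trylock() and shm_unlock() are made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

/* HYPOTHETICAL: stand-in for the proposed get_vm_task_id() call.
 * No such syscall exists. This stub ignores the address and returns
 * the pid, as a local filesystem might under the proposal. */
static uint32_t get_vm_task_id(void *addr)
{
    (void)addr; /* a real call would look up the file behind the mapping */
    return (uint32_t)getpid();
}

/* Try to take an unlocked (0) lock word by storing our token in it.
 * Returns 1 on success, 0 if another owner already holds the lock. */
static int shm_trylock(_Atomic uint32_t *word)
{
    uint32_t expected = 0;
    uint32_t token = get_vm_task_id((void *)word);

    assert(token != 0); /* 0 is reserved to mean "unlocked" */
    return atomic_compare_exchange_strong(word, &expected, token);
}

/* Release the lock, but only if the word still holds our token.
 * Returns 1 on success, 0 if we were not the owner. */
static int shm_unlock(_Atomic uint32_t *word)
{
    uint32_t token = get_vm_task_id((void *)word);

    return atomic_compare_exchange_strong(word, &token, 0);
}
```

The point of asking for the token by address is that the caller only needs the pointer to the lock word, not the file descriptor or path that produced the mapping.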

The other important thing is to ensure that somehow the numbers are considered unique only within the particular domain of a container, such that you can migrate a container from one system to another even using a simple local ext3 filesystem (on a networked block device) and still be able to have things work properly even after the migration. Naturally this would only work with an upgraded libc, but I think that's a reasonable requirement to enforce for migration of futexes and cross-network futexes.

Even for network filesystems which don't implement coherent shared memory, you might add a memexcl() system call which (when used by multiple cooperating processes) ensures that a given page is only ever mapped by at most one computer accessing a given network filesystem. The page-outs and page-ins when shuttling that page across the network would be expensive, but I believe the cost would be reasonable for many applications and it would allow traditional atomic ops on the mapped pages to take and release futexes in the uncontended case.
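
A minimal sketch of that uncontended path, under heavy assumptions: memexcl() is only the call suggested above and is stubbed as a no-op here, an anonymous shared page stands in for a page of a file on a network filesystem, and the lock word uses the common 0 = free, 1 = locked, 2 = locked with waiters convention. The names map_lock_page(), fast_lock() and fast_unlock() are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* for MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* HYPOTHETICAL: stand-in for the proposed memexcl() call. A real
 * implementation would make sure the page is mapped by at most one
 * machine at a time; this stub does nothing and reports success. */
static int memexcl(void *addr, size_t len)
{
    (void)addr;
    (void)len;
    return 0;
}

/* Map one shared page to hold the lock word and mark it exclusive.
 * An anonymous mapping is used here instead of a network file. */
static _Atomic uint32_t *map_lock_page(size_t page_size)
{
    void *p = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;
    if (memexcl(p, page_size) != 0) {
        munmap(p, page_size);
        return NULL;
    }
    return (_Atomic uint32_t *)p; /* fresh anonymous pages are zeroed */
}

/* Uncontended take: 0 -> 1 with one atomic op, no syscall.
 * Returns 1 if we got the lock, 0 if the slow path would be needed. */
static int fast_lock(_Atomic uint32_t *w)
{
    uint32_t expected = 0;

    return atomic_compare_exchange_strong(w, &expected, 1);
}

/* Release: returns 1 if nobody was waiting, 0 if a futex wake
 * would be needed (the word was 2). */
static int fast_unlock(_Atomic uint32_t *w)
{
    uint32_t prev = atomic_exchange(w, 0);

    assert(prev != 0); /* unlocking a free lock is a caller bug */
    return prev == 1;
}
```

Only the contended case would have to enter the kernel; as long as the page stays resident on one machine, taking and releasing the lock costs the same as on a local mapping.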


Cheers,
Kyle Moffett
