Re: [PATCH][RFC][0/4] InfiniBand userspace verbs implementation

    Timur> With mlock(), we don't need to use get_user_pages() at all.
    Timur> Arjan tells me the only time an mlocked page can move is
    Timur> with hot (un)plug of memory, but that isn't supported on
    Timur> the systems that we support.  We actually prefer mlock()
    Timur> over get_user_pages(), because if the process dies, the
    Timur> locks automatically go away too.

There is actually another way pages can move, with both
get_user_pages() and mlock(): copy-on-write after a fork().  When
userspace does a fork(), the PTEs for private writable mappings are
marked read-only, and if the original process then writes to the
page, a new page is allocated and mapped at the original virtual
address.  The hardware, meanwhile, keeps doing DMA to the old page.

This is actually a pretty big pain, because the only good solution
seems to be for the kernel to mark these registered regions as
VM_DONTCOPY.  Right now this means that driver code ends up monkeying
with vm_flags for user vmas.

Does it seem reasonable to add a new system call to let userspace mark
memory it doesn't want copied into forked processes?  Something like

	long sys_mark_nocopy(unsigned long addr, size_t len, int mark)

which would set VM_DONTCOPY if mark != 0, and clear it if mark == 0.
A better name would be gratefully accepted...
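
To make the intended semantics concrete, here is a minimal userspace
sketch.  sys_mark_nocopy() is only a proposal and does not exist, so
the sketch assumes a later Linux kernel (2.6.16 or newer), where
madvise() with MADV_DONTFORK / MADV_DOFORK sets and clears VM_DONTCOPY
on a range, and uses it as a stand-in.  The helper names mark_nocopy()
and child_can_read() are made up for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for madvise() and MAP_ANONYMOUS */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical userspace stand-in for the proposed sys_mark_nocopy().
 * Assumption: madvise(MADV_DONTFORK/MADV_DOFORK) is available.
 * madvise() requires addr to be page-aligned.  Returns 0 on success,
 * -1 with errno set on failure.
 */
static long mark_nocopy(void *addr, size_t len, int mark)
{
	return madvise(addr, len, mark ? MADV_DONTFORK : MADV_DOFORK);
}

/*
 * Fork a child that reads one byte at p.  Returns 1 if the child could
 * read it, 0 if the child faulted, -1 if fork() or waitpid() failed.
 */
static int child_can_read(const volatile char *p)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		char c = *p;	/* faults if the range was not inherited */

		(void)c;
		_exit(0);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

With a page marked, the child has no mapping at that address at all,
so the parent never shares the page with anyone and a write in the
parent cannot trigger copy-on-write.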

Then to register memory for RDMA, userspace would call
sys_mark_nocopy() (with appropriate accounting to handle possibly
overlapping regions) and the kernel would call get_user_pages().  The
get_user_pages() is of course required because the kernel can't trust
userspace to keep the pages locked.  mlock() would no longer be
necessary.  We can trust userspace to call sys_mark_nocopy() as
needed, because a process can only hurt itself and its children by
misusing the sys_mark_nocopy() call.
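
The accounting for overlapping regions could be as simple as a
per-page reference count kept in userspace, so that a page only loses
its mark when the last registration covering it goes away.  The
following is a rough sketch under stated assumptions, not a real
library interface: all names (nocopy_get(), nocopy_put(), struct
page_ref, MAX_PAGES) are invented here, the table is a fixed-size
array with linear search, and madvise() with MADV_DONTFORK /
MADV_DOFORK (Linux 2.6.16 and later) again stands in for the proposed
sys_mark_nocopy().

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for madvise() */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 1024		/* arbitrary table size for this sketch */

struct page_ref {
	uintptr_t page;		/* page-aligned address */
	unsigned int count;	/* registrations covering it; 0 = free slot */
};

static struct page_ref refs[MAX_PAGES];

/* Find the slot for a page; with create set, hand out a free slot. */
static struct page_ref *find_slot(uintptr_t page, int create)
{
	struct page_ref *free_slot = NULL;

	for (size_t i = 0; i < MAX_PAGES; i++) {
		if (refs[i].count && refs[i].page == page)
			return &refs[i];
		if (!refs[i].count && !free_slot)
			free_slot = &refs[i];
	}
	if (create && free_slot)
		free_slot->page = page;
	return create ? free_slot : NULL;
}

/* Number of registrations currently covering the page at addr. */
static unsigned int nocopy_count(const void *addr)
{
	uintptr_t ps = (uintptr_t)sysconf(_SC_PAGESIZE);
	struct page_ref *r = find_slot((uintptr_t)addr & ~(ps - 1), 0);

	return r ? r->count : 0;
}

/* Drop one reference on each page in [start, end). */
static void put_pages(uintptr_t start, uintptr_t end, uintptr_t ps)
{
	for (uintptr_t p = start; p < end; p += ps) {
		struct page_ref *r = find_slot(p, 0);

		assert(r);	/* an unbalanced put is a caller bug */
		if (--r->count == 0)
			madvise((void *)p, ps, MADV_DOFORK);
	}
}

/* Take a reference on every page touched by [addr, addr + len). */
static int nocopy_get(void *addr, size_t len)
{
	uintptr_t ps = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = (uintptr_t)addr & ~(ps - 1);
	uintptr_t end = ((uintptr_t)addr + len + ps - 1) & ~(ps - 1);

	for (uintptr_t p = start; p < end; p += ps) {
		struct page_ref *r = find_slot(p, 1);

		/* Only the first registration of a page marks it. */
		if (!r || (!r->count &&
			   madvise((void *)p, ps, MADV_DONTFORK))) {
			put_pages(start, p, ps);	/* roll back */
			return -1;
		}
		r->count++;
	}
	return 0;
}

/* Release a registration made with nocopy_get(addr, len). */
static void nocopy_put(void *addr, size_t len)
{
	uintptr_t ps = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = (uintptr_t)addr & ~(ps - 1);
	uintptr_t end = ((uintptr_t)addr + len + ps - 1) & ~(ps - 1);

	put_pages(start, end, ps);
}
```

A registration would call nocopy_get() before asking the kernel to pin
the pages, and nocopy_put() after deregistering.  Counting per page
rather than per region means two registrations that share a page do
not unmark it under each other.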

If this seems reasonable then I can code a patch.

 - R.