Re: [PATCH -mm 0/7] execns syscall and user namespace

Cedric Le Goater <[email protected]> writes:

> Hello !
>
> Hopefully, we will soon see each other at OLS. We need some synchronous
> interaction !
>
> Eric W. Biederman wrote:
>
>>>> Is it not possible to ensure what you are trying to ensure with
>>>> a good user space executable?
>>> unshare() is unsafe for some namespaces because namespaces can reference
>>> each other. For the ipc namespace, example are shm ids vs. vma, sem ids vs.
>>> semundos, msq vs. netlink sockets. for the user namespace, open files. So
>>> it seems reasonable to provide a way to unshare namespaces from a clean
>>> process context.
>> 
>> It is perfectly legitimate to have a shared memory region memory mapped
>> from another namespace. 
>
> then after unshare, a process can be in ipc namespace B with a shared
> memory segment from ipc namespace A without any id for this segment. this
> is not very consistent. the same process will also be able to modify the
> ipc namespace B without being in this namespace. ugly. It looks like an
> issue that should be solved.
>
> I think namespace should enforce strict isolation. nop ?
>
>> Yes sem ids versus semunds is an issue but it just requires you to unshare
>> one at the same time you unshare the other, or to simply clone a new
>> namespace.
>
> hmm, semids the from ipc namespace are stored in task->sysv_sem. i would
> forbid the unshare/clone in that case or flush the semundos like in
> exit_sem(). but it's easier not to have any, like in a clean process image.
>
>> I'm not familiar with the msq vs netlink socket issue.  
>
> mq_notify can use a netlink socket to send an event back user space.

Ok.  That one is a mess, and I almost recall seeing that.  A big
chunk of that is a general netlink socket problem.  Getting enough
context in a netlink socket is a challenge because you can't use current.
I do think solving that is achievable though.  Just very peculiar.

>> As for the user namespace vs open files.  If we have any issues with open
>> files in any namespace that sounds like an implementation bug to me.
>
> user_struct does accounting on process, open files, locked memory, signals,
> etc. if you unshare such an object, you will need to unshare all others
> namespaces to be consistent. again having a clean process image is easier ...

I just don't see it.    The accounting is about objects and the namespaces
are about names of those objects.

>> I'm not convinced the problems you are seeing are not implementation bugs.
>> For some things clone is still more general then unshare, and clone should
>> be considered the primary user interface, not unshare.
>
> agree on that, i might be focusing a bit too much on the unshare syscall.
> we should work on clone to make sure it has the required restrictions. The
> system is really interlinked and not all namespaces can be unshared standalone.

I completely agree that there are pieces that interlink.

>>> Now, if you try to do that from user space, you will call unshare() then
>>> execve(), which leaves plenty of room and time for nasty things to happen
>>> in between the 2 calls.
>> 
>> I will look more closely but I think there is an important point being missed
>> somewhere.  Pieces of the kernel interact in all sorts of weird and unexpected
>> ways.  If we rely on ourselves always being in the right magic namespace for
>> things to work correctly we are setting ourselves up for trouble.  If we know
>> a namespace implementation will work even when a process has access to
> entities
>> in multiple instances of that namespace we are in much better shape.
>
> having a clean process image is IMHO required for some namespaces :
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113881171017330&w=2

That message is a terrible example.  Unless you are thinking of something
farther down that thread.  User space getting confused when it creates
a container is just an implementation of the container creation code.

Now I'm not certain what you mean by a clean process image, as there are always
left over pieces from the parent.  Clone creates a new task_struct.  exec replaces
the executable.  They both keep files open.


Eric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: [PATCH -mm 0/7] execns syscall and user namespace
  - From: Cedric Le Goater <[email protected]>

References:
- [PATCH -mm 0/7] execns syscall and user namespace
  - From: Cedric Le Goater <[email protected]>
- Re: [PATCH -mm 0/7] execns syscall and user namespace
  - From: [email protected] (Eric W. Biederman)
- Re: [PATCH -mm 0/7] execns syscall and user namespace
  - From: Cedric Le Goater <[email protected]>
- Re: [PATCH -mm 0/7] execns syscall and user namespace
  - From: [email protected] (Eric W. Biederman)
- Re: [PATCH -mm 0/7] execns syscall and user namespace
  - From: Cedric Le Goater <[email protected]>

Prev by Date: Re: [PATCH] reiserfs: fix handling of device names with /'s in them
Next by Date: xfs fails dbench in 2.6.18-rc1-mm1
Previous by thread: Re: [PATCH -mm 0/7] execns syscall and user namespace
Next by thread: Re: [PATCH -mm 0/7] execns syscall and user namespace
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]