Re: [RFC][PATCH 0/4] Generic container system

On 10/4/06, Chandra Seetharaman <[email protected]> wrote:
> > The filesystem was lifted straight from cpuset.c, and hence isn't a
> > reimplementation, it's a migration of code already in the tree. Wasn't

> Ok. I can't call it re-implementing :). But I guess you get the point.
> This is an opportunity to remove the filesystem implementation and use
> existing infrastructure, configfs. configfs didn't exist when cpuset
> went in, otherwise they might have chosen to use it instead of writing
> their own.

I guess I'm mostly agnostic about this issue. But looking at the
configfs interfacing code in rgcs.c versus the filesystem detail code
in container.c, it's about 200 lines vs 300 lines, so not exactly a
huge complexity saving.

> Yes, Joel is aware of it and is open to making that change.
> http://marc.theaimsgroup.com/?l=ckrm-tech&m=115619222129067&w=2. Having
> an in-tree user (this infrastructure + cpuset) for that feature will
> increase the need for it.

Great - when that happens I'd be much happier to move over to configfs.


> My concern is that the container _will_ be considered empty if there is
> no task attached to the container _and_ there is no sub-container.

Right, that's the definition of an empty container.


> CKRM/RG would want an empty container to exist.

Even if the user wants to be told when it's OK to clean up their RG containers?

I don't see why this is a problem - all this does is allow the user
(*if* they want to) to get a callback when there's no longer anything
alive in the container. It doesn't cause the container to be removed
without additional explicit action from userspace.

I don't see it as a feature that I'd make much use of myself, but it
preserves the existing cpusets API.
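
To make the callback concrete, here's a rough sketch of the kind of
notify-on-empty hook I mean. This is *not* the actual container.c code -
container_path(), the struct fields and /sbin/container_release_agent are
invented names for illustration - but it mirrors cpuset's existing
notify_on_release behaviour: the agent gets told, and the container is
left alone unless userspace explicitly removes it.

#include <linux/kmod.h>
#include <linux/slab.h>
#include <linux/limits.h>

static void container_check_release(struct container *cont)
{
	char *path;
	char *argv[3], *envp[3];

	if (!cont->notify_on_release)
		return;
	if (atomic_read(&cont->count) > 0 || !list_empty(&cont->children))
		return;		/* not empty: tasks or sub-containers remain */

	path = kmalloc(PATH_MAX, GFP_KERNEL);
	if (!path || container_path(cont, path, PATH_MAX) < 0)
		goto out;

	argv[0] = "/sbin/container_release_agent";	/* assumed helper binary */
	argv[1] = path;
	argv[2] = NULL;
	envp[0] = "HOME=/";
	envp[1] = "PATH=/sbin:/bin:/usr/sbin:/usr/bin";
	envp[2] = NULL;

	/* Fire and forget - the container itself is left untouched. */
	call_usermodehelper(argv[0], argv, envp, 0);
out:
	kfree(path);
}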


> cpuset may be happy today. But it will not be happy when tens of other
> container subsystems use the same locks to protect their own data
> structures. Using such coarse locking will certainly affect
> scalability.

> > The locks exported by the container system simply guarantee:
> >
> > - if you hold container_manage_lock(), then no-one else will make any
> > changes to the core container groupings, and you won't block readers
> >
> > - if you hold container_lock(), then no changes will be made to the
> > container groupings.
> >
> > While there's nothing to stop a container subsystem from also using
> > them to protect its own data, there's equally nothing to stop a
> > container subsystem from using its own finer-grained locking, if it
> > feels that it needs to. E.g. the simple cpu_acct subsystem that I
> > posted as an example has a per-container spinlock that it uses to
> > protect the cpu stats for that container. So the callback doesn't need
> > to take either container_lock() or container_manage_lock(), it just
> > has to ensure it's in an rcu_read_lock() section and takes its own
> > spinlock.

> No questions about that. But do recall the BKL and how much effort has
> gone into breaking it up to improve scalability (I am not saying that
> these locks are the same as that). When we are starting afresh, why not
> start with scalability in mind?

Looking at the group_lock in RG (including memrc and cpurc), the only
places that it appears to get taken are when creating and destroying
groups, or when changing shares. These are things that are going to be
done by the middleware - how many concurrent middleware systems are
you expecting to need to scale to? :-)
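
For contrast, the subsystem-private locking I described above for
cpu_acct boils down to something like the following. This is a rough
sketch from memory, not the exact patch - task_cpuacct() is an assumed
helper that maps a task to its container's cpu_acct state:

#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>

/* Per-container accounting state, guarded by the subsystem's own lock. */
struct cpuacct {
	spinlock_t lock;
	u64 time;		/* accumulated CPU time for this container */
};

struct cpuacct *task_cpuacct(struct task_struct *tsk);	/* assumed helper */

static void cpuacct_charge(struct task_struct *tsk, u64 delta)
{
	struct cpuacct *ca;
	unsigned long flags;

	/* rcu_read_lock() keeps the container (and its cpuacct state)
	 * alive while we look it up - no container_lock() needed. */
	rcu_read_lock();
	ca = task_cpuacct(tsk);
	spin_lock_irqsave(&ca->lock, flags);	/* irq-safe, in case charging runs from the tick */
	ca->time += delta;
	spin_unlock_irqrestore(&ca->lock, flags);
	rcu_read_unlock();
}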

Incidentally, I have a few questions about the locking in the example
numtasks controller:

- recalc_and_propagate() does:

 lock rgroup->group_lock
 for each child:
   lock child->group_lock
   recalc_and_propagate(child)

So the first thing that the recursive call does is to take the lock
that the previous level of recursion has already taken. How's that
supposed to work?

E.g.

cd /mnt/configfs/res_groups
mkdir foo
echo -n res=numtasks,min_shares=10,max_shares=10 > shares

causes a lockup.

- in dec_usage_count() when a task exits, am I right that the only
locking done is on the group from which the task exited? The calls can
cascade up to the parent if cnt_borrowed is non-zero. In that case, in
the situation where the parent and two child groups each have a
cnt_borrowed of 1, what's to stop the dec_usage_count() calls racing
with one another in the parent, hence leaving parent->cnt_borrowed at
-1, and both calling dec_usage_count() on the grandparent?

I realise that these issues can be fixed, but they do serve to
illustrate the pitfalls (or at least, reduced clarity) associated with
fine-grained locking ...
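
To illustrate the first question, here's a toy version of the recursion
(not the actual numtasks/RG code - the names are invented), along with
one deadlock-free way the recursion could be shaped:

#include <linux/mutex.h>
#include <linux/list.h>

struct rgroup {
	struct mutex group_lock;
	struct list_head children;
	struct list_head sibling;
};

/* The shape described above: the parent takes child->group_lock, and
 * the recursive call's first action is to take that same (non-recursive)
 * mutex again -> self-deadlock. */
static void recalc_broken(struct rgroup *g)
{
	struct rgroup *child;

	mutex_lock(&g->group_lock);
	list_for_each_entry(child, &g->children, sibling) {
		mutex_lock(&child->group_lock);
		recalc_broken(child);	/* relocks child->group_lock: deadlock */
		mutex_unlock(&child->group_lock);
	}
	mutex_unlock(&g->group_lock);
}

/* One deadlock-free shape (assuming the hierarchy itself is protected by
 * some outer lock): each level takes only its own lock while recomputing
 * its shares, and recurses without holding it. */
static void recalc_fixed(struct rgroup *g)
{
	struct rgroup *child;

	mutex_lock(&g->group_lock);
	/* ... recompute this group's shares here ... */
	mutex_unlock(&g->group_lock);

	list_for_each_entry(child, &g->children, sibling)
		recalc_fixed(child);
}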


> My thinking was like this: cpuset was the first user of this interface,
> and any future container subsystem writers will certainly use cpuset as
> an example when writing their subsystems. In effect, they will use the
> container-global locks to protect their data structures, which is not
> good in the long run.

Fair point. I'll see about reducing some of the coarse-grained locking
in cpuset.c. But a lot of the time, the cpuset code needs to be able
to prevent both changes to current->cpuset and changes to its
reference - so it has to take container_lock() anyway, and might as
well use that as the only lock, rather than having to take two locks.

One optimization would be to have callback_mutex be an rwsem, so we
could export container_lock()/container_unlock() as well as
container_read_lock() and container_read_unlock(). (PaulJ, any
thoughts on that?)
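
For concreteness, the mooted wrappers might look like this (a sketch of
the proposed change, not code that exists today):

#include <linux/rwsem.h>

static DECLARE_RWSEM(callback_sem);	/* would replace callback_mutex */

/* Exclusive: no groupings change, and readers are blocked. */
void container_lock(void)
{
	down_write(&callback_sem);
}

void container_unlock(void)
{
	up_write(&callback_sem);
}

/* Shared: guarantees no changes, but doesn't block other readers. */
void container_read_lock(void)
{
	down_read(&callback_sem);
}

void container_read_unlock(void)
{
	up_read(&callback_sem);
}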

> > > - Tight coupling of subsystems: I like your idea (you mentioned in a
> > >   reply to the previous thread) of having an array of containers in
> > >   the task structure rather than the current implementation.
> >
> > Can you suggest some scenarios that require this?

> Consider a scenario where you have only the system-level cpuset and have
> multiple RGs. With this model you would be forced to create multiple
> cpusets (with the same set of cpus) so as to allow multiple RGs.

No - see the intro to this patch set. I added a "<subsys>_enabled"
file for each subsystem, managed by the container.c code; if this is
set to 0, then child containers inherit the same per-subsystem state
for that subsystem as the root container.

So if you do echo 0 > /mnt/container/cpuset_enabled, and create child
containers, you still only have one cpuset.
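
In other words, creating a child container would do something like the
following (an illustrative sketch only - these aren't the real structure
or field names from the patch):

/* With <subsys>_enabled == 0, a new child container just shares the
 * root container's state for that subsystem instead of getting its own. */
static int container_populate_subsys(struct container *cont,
				     struct container_subsys *ss)
{
	if (!ss->enabled) {
		/* Disabled: inherit the root's per-subsystem state. */
		cont->subsys[ss->subsys_id] = rootcontainer.subsys[ss->subsys_id];
		return 0;
	}
	/* Enabled: ask the subsystem to create fresh state for this container. */
	return ss->create(ss, cont);
}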

> Now, consider you want to create a cpuset that is a subset of the
> high-level cpuset; where in the hierarchy would you create this cpuset
> (at the top level or one level below)?

For a distinct set of tasks, or for the same set of tasks? If the
former, then it can be a new hierarchy under the top level, no
problem. If it's for the same set of tasks then you'd need to give
more details about how/why you're trying to split up resources. I
would imagine that in almost all situations like this, there would be a
natural way to split things up hierarchically. E.g. you want to pin
each customer's tasks onto a particular set of nodes (so use cpuset
parameters at the top level) and within those you want to let the
customer have different groups of tasks with e.g. different memory
limits (so use RG mem controller parameters at the next level).

Paul
