Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices

Paul M wrote:
> One drawback to this that I can see is the following:
> 
> - suppose you mount a containerfs with controllers cpuset and cpu, and
> create some nodes, and then unmount it, what happens? do the resource
> nodes stick around still?

Sorry - I let interfaces get confused with objects and operations.

Let me back up a step.  I think I have beaten on your proposal enough
to translate it into the more abstract terms that I prefer using when
determining objects, operations and semantics.

It goes like this ... grab a cup of coffee.

We have a container mechanism (the code you extracted from cpusets)
and the potential for one or more instantiations of that mechanism
as container file systems.

This container mechanism provides a pseudo file system structure
that names not disk files, but partitions of the set of tasks
in the system.  As always, by partition I mean a set of subsets,
covering and non-intersecting.  Each element of a partition of the
set of tasks in a system is a subset of that system's tasks.

The container mechanism gives names and permissions to the elements
(subsets of tasks) of the partition, and provides a convenient place
to attach attributes to those partition elements. The directories in
such a container file system always map one-to-one with the elements of
the partition, and the regular files in each such directory represent
the per-element (per-cpuset, for example) attributes.

Each directory in a container file system has a file called 'tasks'
listing the pids of the tasks (newline separated decimal ASCII format)
in that partition element.
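
A minimal sketch of reading such a 'tasks' file from user space,
assuming a container file system mounted at /dev/container (the mount
point used in the scenario below) and a partition element directory
named 'batch' - both names invented for illustration:

    /* Print the pids of the tasks in one partition element.  The
     * 'tasks' file holds newline separated decimal pids, as above. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/dev/container/batch/tasks", "r");
        char line[32];

        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f))
            printf("pid in 'batch': %s", line);
        fclose(f);
        return 0;
    }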

Each container file system needs a name.  This corresponds to the
/dev/sda1 style raw device used to name disk based file systems
independent of where or if they are mounted.

Each task should list in its /proc/<pid> directory, for each such
named container file system in the system, the container file system
relative path of whichever directory in that container (element in
the partition it defines) that task belongs to.  (An earlier proposal
I made to have an entry for each -controller- in each /proc/<pid>
directory was bogus.)

Because containers define a partition of the tasks in a system, each
task will always be in exactly one of the partition elements of a
container file system.  Tasks are moved from one partition element
to another by writing their pid (decimal ASCII) into the 'tasks'
file of the receiving directory.
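
A matching sketch of such a move, again with the illustrative
/dev/container mount point and 'batch' element directory; only the
'tasks' file convention itself comes from the text above:

    /* Move the calling task into the partition element represented by
     * the directory /dev/container/batch, by writing its pid in
     * decimal ASCII into that directory's 'tasks' file. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        FILE *f = fopen("/dev/container/batch/tasks", "w");

        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fprintf(f, "%d\n", getpid()) < 0)
            perror("fprintf");
        fclose(f);
        return 0;
    }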

For some set of events, to include at least the 'release' of a
container element, the user can request that a callout be made to
a user executable.  This carries forth a feature previously known
as 'notify_on_release.'
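
A sketch of what such a user executable might look like, assuming the
callout follows the existing cpuset notify_on_release convention of
passing the container-relative path of the released element as the
sole argument; the /dev/container mount point is, as before, just an
illustration:

    /* Release agent: remove the now empty directory whose container
     * file system relative path arrives in argv[1]. */
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char path[1024];

        if (argc != 2) {
            fprintf(stderr, "usage: %s <released-element-path>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/dev/container%s", argv[1]);
        if (rmdir(path) != 0) {
            perror("rmdir");
            return 1;
        }
        return 0;
    }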

We have several controllers, each of which can be instantiated and
bound to a container file system.  One of these controllers provides
for NUMA processor and memory placement control, and is called cpusets.

Perhaps in the future some controllers will support multiple instances,
bound to different container file systems, at the same time.
By different here, I meant not just different mounts of the same
container file system, but different partitions that divide up the
tasks of the system in different ways.

Each controller specifies a set of attributes to be associated with
each partition element of a container.  The act of associating a
controller's attributes with partition elements I will call "binding".
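
Purely as illustration, and not as a description of the actual
container code, the attributes a controller contributes might be
pictured as a descriptor like this; every name below is invented:

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical per-element attribute file: its name, plus read and
     * write handlers operating on that element's controller state. */
    struct container_attr {
        const char *name;                     /* e.g. "cpus", "mems" */
        ssize_t (*show)(void *elem_state, char *buf, size_t len);
        ssize_t (*store)(void *elem_state, const char *buf, size_t len);
    };

    /* Hypothetical controller descriptor: binding it to a container
     * file system makes its attribute files appear in every directory
     * (partition element) of that file system. */
    struct controller {
        const char *name;                     /* e.g. "cpuset"       */
        const struct container_attr *attrs;   /* per-element files   */
        int attr_count;
        int  (*bind)(void *container_fs);     /* attach to a CFS     */
        void (*unbind)(void *container_fs);   /* detach again        */
    };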

We need to be able to create, destroy and list container file systems,
and for each such container file system, we need to be able to bind
and unbind controller instances thereto.

We need to be able to list what controller types exist in the system
capable of being bound to containers.  We need to be able to list
for each container file system what controllers are bound to it.

And we need to be able to mount and unmount container file systems
from specific mount point paths in the file system.

We definitely need to be able to bind more than one controller to a
given container file system at the same time.  This was my item (3)
in an earlier post today.

We might like to support multiple container file systems at one time.
This seems like a good idea to at least anticipate doing, even if it
turns out to be more work than we can deliver immediately.  This was
my item (4) in that earlier post.

We will probably have some controllers in the future that are able
to be bound to more than one container file system at the same time,
and we have now, and likely will always have, some controllers, such
as cpusets, that must be singular - at most one bound instance at a
time in the system.  This relates to my (buried) item (5) from that
earlier post.  The container code may or may not be able to support
controllers that bind to more than one file system at a time; I don't
know yet either how valuable or difficult this would be.

Overloading all these operations on the mount/umount commands seems
cumbersome, obscure and confusing.  The essential thing a mount does
is bind a kernel object (such as one of our container instances) to
a mount point (path) in the filesystem.  By the way, we should allow
for the possibility that one container instance might be mounted on
multiple points at the same time.
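
As a rough illustration of that last point only, the two mounts might
look like this from user space; the filesystem type "container" and
the option naming the instance are hypothetical, since no such API is
settled anywhere in this note:

    /* Mount one container instance, hypothetically named CFS1, at two
     * mount points at the same time. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        const char *opts = "name=CFS1";       /* hypothetical option */

        if (mount("none", "/dev/container", "container", 0, opts) != 0)
            perror("mount /dev/container");
        if (mount("none", "/foo/bar/container", "container", 0, opts) != 0)
            perror("mount /foo/bar/container");
        return 0;
    }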

So it seems we need additional API's to support the creation and
destruction of containers, and binding controllers to them.

All controllers define an initial default state, and all tasks
can reference, while in that task's context in the kernel, for any
controller type built into the system (or loadable module ?!), the
per-task state of that controller, getting at least this default state
even if the controller is not bound.

If a controller is not bound to any container file system, and
immediately after such a binding, before any of its per-container
attribute files have been modified via the container file system API,
the state of a controller as seen by a task will be this default state.
When a controller is unbound, then the state it presented to each
task in the system reverts to this default state.
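
A conceptual sketch, with invented names, of how a controller Con2
might hand each task either its bound per-element state or that
built-in default:

    #include <stdio.h>

    struct con2_state {
        int share;                    /* some per-element attribute */
    };

    static const struct con2_state con2_default_state = { .share = 100 };

    /* Stand-in for the task's pointer to Con2's per-element state;
     * set while Con2 is bound, left NULL while it is not. */
    static const struct con2_state *current_con2;

    static const struct con2_state *task_con2_state(void)
    {
        return current_con2 ? current_con2 : &con2_default_state;
    }

    int main(void)
    {
        /* Con2 unbound: every task sees the default state. */
        printf("share = %d\n", task_con2_state()->share);
        return 0;
    }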

Container file systems can be unmounted and remounted all the
while retaining their partitioning and any binding to controllers.
Unmounting a container file system just retracts the API mechanism
required to query and manipulate the partitioning and the state per
partition element of bound controllers.

A basic scenario exemplifying these operations might go like this
(notice I've still given no hint of some of the API's involved):

  1) Given a system with controllers Con1, Con2 and Con3, list them.
  2) List the currently defined container file systems, finding none.
  3) Define a container file system CFS1.
  4) Bind controller Con2 to CFS1.
  5) Mount CFS1 on /dev/container.
  6) Bind controller Con3 to CFS1.
  7) List the currently defined container file systems, finding CFS1.
  8) List the controllers bound to CFS1, finding Con2 and Con3.
  9) Mount CFS1 on a second mount point, say /foo/bar/container.
     This gives us two pathnames to refer to the same thing.
 10) Refine and modify the partition defined by CFS1, by making
     subdirectories and moving tasks about.
 11) Define a second container file system - this might fail if our
     implementation doesn't support multiple container file systems
     at the same time yet.  Call this CFS2.
 12) Bind controller Con1 to CFS2.  This should work.
 13) Mount CFS2 on /baz.
 14) Bind controller Con2 to CFS2.  This may well fail if that
     controller must be singular.
 15) Unbind controller Con2 from CFS2.  After this, any task referencing
     its Con2 controller will find the minimal default state.
 16) If (14) failed, try it again.  We should be able to bind Con2 to
     CFS2 now, if not earlier.
 17) List the mount points in the system (cat /proc/mounts).  Observe
     two entries of type container, for CFS1 and CFS2.
 18) List the controllers bound to CFS2, finding Con1 and Con2.
 19) Unmount CFS2.  Its structure remains; however, one lacks any API to
     observe this.
 20) List the controllers bound to CFS2 - still Con1 and Con2.
 21) Remount CFS2 on /bornagain.
 22) Observe its structure and the binding of Con1 and Con2 to it remain.
 23) Unmount CFS2 again.
 24) Ask to delete CFS2 - this fails because it still has controllers
     bound to it.
 25) Unbind controllers Con1 and Con2 from CFS2.
 26) Ask to delete CFS2 - this succeeds this time.
 27) List the currently defined container file systems, once again
     finding just CFS1.
 28) List the controllers bound to CFS1, finding just Con3.
 29) Examine the regular files in the directory /dev/container, where
     CFS1 is currently mounted.  Find the files representing the
     attributes of controller Con3 (a small sketch follows this list).
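
For that last step, a minimal sketch that simply lists the entries in
the top directory of CFS1 at its /dev/container mount point; which
attribute files one would actually see depends on Con3:

    /* List 'tasks' plus whatever attribute files Con3 contributes to
     * the top level partition element of CFS1. */
    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
        DIR *d = opendir("/dev/container");
        struct dirent *de;

        if (!d) {
            perror("opendir");
            return 1;
        }
        while ((de = readdir(d)) != NULL)
            printf("%s\n", de->d_name);
        closedir(d);
        return 0;
    }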

If you indulged in enough coffee to stay awake through all that,
you noticed that I invented some rules on what would or would not
work in certain situations.  For example, I decreed in (24) that one
could not delete a container file system if it had any controllers
bound to it.  I just made these rules up ...

I find it usually works best if I turn the objects and operations
around in my head a bit, before inventing API's to realize them.

So I don't yet have any firmly jelled views on what the additional
API's to manipulate container file systems and controller binding
should look like.

Perhaps someone else will beat me to it.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[email protected]> 1.925.600.0401
