Re: [patch 2/2] cpusets: add interleave_over_allowed option

On Sun, 28 Oct 2007, Paul Jackson wrote:

> The Linux documentation is not a legal contract.  Anytime we change the
> actual behaviour of the code, we have to ask ourselves what will be the
> impact of that change on existing users and usages.  The burden is on
> us to minimize breaking things (by that I mean, what users would
> consider breakage, even if we think it is all for the better and that
> their code was the real problem.)  I didn't say no breakage, but
> minimum breakage, doing our best to guide users through changes with
> minimum disruption to their work.
> 

Nobody can show an example of an application that would be broken because 
of this and, given the scenario and sequence of events that it requires to 
be broken when implementing the default as Choice B, I don't think it's as 
much of an issue as you believe.

> >     However, in deference to the needs of libnuma, if the following
> >     call was made, this would change the mode for that task to
> >     Choice A:
> > 
> > 	get_mempolicy(NULL, NULL, 0, 0, 0);
> > 
> >     This last detail above is an admitted hack.  *So far as I know*
> >     it happens that all current infield versions of libnuma issue the
> >     above call, as their first mempolicy query, to detemine whether
> >     the active kernel supports mempolicy.
> 
> The above is the hack that allows us to support existing libnuma based
> applications (the most significant users of memory policy historically)
> with a default of Choice A, while other code and future code defaults
> to Choice B.
> 

So all applications that use the libnuma interface and numactl will have 
different default behavior than those that simply issue 
{get,set}_mempolicy() calls.  libnuma is a collection of higher level 
functions that should be built upon {get,set}_mempolicy() like they 
currently are and not introduce new subtleties like changing the semantics 
of a preferred node argument.  This is going to quickly become a 
documentation nightmare and, in my opinion, isn't worth the time or effort 
to support because we haven't even idenitifed any real-world examples.

Maybe Andi Kleen should weigh in on this topic because, if we go with what 
you're suggesting, we'll never get rid of the two differing behaviors and 
we'll be introducing different semantics to arguments of libnuma functions 
than the kernel API they are built upon.

> I'm trying to protect almost any application that uses both
> set_mempolicy or mbind, while in a cpuset.
> 
>     If a task is in a cpuset on say nodes 16-23, and it wants to issue
>     any mbind, or any MPOL_PREFERRED, MPOL_BIND, or MPOL_INTERLEAVE
>     mempolicy call, then under Choice A it must issue nodemasks offset
>     by 16, relative to what it would issue under Choice B.
> 

True, but the ordering of that scenario is troublesome.  The correct way 
to implement it is to use set_mempolicy() or a higher level libnuma 
function with the same semantics and _then_ attach the task to a cpuset.  
Then the nodes_remap() takes care of the rest.

The scenario you describe above has a problem because it requires the task 
to have knowledge of the cpuset's mems in which it is attached when, for 
portability, it should have been written so that it is robust to any range 
of nodes you happen to assign it to.

> Almost any task using memory policies on a system making active use of
> cpusets will be affected, even well written ones doing simple things.
> 

No, because nodes_remap() takes care of the instances you describe above 
when the task sets its memory policy (usually done when it is started) and 
is then attached to a cpuset.

> However, if anyone is deploying product or has important (to them) but
> not easy to change software using both memory policies and cpusets that
> are not doing so via libnuma or libcpuset, then any change that
> changes the default for their memory policy calls from Choice A to
> Choice B will probably break them.
> 

Supporting two different behaviors is going to be more problematic than 
simply selecting one and going with it and its associated documentation in 
future versions of the kernel.

> I'm tempted to think I need to go a bit further:
>  1) Add a per-system or per-cpuset runtime mode, to enable a system
>     administrator to revert to the Choice A default.

Paul, the changes required to an application that is currently using 
{get,set}_mempolicy() calls to setup the memory policy or the higher level 
functions through libnuma is so easy to use Choice B as a default instead 
of Choice A that it would be ridiculous to support configuring it on a 
per-system or per-cpuset basis.

>  2) Make sure that both the new libnuma and libcpuset versions dynamically
>     probe the state of this default, and dynamically adapt to running
>     on a kernel, or in a cpuset, of either default.  If this mode is
>     per-cpuset, then so long as there are not two applications that
>     must both run in the same cpuset, both using memory policies,
>     neither using libnuma or libcpuset, requiring conflicting defaults
>     (one Choice A, the other Choice B) then this would provide an
>     administrator sufficient flexibility to adapt.
> 

Choosing only one behavior for the kernel (Choice B) is by far the 
superior selection because then any task can share a cpuset with any other 
task and implement its memory policy preferences in terms of low level 
system calls, numactl, or libnuma.  That's the power that we should be 
giving users, not the addition of hacks or more configuration knobs that 
is going to clutter and confuse anybody who wants to simply pick a 
preferred node.

> There would be one key advantage to a per-cpuset mode flag here.  It
> exposes in a fairly visible way this change in the default numbering of
> memory policy node masks, from cpuset relative to system relative (with
> the system now automatically making them cpuset relative across any
> changes in the number of nodes in the cpuset.)  This is a classic way
> to guide users to understanding new alternatives and changes; expose it
> as a 'button on the dashboard', a feature, not a hidden gotcha.
> 

Yet the 'mems' file would still be system-wide; otherwise it would be 
impossible to expand the memory your cpuset has access to.  Everything 
else would be relative to 'mems'.

		David
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Paul Jackson <[email protected]>

References:
- [patch 1/2] cpusets: extract mmarray loading from update_nodemask
  - From: David Rientjes <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Paul Jackson <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: David Rientjes <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Lee Schermerhorn <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: David Rientjes <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Paul Jackson <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: David Rientjes <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Lee Schermerhorn <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: David Rientjes <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Lee Schermerhorn <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: David Rientjes <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Paul Jackson <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Christoph Lameter <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Paul Jackson <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: David Rientjes <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Paul Jackson <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: David Rientjes <[email protected]>
- Re: [patch 2/2] cpusets: add interleave_over_allowed option
  - From: Paul Jackson <[email protected]>

Prev by Date: Linux 2.6.16.56-rc2
Next by Date: Re: [PATCH 1/3 -v4] x86_64 EFI runtime service support: EFI basic runtime service support
Previous by thread: Re: [patch 2/2] cpusets: add interleave_over_allowed option
Next by thread: Re: [patch 2/2] cpusets: add interleave_over_allowed option
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]