On Fri, 26 Oct 2007, Paul Jackson wrote:
> 1) If you want the current behaviour, where set_mempolicy(MPOL_INTERLEAVE)
> calls mean what they say, and cpusets tries as best it can (imperfectly)
> to honor those memory policy calls, even in the face of changing cpusets,
> then leave memory_spread_user turned off (the default, right?)
> 2) If you want MPOL_INTERLEAVE tasks to interleave user memory across all
> nodes in the cpuset, whatever that might be, then enable memory_spread_user.
>
> This is admittedly less flexible than your patch provided, but I am
> attracted to the simpler API - easier to explain.
>
That seems to follow the convention better with respect to
memory_spread_page and memory_spread_slab anyway. Either all the tasks
attached to the cpuset get that behavior or none of them do. We can do
the same for memory_spread_user.
Sounds good.
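For reference, the existing spread flags are plain per-cpuset booleans, and the proposed memory_spread_user would presumably follow the same interface. A sketch only: memory_spread_user does not exist in current trees, and the cpuset filesystem mount point is an assumption here:

```shell
# assumes the cpuset filesystem is mounted at /dev/cpuset
mkdir /dev/cpuset/demo
echo 1 > /dev/cpuset/demo/memory_spread_page   # existing flag
echo 1 > /dev/cpuset/demo/memory_spread_slab   # existing flag
echo 1 > /dev/cpuset/demo/memory_spread_user   # proposed flag
```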
> This does beg yet another question: shouldn't memory_spread_user force
> interleaving of user memory -regardless- of mempolicy.
>
Sure, I don't see any compelling reason why it shouldn't.
> And yet another question: what about the MPOL_BIND mempolicy? It too,
> to a lesser extent, has the same problems with cpusets that shrink and
> then expand. Several tasks in a cpuset with multiple nodes could carefully
> bind to a separate node each, but then end up collapsed all onto the same
> node if the cpuset was shrunk to one node and then expanded again.
>
Hmm. At some point we're going to have to just say that if you use
mempolicies such as MPOL_BIND in your application and then you insanely
take those nodes away from your application via cpusets, you are getting
exactly what you asked for.
There are two ways to fix that: try to remap the MPOL_BIND nodes onto the
new set of mems_allowed regardless of the cardinality of the two sets, or
refuse to update the nodemask of the cpuset if you're taking one or more
nodes away from an attached task that has such a policy. I favor the
former because, in conjunction with a sane memory_migrate setting, it
shouldn't actually matter that much. The memory you previously allocated
will still be on the removed nodes; only your future allocations will
actually respect the new nodemask.
The MPOL_INTERLEAVE case is more interesting because we're trying to
reduce bus contention and lower latency with quicker memory access.
So, on the premise that a node really is a distinct locality domain, we
should get better results in terms of performance if we expand the
nodemask as much as
possible. That's exactly what we've been trying to address: when an
application's mems_allowed is expanded to allow more nodes, the
application is unaware of the change and can't take advantage of it
(without a get_mempolicy()/set_mempolicy() loop). That's my whole
case for why cpusets should be modifying MPOL_INTERLEAVE policies in the
first place: because they are the ones that allowed access to more memory.
> On a different point, we could, if it was worth the extra bit of code,
> improve the current code's handling of mempolicy rebinding when the
> cpuset adds memory nodes. If we kept both the original cpusets
> mems_allowed, and the original MPOL_INTERLEAVE nodemask requested by
> the user in a call to set_mempolicy, then we could rebind (nodes_remap)
> the currently active policy v.nodes using that pair of saved masks to
> guide the rebinding. This way, if say a cpuset shrunk, then regrew back
> to its original size (original number of nodes) we would end up
> replicating the original MPOL_INTERLEAVE request, cpuset relative.
Keeping a copy of the nodemask passed to set_mempolicy() in struct
mempolicy is an interesting idea and could, with the logic you describe,
help guide the remapping as the set of allowed nodes changes. We'd have
two interleaved nodemasks, the actual (pol->v.nodes) and the requested
(something like pol->passed_nodemask). get_mempolicy() would always
return the actual nodemask so the application can be aware of what it has
access to and what it doesn't. I like it.
David