[BUG] sched: big numa dynamic sched domain memory corruption

We have hit a bug in the dynamic sched domain setup.  It causes random
memory writes via a stale pointer.

I don't entirely understand the code yet, so my description of this
bug may be flawed.  I'll do the best I can.  Thanks to Jack Steiner
for figuring out what we know so far.

The three systems we are testing on have 128, 224 and 256 CPUs.
They are single-core ia64 SN2 Itanium systems, configured with:
  CONFIG_NUMA - enabled
  CONFIG_SCHED_MC - disabled
  CONFIG_SCHED_SMT - disabled

They are running approximately the 2.6.16.* kernel found in SLES10.

We first noticed the problem due to the memory clobbering, and
had to crank up the slab debug code a notch to backtrack to the
apparent original cause.  The bug does not cause an immediate
crash or kernel complaint.

In sum, it appears that the large sched_group_allnodes array is
freed by arch_destroy_sched_domains() when someone redefines the
cpu_exclusive portion of the cpuset configuration, but some of the
sd->groups pointers are left pointing into the freed array, causing
the assignment:
	sd->groups->cpu_power = power;
to write through a stale sd->groups pointer.

The build_sched_domains() code only rebuilds the sd->groups pointers
into the current sched_group_allnodes array for those cpus that are
in the specified cpu_map.  The remaining cpus are apparently left
with stale sd->groups pointers.
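
A minimal user-space sketch of the hazard (not kernel code -- the
struct layouts are invented stand-ins; only the names groups and
cpu_power come from sched.c):

	#include <stdlib.h>

	/* Invented stand-ins for the kernel structures, just to show
	 * the shape of the bug; the real structs are much larger. */
	struct sched_group  { unsigned long cpu_power; };
	struct sched_domain { struct sched_group *groups; };

	int main(void)
	{
		struct sched_group *allnodes = calloc(4, sizeof(*allnodes));
		struct sched_domain sd = { .groups = &allnodes[0] };

		free(allnodes);           /* the destroy path frees the array... */
		sd.groups->cpu_power = 1; /* ...but sd.groups was never rewired, */
		return 0;                 /* so this write lands in freed memory */
	}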

That summary may be too inaccurate to be helpful, so I'll step
through the failing scenario in more detail, and hopefully with
fewer inaccuracies.


    During the system boot, the initial call to build_sched_domains()
    sets up the all-encompassing sched_group_allnodes array and the
    smaller child domains and groups.  So far, all is well.  Part of
    this initialization includes allocating that large array and, for
    each cpu in the system, initializing its sd->groups->cpu_power
    element within it.

    After boot, we run some commands that create a child cpuset,
    with, for this example, cpus 4-8, marked cpu_exclusive.

    This calls arch_destroy_sched_domains(), which frees
    sched_group_allnodes.

    Then this calls build_sched_domains() with a mask including
    *all-but* cpus 4-8 (in this example).  That call allocates a new
    sched_group_allnodes and, in the first for_each_cpu_mask() loop,
    initializes the sched domain, including sd->groups, for *all-but*
    cpus 4-8.  The sd->groups pointers for cpus 4-8 are still pointing
    back at the now-freed original sched_group_allnodes array.

    Then we call build_sched_domains() again, with a mask for just
    cpus 4-8.  It executes the line:
    	sd->groups->cpu_power = power;
    with a stale sd->groups pointer, clobbering the already-freed
    memory that used to hold the sched_group_allnodes array.  In our
    configuration, we hit the "#ifdef CONFIG_NUMA" variant of this line.
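
To make the sequence concrete, here is a self-contained user-space
sketch of the three steps.  Only the names build_sched_domains,
arch_destroy_sched_domains, sched_group_allnodes, sd->groups and
cpu_power come from the kernel; the 16-cpu size, the BIG threshold,
the bitmask cpu_map and the one-group-per-cpu wiring are all invented
to keep the model small:

	#include <stdlib.h>

	#define NCPUS 16  /* illustrative; real systems have 128-256 cpus  */
	#define BIG    8  /* stand-in for SD_NODES_PER_DOMAIN*cpus_per_node */

	struct sched_group  { unsigned long cpu_power; };
	struct sched_domain { struct sched_group *groups; };

	static struct sched_domain sd[NCPUS];      /* per-cpu allnodes domains */
	static struct sched_group *sched_group_allnodes;  /* big shared array  */

	static int weight(unsigned long map)
	{
		return __builtin_popcountl(map);
	}

	/*
	 * Model of build_sched_domains(): sched_group_allnodes and the
	 * sd->groups pointers are only (re)wired for cpus in cpu_map,
	 * and only when cpu_map is "big"; the cpu_power pass afterwards
	 * writes through whatever sd->groups already holds.
	 */
	static void build_sched_domains(unsigned long cpu_map)
	{
		int i;

		if (weight(cpu_map) > BIG) {    /* the size test at issue */
			if (!sched_group_allnodes)
				sched_group_allnodes =
					calloc(NCPUS, sizeof(struct sched_group));
			for (i = 0; i < NCPUS; i++)
				if (cpu_map & (1UL << i))
					sd[i].groups = &sched_group_allnodes[i];
		}

		for (i = 0; i < NCPUS; i++)     /* the cpu_power pass */
			if ((cpu_map & (1UL << i)) && sd[i].groups)
				sd[i].groups->cpu_power = 1;
	}

	/* Model of arch_destroy_sched_domains(): frees the array but
	 * leaves every sd[i].groups pointing into it. */
	static void arch_destroy_sched_domains(void)
	{
		free(sched_group_allnodes);
		sched_group_allnodes = NULL;
	}

	int main(void)
	{
		build_sched_domains(0xffff);           /* boot: all cpus wired */
		arch_destroy_sched_domains();          /* cpuset reconfigured  */
		build_sched_domains(0xffff & ~0x1f0);  /* rebuild all-but 4-8  */

		/* sd[4..8].groups still point into the freed first array. */

		build_sched_domains(0x1f0);  /* 5 cpus: the size test fails,
		                              * nothing is rewired, and the
		                              * cpu_power pass writes through
		                              * the five stale pointers      */
		return 0;
	}

In the real code the allnodes groups are wired per node rather than
per cpu, but the pointer lifetime problem is the same shape.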


Aha - while writing the above, I had an idea for a possible fix.

The following patch seems to fix this problem, at least for the
above CONFIG on one of the test systems.  Though I have no particular
confidence that it is a good patch.

The idea of the patch is to -always- execute the code conditioned by
the "if (... > SD_NODES_PER_DOMAIN*...) {" test on big systems, even
if we happen to be calling build_sched_domains() with a small cpu_map.
It does that by testing the weight of cpu_online_map instead of the
passed-in *cpu_map, and by taking the per-node cpu count before
nodemask has been masked down by *cpu_map.

---

 kernel/sched.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

--- linux.orig/kernel/sched.c	2006-07-30 23:42:12.182958555 -0700
+++ linux/kernel/sched.c	2006-07-30 23:45:12.513282355 -0700
@@ -5675,12 +5675,13 @@ void build_sched_domains(const cpumask_t
 		int group;
 		struct sched_domain *sd = NULL, *p;
 		cpumask_t nodemask = node_to_cpumask(cpu_to_node(i));
+		int cpus_per_node = cpus_weight(nodemask);
 
 		cpus_and(nodemask, nodemask, *cpu_map);
 
 #ifdef CONFIG_NUMA
-		if (cpus_weight(*cpu_map)
-				> SD_NODES_PER_DOMAIN*cpus_weight(nodemask)) {
+		if (cpus_weight(cpu_online_map)
+				> SD_NODES_PER_DOMAIN*cpus_per_node) {
 			if (!sched_group_allnodes) {
 				sched_group_allnodes
 					= kmalloc(sizeof(struct sched_group)


-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <[email protected]> 1.925.600.0401