Database regression due to scheduler changes ?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

We observed a 1.5% regression in an OLTP database workload going from
2.6.13-rc4 to 2.6.13-rc5.  The regression has been carried forward
at least as far as 2.6.14-rc5.

Through experimentation, and through examining the changes that
went into 2.6.13-rc5, we found that we can eliminate the regression
in 2.6.13-rc5 with one straightforward change:  eliminating the
NUMA level from the CPU scheduler domain structures.

After observing this, we collected schedstats (provided below)
to try to determine how the scheduler behaves differently
when the NUMA level is eliminated.  It appears to us that
the scheduler is having more success in balancing in this
case.  We tried to duplicate this effect by changing parameters
in the NUMA-level and SMP-level domain definitions to
increase the aggressiveness of the balancing, but none of the
changes could recoup the regression.

We suspect the regression was introduced in the scheduler changes
that went into 2.6.13-rc1.  However, the regression was hidden
from us by a bug in include/asm-ppc64/topology.h that made ppc64
look non-NUMA from 2.6.13-rc1 through 2.6.13-rc4.  That bug was
fixed in 2.6.13-rc5.  Unfortunately the workload does not run to
completion on 2.6.12 or 2.6.13-rc1.  We have measurements on
2.6.12-rc6-git7 that do not show the regression.

One alternative for fixing this in 2.6.13 would have been to #define
ARCH_HAS_SCHED_DOMAINS and to introduce a ppc64-specific version
of build_sched_domains that eliminates the NUMA-level domain for
small (e.g. 4-way) ppc64 systems.  However, ARCH_HAS_SCHED_DOMAINS
has been eliminated from 2.6.14, and anyways that solution doesn't
seem very encompassing to me.

So, at this point I am soliciting assistance from scheduler experts
to determine how this regression can be eliminated.  We are keen
to prevent this regression from going into the next distro versions.
Simply shipping a distro kernel with CONFIG_NUMA off isn't a viable
option because we need it for our larger configurations.

Our system configuration is a 4-way 1.9 GHz Power5-based server.  As
the system supports SMT, it shows eight online CPUs.

Below are the schedstats.  The first set is with the NUMA-level
domain, while the second set is without the NUMA-level domain.

Cheers,
Brian Twichell

Schedstats (NUMA-level domain included)
----------------------------------------------------------------------
00:09:05--------------------------------------------------------------
      2845          sys_sched_yield()
         0(  0.00%) found (only) active queue empty on current cpu
         0(  0.00%) found (only) expired queue empty on current cpu
       157(  5.52%) found both queues empty on current cpu
      2688( 94.48%) found neither queue empty on current cpu


   23287180          schedule()
         1(  0.00%) switched active and expired queues
         0(  0.00%) used existing active queue

         0          active_load_balance()
         0          sched_balance_exec()

     0.19/1.17      avg runtime/latency over all cpus (ms)

[scheduler domain #0]
   1418943          load_balance()
    112240(  7.91%) called while idle
                        499(  0.44%) tried but failed to move any tasks
                      80433( 71.66%) found no busier group
                      31308( 27.89%) succeeded in moving at least one task
                                     (average imbalance:   1.549)
    316022( 22.27%) called while busy
                         21(  0.01%) tried but failed to move any tasks
                     220440( 69.75%) found no busier group
                      95561( 30.24%) succeeded in moving at least one task
                                     (average imbalance:   1.727)
    990681( 69.82%) called when newly idle
                        533(  0.05%) tried but failed to move any tasks
                     808816( 81.64%) found no busier group
                     181332( 18.30%) succeeded in moving at least one task
                                     (average imbalance:   1.500)

         0          sched_balance_exec() tried to push a task

[scheduler domain #1]
    922193          load_balance()
     85822(  9.31%) called while idle
                       4032(  4.70%) tried but failed to move any tasks
                      70982( 82.71%) found no busier group
                      10808( 12.59%) succeeded in moving at least one task
                                     (average imbalance:   1.348)
     27022(  2.93%) called while busy
                        106(  0.39%) tried but failed to move any tasks
                      25478( 94.29%) found no busier group
                       1438(  5.32%) succeeded in moving at least one task
                                     (average imbalance:   1.712)
    809349( 87.76%) called when newly idle
                       6967(  0.86%) tried but failed to move any tasks
                     757097( 93.54%) found no busier group
                      45285(  5.60%) succeeded in moving at least one task
                                     (average imbalance:   1.338)

         0          sched_balance_exec() tried to push a task

[scheduler domain #2]
    825662          load_balance()
     52074(  6.31%) called while idle
                      17791( 34.16%) tried but failed to move any tasks
                      32839( 63.06%) found no busier group
                       1444(  2.77%) succeeded in moving at least one task
                                     (average imbalance:   1.981)
      9524(  1.15%) called while busy
                       1072( 11.26%) tried but failed to move any tasks
                       7654( 80.37%) found no busier group
                        798(  8.38%) succeeded in moving at least one task
                                     (average imbalance:   2.976)
    764064( 92.54%) called when newly idle
                     262831( 34.40%) tried but failed to move any tasks
                     409353( 53.58%) found no busier group
                      91880( 12.03%) succeeded in moving at least one task
                                     (average imbalance:   2.518)

         0          sched_balance_exec() tried to push a task


Schedstats (NUMA-level domain eliminated)
----------------------------------------------------------------------
00:09:03--------------------------------------------------------------
      2576          sys_sched_yield()
         0(  0.00%) found (only) active queue empty on current cpu
         0(  0.00%) found (only) expired queue empty on current cpu
       118(  4.58%) found both queues empty on current cpu
      2458( 95.42%) found neither queue empty on current cpu


   23617887          schedule()
   1106774          goes idle
         0(  0.00%) switched active and expired queues
         0(  0.00%) used existing active queue

         0          active_load_balance()
         0          sched_balance_exec()

     0.19/1.10      avg runtime/latency over all cpus (ms)

[scheduler domain #0]
   1810988          load_balance()
    153509(  8.48%) called while idle
                        680(  0.44%) tried but failed to move any tasks
                     104906( 68.34%) found no busier group
                      47923( 31.22%) succeeded in moving at least one task
                                     (average imbalance:   1.658)
    317016( 17.51%) called while busy
                         30(  0.01%) tried but failed to move any tasks
                     217438( 68.59%) found no busier group
                      99548( 31.40%) succeeded in moving at least one task
                                     (average imbalance:   1.831)
   1340463( 74.02%) called when newly idle
                        762(  0.06%) tried but failed to move any tasks
                    1092960( 81.54%) found no busier group
                     246741( 18.41%) succeeded in moving at least one task
                                     (average imbalance:   1.564)

         0          sched_balance_exec() tried to push a task

[scheduler domain #1]
   1244187          load_balance()
    111326(  8.95%) called while idle
                       8396(  7.54%) tried but failed to move any tasks
                      71276( 64.02%) found no busier group
                      31654( 28.43%) succeeded in moving at least one task
                                     (average imbalance:   1.412)
     39138(  3.15%) called while busy
                        220(  0.56%) tried but failed to move any tasks
                      34676( 88.60%) found no busier group
                       4242( 10.84%) succeeded in moving at least one task
                                     (average imbalance:   1.360)
   1093723( 87.91%) called when newly idle
                      15971(  1.46%) tried but failed to move any tasks
                     932422( 85.25%) found no busier group
                     145330( 13.29%) succeeded in moving at least one task
                                     (average imbalance:   1.189)

         0          sched_balance_exec() tried to push a task


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux