Nick Piggin wrote:
> I think you are right that the NUMA domain is probably being too
> constrictive of task balancing, and that is where the regression
> is coming from.
>
> For some workloads it is definitely important to have the NUMA
> domain, because it helps spread load over memory controllers as
> well as CPUs - so I guess eliminating that domain is not a good
> long term solution.
>
> I would look at changing parameters of SD_NODE_INIT in
> include/asm-powerpc/topology.h so they are closer to SD_CPU_INIT
> parameters (ie. more aggressive).
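For reference, here is SD_CPU_INIT as I recall it from
include/linux/topology.h. I'm quoting it from memory, so treat the
exact values and flag list as approximate; these are the parameters
Nick is suggesting we move towards:

    /* SD_CPU_INIT, reproduced from memory (2.6.14-era); double-check
     * against the tree before relying on the flag list. */
    #define SD_CPU_INIT (struct sched_domain) {      \
        .span             = CPU_MASK_NONE,           \
        .parent           = NULL,                    \
        .groups           = NULL,                    \
        .min_interval     = 1,                       \
        .max_interval     = 4,                       \
        .busy_factor      = 64,                      \
        .imbalance_pct    = 125,                     \
        .cache_hot_time   = (5*1000000/2),           \
        .cache_nice_tries = 1,                       \
        .per_cpu_gain     = 100,                     \
        .flags            = SD_LOAD_BALANCE          \
                          | SD_BALANCE_NEWIDLE       \
                          | SD_BALANCE_EXEC          \
                          | SD_WAKE_AFFINE,          \
        .last_balance     = jiffies,                 \
        .balance_interval = 1,                       \
        .nr_balance_failed = 0,                      \
    }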
I ran with the following:
--- topology.h.orig  2005-11-08 13:11:57.000000000 -0600
+++ topology.h       2005-11-08 13:17:15.000000000 -0600
@@ -43,11 +43,11 @@ static inline int node_to_first_cpu(int
     .span             = CPU_MASK_NONE,   \
     .parent           = NULL,            \
     .groups           = NULL,            \
-    .min_interval     = 8,               \
-    .max_interval     = 32,              \
-    .busy_factor      = 32,              \
+    .min_interval     = 1,               \
+    .max_interval     = 4,               \
+    .busy_factor      = 64,              \
     .imbalance_pct    = 125,             \
-    .cache_hot_time   = (10*1000000),    \
+    .cache_hot_time   = (5*1000000/2),   \
     .cache_nice_tries = 1,               \
     .per_cpu_gain     = 100,             \
     .flags            = SD_LOAD_BALANCE  \
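To see why those values ought to make balancing more aggressive, here
is a small user-space sketch of my reading of the balance-interval
arithmetic in kernel/sched.c. The structure and helper names are mine,
not the kernel's, and the "back off to max_interval" rule is my
assumption about how balance_interval behaves once a domain keeps
reporting itself balanced:

    #include <stdio.h>

    /* Sketch only: models the interval tunables of a sched domain. */
    struct sd_tunables {
        unsigned int min_interval;  /* ms: floor for balance_interval */
        unsigned int max_interval;  /* ms: ceiling for balance_interval */
        unsigned int busy_factor;   /* multiplier when the cpu is not idle */
    };

    /* Worst-case ms between balance attempts on a busy cpu, once
     * repeated "balanced" results have backed the interval off to
     * max_interval (assumed back-off behaviour). */
    static unsigned int busy_interval_worst(const struct sd_tunables *sd)
    {
        return sd->max_interval * sd->busy_factor;
    }

    int main(void)
    {
        struct sd_tunables old_node = { 8, 32, 32 };  /* original SD_NODE_INIT */
        struct sd_tunables new_node = { 1, 4, 64 };   /* patched values */

        printf("old node domain: up to %u ms between busy balances\n",
               busy_interval_worst(&old_node));  /* 32 * 32 = 1024 ms */
        printf("new node domain: up to %u ms between busy balances\n",
               busy_interval_worst(&new_node));  /*  4 * 64 =  256 ms */
        return 0;
    }

So even with the larger busy_factor, the much smaller max_interval
should cut the worst-case gap between balance attempts on a busy cpu
from roughly 1024ms to 256ms.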
There was no improvement in performance. The schedstats from this run
follow:
2516 sys_sched_yield()
    0( 0.00%) found (only) active queue empty on current cpu
    0( 0.00%) found (only) expired queue empty on current cpu
    46( 1.83%) found both queues empty on current cpu
    2470( 98.17%) found neither queue empty on current cpu

22969106 schedule()
    694922 goes idle
    3( 0.00%) switched active and expired queues
    0( 0.00%) used existing active queue

0 active_load_balance()
0 sched_balance_exec()

0.19/1.28 avg runtime/latency over all cpus (ms)

[scheduler domain #0]
    1153606 load_balance()
        82580( 7.16%) called while idle
            488( 0.59%) tried but failed to move any tasks
            63876( 77.35%) found no busier group
            18216( 22.06%) succeeded in moving at least one task
                (average imbalance: 1.526)
        317610( 27.53%) called while busy
            15( 0.00%) tried but failed to move any tasks
            220139( 69.31%) found no busier group
            97456( 30.68%) succeeded in moving at least one task
                (average imbalance: 1.752)
        753416( 65.31%) called when newly idle
            487( 0.06%) tried but failed to move any tasks
            624132( 82.84%) found no busier group
            128797( 17.10%) succeeded in moving at least one task
                (average imbalance: 1.531)
    0 sched_balance_exec() tried to push a task

[scheduler domain #1]
    715638 load_balance()
        68533( 9.58%) called while idle
            3140( 4.58%) tried but failed to move any tasks
            60357( 88.07%) found no busier group
            5036( 7.35%) succeeded in moving at least one task
                (average imbalance: 1.251)
        22486( 3.14%) called while busy
            64( 0.28%) tried but failed to move any tasks
            21352( 94.96%) found no busier group
            1070( 4.76%) succeeded in moving at least one task
                (average imbalance: 1.922)
        624619( 87.28%) called when newly idle
            5218( 0.84%) tried but failed to move any tasks
            591970( 94.77%) found no busier group
            27431( 4.39%) succeeded in moving at least one task
                (average imbalance: 1.382)
    0 sched_balance_exec() tried to push a task

[scheduler domain #2]
    685164 load_balance()
        63247( 9.23%) called while idle
            7280( 11.51%) tried but failed to move any tasks
            52200( 82.53%) found no busier group
            3767( 5.96%) succeeded in moving at least one task
                (average imbalance: 1.361)
        24729( 3.61%) called while busy
            418( 1.69%) tried but failed to move any tasks
            21025( 85.02%) found no busier group
            3286( 13.29%) succeeded in moving at least one task
                (average imbalance: 3.579)
        597188( 87.16%) called when newly idle
            67577( 11.32%) tried but failed to move any tasks
            371377( 62.19%) found no busier group
            158234( 26.50%) succeeded in moving at least one task
                (average imbalance: 2.146)
    0 sched_balance_exec() tried to push a task
> I would also take a look at removing SD_WAKE_IDLE from the flags.
> This flag should make balancing more aggressive, but it can have
> problems when applied to a NUMA domain due to too much task
> movement.
Independently of the run above, I ran with the following:
--- topology.h.orig  2005-11-08 19:32:19.000000000 -0600
+++ topology.h       2005-11-08 19:34:25.000000000 -0600
@@ -53,7 +53,6 @@ static inline int node_to_first_cpu(int
     .flags            = SD_LOAD_BALANCE    \
                       | SD_BALANCE_EXEC    \
                       | SD_BALANCE_NEWIDLE \
-                      | SD_WAKE_IDLE       \
                       | SD_WAKE_BALANCE,   \
     .last_balance     = jiffies,           \
     .balance_interval = 1,                 \
There was no improvement in performance.
I didn't expect any change in performance this time, because I
don't think the SD_WAKE_IDLE flag is effective in the NUMA
domain, due to the following code in wake_idle():
    for_each_domain(cpu, sd) {
        if (sd->flags & SD_WAKE_IDLE) {
            cpus_and(tmp, sd->span, p->cpus_allowed);
            for_each_cpu_mask(i, tmp) {
                if (idle_cpu(i))
                    return i;
            }
        }
        else
            break;
    }
If I read that loop correctly, it stops at the first domain that
doesn't have SD_WAKE_IDLE set, which is the CPU domain (see
SD_CPU_INIT), so it never reaches the NUMA domain.
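To convince myself, I mocked up the traversal in user space. This is a
standalone sketch: the flag value and domain names are made up, and
only the control flow mirrors the wake_idle() loop above:

    #include <stdio.h>

    #define SD_WAKE_IDLE 0x01  /* illustrative value, not the kernel's */

    struct sched_domain {
        const char *name;
        int flags;
        struct sched_domain *parent;  /* next higher level, NULL at the top */
    };

    int main(void)
    {
        /* Bottom-up hierarchy as on this box (names assumed):
         * SMT -> CPU -> NODE.  NODE has SD_WAKE_IDLE (as in the
         * unpatched SD_NODE_INIT); CPU does not. */
        struct sched_domain node = { "NODE", SD_WAKE_IDLE, NULL };
        struct sched_domain cpu  = { "CPU",  0,            &node };
        struct sched_domain smt  = { "SMT",  SD_WAKE_IDLE, &cpu };
        struct sched_domain *sd;

        for (sd = &smt; sd; sd = sd->parent) {
            if (sd->flags & SD_WAKE_IDLE) {
                printf("%s: would scan this span for an idle cpu\n", sd->name);
            } else {
                printf("%s: no SD_WAKE_IDLE, loop breaks here\n", sd->name);
                break;  /* NODE is never examined */
            }
        }
        return 0;
    }

It reports that the SMT level is scanned, the loop breaks at the CPU
level, and the NODE level is never examined, which is consistent with
SD_WAKE_IDLE in SD_NODE_INIT being a no-op.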
Thanks for the suggestions, Nick. Andrew raises some
good questions that I will address tomorrow.
Cheers,
Brian