Re: [RFC] scheduler issue & patch

On Mon, Jun 12, 2006 at 05:30:42PM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> I'm looking into a scheduler issue with a NUMA box and scheduling
> domains.  The machine is a dual-core Opteron with two nodes, i.e.
> four cpus.  cpu0+1 form node0, cpu2+3 form node1.
> 
> Now I have an application (benchmark) with two threads which performs
> best when the two threads are running on different nodes (probably
> because the cpus on each node share the L2 cache).  The scheduler tends
> to keep threads on the local node though, which probably makes sense in
> most cases because local memory is faster.
> 
> Ok, we have tools to give hints to the scheduler (taskset, numactl).
> The problem is that it doesn't work well.  I can ask the scheduler to use
> cpu1 (node0) and cpu3 (node1) only (via "taskset 0x0a").  But the
> scheduler very often schedules both threads on the same cpu :-(
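For reference, 0x0a is binary 1010, i.e. cpu1 and cpu3.  The same pinning
can also be done per-thread from inside the program; a minimal sketch using
glibc's sched_setaffinity(2), with the benchmark-specific parts left out:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t mask;

	/* Equivalent of "taskset 0x0a": allow cpu1 (node0) and cpu3 (node1). */
	CPU_ZERO(&mask);
	CPU_SET(1, &mask);
	CPU_SET(3, &mask);

	/* pid 0 applies the mask to the calling thread. */
	if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
		perror("sched_setaffinity");
		return 1;
	}
	/* ... spawn the two benchmark threads here ... */
	return 0;
}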
> 
> I think the reason is that the scheduler always checks the complete cpu
> groups when calculating the group load, without looking at
> task->cpus_allowed.  So we have the effect that the scheduler walks down
> the scheduler domain tree, looks at the group for node0, looks at both
> cpu0 and cpu1, finds node0 being not overloaded due to cpu0 being idle
> and decides to keep the thread on the local node.  Next it walks down
> the tree and finds it isn't allowed to use the idle cpu0.  So both
> threads get scheduled to cpu1.  Oops.
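To make that walk concrete, here is a simplified userspace sketch of a
group-load sum that ignores task->cpus_allowed.  This is illustrative only,
not the actual kernel code; the struct and array names are made up:

#include <stdio.h>

struct group { int first_cpu, ncpus; };

static unsigned long cpu_load[4] = { 0, 128, 0, 0 };	/* only cpu1 busy */

static unsigned long group_load(const struct group *g)
{
	unsigned long load = 0;
	int i;

	for (i = 0; i < g->ncpus; i++)	/* every cpu, no cpus_allowed check */
		load += cpu_load[g->first_cpu + i];
	return load;
}

int main(void)
{
	struct group node0 = { 0, 2 }, node1 = { 2, 2 };

	printf("node0: %lu, node1: %lu\n",
	       group_load(&node0), group_load(&node1));
	/* node0 shows one task's load spread over two cpus, so it looks
	 * "not overloaded" even though only cpu1 is actually usable. */
	return 0;
}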

I don't think the problem is with sched_balance_self().  sched_balance_self()
probably does the right thing based on the load present at the time of
fork/exec.  Once node-1 becomes idle, we expect the two threads on node-0's
cpu-1 to get distributed between the two nodes.

Perhaps the real issue is how cpu_power is calculated for the node domain
on these systems.  Because of the shared resources between the cpus in a
node, cpu_power for a group in the node domain should be < 2 * SCHED_LOAD_SCALE.

Once this is the case, find_busiest_group() should detect the imbalance and
move one of the threads from cpu-1 (node-0) to cpu-3 (node-1).
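Rough arithmetic to illustrate (a sketch only; SCHED_LOAD_SCALE is 128 in
2.6, the rest of the names are made up):

#include <stdio.h>

#define SCHED_LOAD_SCALE 128UL	/* one cpu's worth of load in 2.6 */

int main(void)
{
	unsigned long node0_load = 2 * SCHED_LOAD_SCALE;  /* both threads */
	unsigned long full_power = 2 * SCHED_LOAD_SCALE;
	unsigned long reduced_power = 3 * SCHED_LOAD_SCALE / 2;

	/* find_busiest_group()-style normalization: group load scaled
	 * by the group's cpu_power. */
	printf("power=2.0*SCALE: avg %lu -> looks balanced\n",
	       node0_load * SCHED_LOAD_SCALE / full_power);
	printf("power=1.5*SCALE: avg %lu -> over capacity\n",
	       node0_load * SCHED_LOAD_SCALE / reduced_power);
	return 0;
}

With cpu_power = 2 * SCHED_LOAD_SCALE the two threads on node-0 look
perfectly balanced; with anything less the group is over capacity and the
balancer has a reason to pull one of them over to the idle node-1.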

> The patch attached takes the sledgehammer approach to fix it:  In case
> we have a non-default cpumask in task->cpus_allowed the scheduler
> ignores all the fancy scheduling domains and simply spreads the load
> equally over the cpus allowed by task->cpus_allowed.  Not exactly
> elegant, but works.  Not each time, but very often.
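Roughly, the approach described would look something like this (a sketch
of the idea only, not the attached patch):

/* Ignore the domain tree; pick the least-loaded cpu allowed by the
 * task's cpumask. */
static int pick_cpu(unsigned long cpus_allowed,
		    const unsigned long *cpu_load, int ncpus)
{
	int cpu, best = -1;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (!(cpus_allowed & (1UL << cpu)))
			continue;
		if (best < 0 || cpu_load[cpu] < cpu_load[best])
			best = cpu;
	}
	return best;
}

With cpus_allowed = 0x0a this picks cpu1 or cpu3, whichever carries less
load at that instant; it can still race with load changes, which would
explain the "very often, not each time" behaviour.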
> 
> Comments?  Ideas how to solve this better?  I've also tried to play with
> the group load calculation, but it didn't work well.  I'm kinda lost in
> all those scheduler tuning knobs ...

In my opinion, this patch is not the correct fix for the issue.

thanks,
suresh
