Re: [PATCH] Fix longstanding load balancing bug in the scheduler.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Sep 07, 2006 at 03:20:32PM -0700, Christoph Lameter wrote:
> On Thu, 7 Sep 2006, Nick Piggin wrote:
> 
> > How about if you have N/2 CPUs with lots of stuff on runqueues?
> > Then the other N/2 will each be scanning N/2+1 runqueues... that's
> > a lot of bus traffic going on which you probably don't want.
> 
> Then the load will likely be sitting on a runqueue and run 
> much slower since it has to share cpus although many cpus are available 
> to take new jobs. The scanning is very fast and it is certainly better 
> than continuing to overload a single processor.

But it is N^2... I thought SGI of all would hesitate to put such an
algorithm into the scheduler ;)

> > A sched-domain is per-CPU as well. Why doesn't it work?
> 
> Lets say you store the latest set of pinned cpus encountered (I am not 
> sure what cpu number would accomplish). The next time you have to 
> reschedule the situation may be completely different and you would have
> to revalidate the per cpu pinned cpu set. 

Maybe you misunderstood, I'll explain below.

> > Yes, it isn't going to be perfect, but it would at least work (which
> > it doesn't now) without introducing that latency.
> 
> Could you tell us how this could work?

You keep track of a fallback CPU. Then, if balancing fails because all
processes are pinned, you just try pulling from the fallback CPU. If
that fails, set the fallback to the next CPU.

OTOH that still results in suboptimal scheduler, and now that I remember
your workload, you had a small number of CPUs screwing up balancing for
a big system. Hmm... so this may be not great for you.

> 
> > > This looks to me as a design flaw that would require either a major rework 
> > > of the scheduler or (my favorite) a delegation of difficult (and 
> > > computational complex and expensive) scheduler decisions to user space.
> > 
> > I'd be really happy to move this to userspace :)
> 
> Can we merge this patch and then I will try to move this as fast as 
> possible to userspace?

I'd like to get an ack from Ingo, but otherwise OK I guess.

> 
> Ok then we need to do the following to address the current issue and to
> open up the possibilities for future enhancements:
> 
> 1. Add a new field to /proc/<pid>/cpu that can be written to and that
>    forces the process to be moved to the corresponding cpu.
> 
> 2. Extend statistics in /proc/<pid>/ to allow the export of more scheduler
>    information and a special flag that is set if a process cannot be 
>    rescheduling. initially the user space scheduler may just poll this 
>   field.
> 
> 3. Optional: Some sort of notification mechanism that a user space 
>    scheduler helper could subscribe to. If a pinned tasks situation
>    is encountered then the notification passes the number of the
>    idle cpu.

Hmm, how about

1. Export SD information in /sys/
2. Export runqueue load information there too
3. One attribute in /sys/.../CPUn/ will be written to a CPU number m and
   a count t, which will try to pull t tasks from CPUm to CPUn
4. Another attribute would be the result of the last balance attempt
   (eg. 'all pinned').

?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux