Re: [RFC] scheduler: improve SMP fairness in CFS

Tong Li wrote:

On Fri, 27 Jul 2007, Chris Snook wrote:
Tong Li wrote:
I'd like to clarify that I'm not trying to push this particular code to the kernel. I'm a researcher. My intent was to point out that we have a problem in the scheduler and my dwrr algorithm can potentially help fix it. The patch itself was merely a proof-of-concept. I'd be thrilled if the algorithm can be proven useful in the real world. I appreciate the people who have given me comments. Since then, I've revised my algorithm/code. Now it doesn't require global locking but retains strong fairness properties (which I was able to prove mathematically).
Thanks for doing this work. Please don't take the implementation criticism as a lack of appreciation for the work. I'd like to see dwrr in the scheduler, but I'm skeptical that re-introducing expired runqueues is the most efficient way to do it.
Given the inherently controversial nature of scheduler code, particularly that which attempts to enforce fairness, perhaps a concise design document would help us come to an agreement about what we think the scheduler should do and what tradeoffs we're willing to make to do those things. Do you have a design document we could discuss?
    -- Chris
Thanks for the interest. Attached is a design doc I wrote several months ago (with small modifications). It talks about the two pieces of my design: group scheduling and dwrr. The description was based on the original O(1) scheduler, but as my CFS patch showed, the algorithm is applicable to other underlying schedulers as well. It's interesting that I started working on this in January for the purpose of eventually writing a paper about it. So I knew the related research work reasonably well but was totally unaware that people in the Linux community were also working on similar things. This is good. If you are interested, I'd like to help with the algorithms and theory side of things.
  tong

-------------------------------------------
Overview:
Trio extends the existing Linux scheduler with support for proportional-share scheduling. It uses a scheduling algorithm, called Distributed Weighted Round-Robin (DWRR), which retains the existing scheduler design as much as possible, and extends it to achieve proportional fairness with O(1) time complexity and a constant error bound, compared to the ideal fair scheduling algorithm. The goal of Trio is not to improve interactive performance; rather, it relies on the existing scheduler for interactivity and extends it to support MP proportional fairness.
Trio has two unique features: (1) it enables users to control shares of CPU time for any thread or group of threads (e.g., a process, an application, etc.), and (2) it enables fair sharing of CPU time across multiple CPUs. For example, with ten tasks running on eight CPUs, Trio allows each task to take an equal fraction of the total CPU time. These features enable Trio to complement the existing Linux scheduler to enable greater user flexibility and stronger fairness.
Background:
Over the years, there has been a lot of criticism that conventional Unix priorities and the nice interface provide insufficient support for users to accurately control CPU shares of different threads or applications. Many have studied scheduling algorithms that achieve proportional fairness. Assuming that each thread has a weight that expresses its desired CPU share, informally, a scheduler is proportionally fair if (1) it is work-conserving, and (2) it allocates CPU time to threads in exact proportion to their weights in any time interval. Ideal proportional fairness is impractical since it requires that all runnable threads be running simultaneously and scheduled with infinitesimally small quanta. In practice, every proportional-share scheduling algorithm approximates the ideal algorithm with the goal of achieving a constant error bound. For more theoretical background, please refer to the following papers:

I don't think that achieving a constant error bound is always a good thing. We all know that fairness has overhead. If I have 3 threads and 2 processors, and I have a choice between fairly giving each thread 1.0 billion cycles during the next second, or unfairly giving two of them 1.1 billion cycles and giving the other 0.9 billion cycles, then we can have a useful discussion about where we want to draw the line on the fairness/performance tradeoff. On the other hand, if we can give two of them 1.1 billion cycles and still give the other one 1.0 billion cycles, it's madness to waste those 0.2 billion cycles just to avoid user jealousy. The more complex the memory topology of a system, the more "free" cycles you'll get by tolerating short-term unfairness. As a crude heuristic, scaling some fairly low tolerance by log2(NCPUS) seems appropriate, but eventually we should take the boot-time computed migration costs into consideration.
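
To make the log2(NCPUS) heuristic above concrete, here is a minimal user-space sketch. The function names, the nanosecond unit, and the idea of comparing a per-thread "lag" against the tolerance are all assumptions made for illustration; nothing like this appears in the patch under discussion.

```c
#include <stdint.h>

/* floor(log2(n)) for n >= 1; returns 0 for n == 0 or n == 1. */
static unsigned int ilog2_u32(uint32_t n)
{
    unsigned int log = 0;

    while (n > 1) {
        n >>= 1;
        log++;
    }
    return log;
}

/* Hypothetical: unfairness (in ns) tolerated before a rebalance is forced. */
static uint64_t fairness_tolerance_ns(uint64_t base_ns, uint32_t ncpus)
{
    return base_ns * ilog2_u32(ncpus);
}

/* lag_ns: how far a thread is ahead (+) or behind (-) its fair share. */
static int needs_rebalance(int64_t lag_ns, uint64_t base_ns, uint32_t ncpus)
{
    uint64_t mag = lag_ns < 0 ? (uint64_t)0 - (uint64_t)lag_ns
                              : (uint64_t)lag_ns;

    return mag > fairness_tolerance_ns(base_ns, ncpus);
}
```

With a 1 ms base tolerance on 8 CPUs, this sketch tolerates up to 3 ms of lag per thread. Note that it yields zero tolerance on a uniprocessor, where no migration cost exists to amortize.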

[1] A. K. Parekh and R. G. Gallager. A generalized processor sharing
approach to flow control in integrated services networks: The single-node
case. IEEE/ACM Transactions on Networking, 1(3):344-357, June 1993.
[2] C. R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair queueing. In Proceedings of IEEE INFOCOM '96, pages 120-128, Mar. 1996.
Previous proportional-share scheduling algorithms, however, suffer from one or more of the following problems:
(1) Inaccurate fairness with non-constant error bounds;
(2) High run-time overhead (e.g., logarithmic);
(3) Poor scalability due to the use of a global thread queue;
(4) Inefficient support for latency-sensitive applications.
Since the Linux scheduler has been successful at avoiding problems 2 to 4, this design attempts to extend it with support for accurate proportional fairness while retaining all of its existing benefits.

If we allow a little short-term unfairness (and I think we should) we can still account for this unfairness and compensate for it (again, with the same tolerance) at the next rebalancing.

User Interface:
By default, each thread is assigned a weight proportional to its static priority. A set of system calls also allows users to specify a weight or reservation for any thread. Weights are relative. For example, for two threads with weights 3 and 1, the scheduler ensures that the ratio of their CPU time is 3:1. Reservations are absolute and in the form of X% of the total CPU time. For example, a reservation of 80% for a thread means that the thread always receives at least 80% of the total CPU time regardless of other threads.
The system calls also support specifying weights or reservations for groups of threads. For example, one can specify an 80% reservation for a group of threads (e.g., a process) to control the total CPU share to which the member threads are collectively entitled. Within the group, the user can further specify local weights for different threads to control their relative shares.

Adding system calls, while great for research, is not something which is done lightly in the published kernel. If we're going to implement a user interface beyond simply interpreting existing priorities more precisely, it would be nice if this was part of a framework with a broader vision, such as a scheduler economy.

Scheduling Algorithm:
The scheduler keeps a set of data structures, called Trio groups, to maintain the weight or reservation of each thread group (including one or more threads) and the local weight of each member thread. When scheduling a thread, it consults these data structures and computes (in constant time) a system-wide weight for the thread that represents an equivalent CPU share. Consequently, the scheduling algorithm, DWRR, operates solely based on the system-wide weight (or weight for short, hereafter) of each thread. Having a flat space of system-wide weights for individual threads avoids performing separate scheduling at each level of the group hierarchy and thus greatly simplifies the implementation of group scheduling.
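
The design doc does not give the formula for the system-wide weight, so the following is only one plausible sketch: a thread receives its group's weight scaled by its fraction of the group's local weights. The struct layout, the cached `local_weight_sum` (which keeps the lookup constant-time), and the fixed-point `WEIGHT_SCALE` are assumptions for illustration and handle weights only, not reservations.

```c
/* Hypothetical Trio group: one weight shared by all member threads. */
struct trio_group {
    unsigned long weight;           /* group weight relative to other groups */
    unsigned long local_weight_sum; /* cached sum of members' local weights */
};

/* Fixed-point scale so that fractional shares survive integer division. */
#define WEIGHT_SCALE 1024UL

/* O(1): the group's weight times the thread's share of the local weights. */
static unsigned long system_wide_weight(const struct trio_group *g,
                                        unsigned long local_weight)
{
    if (g->local_weight_sum == 0)
        return 0;
    return g->weight * WEIGHT_SCALE * local_weight / g->local_weight_sum;
}
```

For a group of weight 3 whose two threads have local weights 1 and 3, the flat weights come out as 768 and 2304, which sum to three times the 1024 of a single-thread group of weight 1.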

Implementing a flat weight space efficiently is nontrivial. I'm curious to see how you reworked the original patch without global locking.

For each processor, besides the existing active and expired arrays, DWRR keeps one more array, called round-expired. It also keeps a round number for each processor, initially all zero. A thread is said to be in round R if it is in the active or expired array of a round-R processor. DWRR associates each thread with a round slice, equal to its weight multiplied by a system constant, called base round slice, which controls the total time that the thread can run in any round. When a thread exhausts its time slice, as in the existing scheduler, DWRR moves it to the expired array. However, when it exhausts its round slice, DWRR moves it to the round-expired array, indicating that the thread has finished round R. In this way, all threads in the active and expired arrays on a round-R processor are running in round R, while the threads in the round-expired array have finished round R and are waiting to start round R+1. Threads in the active and expired arrays are scheduled the same way as in the existing scheduler.
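
The two-level slice accounting described above can be sketched as follows. This is a simplified user-space model, not code from the patch: the tick granularity, the constants `BASE_ROUND_SLICE` and `TIME_SLICE`, and all identifiers are assumed for illustration.

```c
enum dwrr_array { DWRR_ACTIVE, DWRR_EXPIRED, DWRR_ROUND_EXPIRED };

struct dwrr_thread {
    unsigned int weight;
    unsigned int time_slice;  /* ticks left in the current time slice */
    unsigned int round_slice; /* ticks left in the current round */
};

#define BASE_ROUND_SLICE 100 /* assumed system constant, in ticks */
#define TIME_SLICE 10        /* assumed fixed time slice, in ticks */

/* Entering a new round: round slice = weight * base round slice. */
static void dwrr_start_round(struct dwrr_thread *t)
{
    t->round_slice = t->weight * BASE_ROUND_SLICE;
    t->time_slice = TIME_SLICE;
}

/* Charge one tick and report which array the thread belongs in next. */
static enum dwrr_array dwrr_tick(struct dwrr_thread *t)
{
    if (t->round_slice > 0)
        t->round_slice--;
    if (t->time_slice > 0)
        t->time_slice--;

    if (t->round_slice == 0)
        return DWRR_ROUND_EXPIRED; /* finished round R, waits for R+1 */
    if (t->time_slice == 0) {
        t->time_slice = TIME_SLICE;
        return DWRR_EXPIRED;       /* still in round R */
    }
    return DWRR_ACTIVE;
}
```

In this model a thread of weight 2 runs for 200 ticks per round, visiting the expired array after each of its first 19 time slices and the round-expired array after the 20th.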

I had a feeling this patch was originally designed for the O(1) scheduler, and this is why. The old scheduler had expired arrays, so adding a round-expired array wasn't a radical departure from the design. CFS does not have an expired rbtree, so adding one *is* a radical departure from the design. I think we can implement DWRR or something very similar without using this implementation method. Since we've already got a tree of queued tasks, it might be easiest to basically break off one subtree (usually just one task, but not necessarily) and migrate it to a less loaded tree whenever we can reduce the difference between the load on the two trees by at least half. This would prevent both overcorrection and undercorrection.
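
The "reduce the difference by at least half" rule can be written down as a small predicate. This is a sketch under assumed names and an abstract notion of load (for example, summed task weights); it only illustrates the arithmetic of the suggestion, not any existing kernel code.

```c
/*
 * Return 1 if moving a task (or subtree) of load task_load from the
 * busier queue to the less loaded one cuts the load difference at
 * least in half, and 0 otherwise.
 */
static int should_migrate(unsigned long src_load, unsigned long dst_load,
                          unsigned long task_load)
{
    unsigned long before, after_src, after_dst, after;

    if (src_load <= dst_load || task_load > src_load)
        return 0;

    before = src_load - dst_load;
    after_src = src_load - task_load;
    after_dst = dst_load + task_load;
    after = after_src > after_dst ? after_src - after_dst
                                  : after_dst - after_src;

    return 2 * after <= before;
}
```

Algebraically, the test accepts only loads between one quarter and three quarters of the current difference: anything smaller undercorrects, and anything larger overshoots and leaves the destination overloaded by more than half the original gap.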

When a processor's active array is empty, as usual, the active and expired arrays are switched. When both active and expired are empty, DWRR eventually wants to switch the active and round-expired arrays, thus advancing the current processor to the next round. However, to guarantee fairness, it needs to maintain the invariant that the difference between any two processors' rounds is bounded by a constant, where the smaller this constant is, the stronger fairness it can guarantee (the following assumes the constant is 1). With this invariant, it can be shown that, during any time interval, the number of rounds that any two threads go through differs by at most the constant, which is key to ensuring DWRR's constant error bound compared to the ideal algorithm.
To enforce the above invariant, DWRR keeps track of the highest round (referred to as highest) among all processors at any time and ensures that no processor in round highest can advance to round highest+1 (thus updating highest), if there exists at least one thread in the system that is still in round highest. There are at least two approaches to maintain a global highest round variable. One is to associate it with a global lock to ensure consistency of its value. However, this may not be scalable. Thus, a second approach is to use no locking, but it could lead to inconsistencies in the value. However, such inconsistencies don't affect correctness of the kernel and the only impact is that the fairness error of the scheduler can be twice as big as with the locking approach, but the error is still bounded by a constant and thus sufficient in most cases. The following describes the operations of DWRR, assuming the locking approach, while the non-locking approach requires only simple changes.

The idea of rounds was another implementation detail that bothered me. In the old scheduler, quantizing CPU time was a necessary evil. Now that we can account for CPU time with nanosecond resolution, doing things on an as-needed basis seems more appropriate, and should reduce the need for global synchronization.

On any processor p, whenever both the active and expired arrays become empty, DWRR compares the round of p with highest. If equal, it performs idle load balancing in two steps: (1) It identifies runnable threads that are in round highest but not currently running. Such threads can be in the active or expired array of a round highest processor, or in the round-expired array of a round highest - 1 processor. (2) Among those threads from step 1, it moves X of them to the active array of p, where X is a design choice and does not impact the fairness properties of DWRR. If step 1 returns no suitable threads, DWRR proceeds as if the round of processor p is less than highest, in which case DWRR switches p's active and round-expired arrays, and increments p's round by one, thus allowing all threads in its round-expired array to advance to the next round.
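
The decision made when a processor's active and expired arrays are both empty can be sketched as below. This models only the control flow for the locking approach: the caller is assumed to hold the global lock protecting `highest`, and `stealable` stands for the number of threads found in step 1. All identifiers are hypothetical.

```c
enum dwrr_action { DWRR_STEAL, DWRR_ADVANCE_ROUND };

/*
 * Called for processor p when its active and expired arrays are empty.
 * stealable: runnable, not-running threads still in round *highest.
 */
static enum dwrr_action dwrr_on_empty(unsigned int *cpu_round,
                                      unsigned int *highest,
                                      unsigned int stealable)
{
    /* p is in round highest and other work remains there: pull it over. */
    if (*cpu_round == *highest && stealable > 0)
        return DWRR_STEAL;

    /*
     * Either p is behind highest, or step 1 found nothing to move.
     * The caller switches the active and round-expired arrays here.
     */
    (*cpu_round)++;
    if (*cpu_round > *highest)
        *highest = *cpu_round;
    return DWRR_ADVANCE_ROUND;
}
```

The number of threads actually moved on `DWRR_STEAL` (the X of step 2) is left to the caller, since the text notes that it does not affect fairness.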
Whenever the system creates a new thread or awakens an existing one, DWRR inserts the thread into the active array of an idle processor and sets the processor's round to the current value of highest. If no idle processor exists, it starts the thread on the least loaded processor among those in round highest.
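
The placement rule for new and newly woken threads can be sketched as a simple selection function. The per-CPU struct, the use of `nr_running` as the load metric, and the function name are assumptions for illustration only.

```c
struct dwrr_cpu {
    unsigned int round;
    unsigned int nr_running; /* assumed load metric: runnable threads */
};

/*
 * Pick a CPU for a new or woken thread. Returns the CPU index, or -1
 * if there is no idle CPU and no CPU is in round highest.
 */
static int dwrr_select_cpu(struct dwrr_cpu *cpus, int ncpus,
                           unsigned int highest)
{
    int i, best = -1;

    /* Prefer an idle CPU and bring it up to the current highest round. */
    for (i = 0; i < ncpus; i++) {
        if (cpus[i].nr_running == 0) {
            cpus[i].round = highest;
            return i;
        }
    }

    /* Otherwise take the least loaded CPU among those in round highest. */
    for (i = 0; i < ncpus; i++) {
        if (cpus[i].round != highest)
            continue;
        if (best < 0 || cpus[i].nr_running < cpus[best].nr_running)
            best = i;
    }
    return best;
}
```

Restricting the fallback to round-highest processors keeps a newly started thread from joining a round that the rest of the system has already left behind.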
Whenever a processor goes idle (i.e., all three of its arrays are empty), DWRR resets its round to zero. Similar to the existing scheduler, DWRR also performs periodic load balancing but only among processors in round highest. Unlike idle load balancing, periodic load balancing only improves performance and is not necessary for fairness.

In summary, I think the accounting is sound, but the enforcement is sub-optimal for the new scheduler. A revision of the algorithm more cognizant of the capabilities and design of the current scheduler would seem to be in order.

I've referenced many times my desire to account for CPU/memory hierarchy in these patches. At present, I'm not sure we have sufficient infrastructure in the kernel to automatically optimize for system topology, but I think whatever design we pursue should have some concept of this hierarchy, even if we end up using a depth-1 tree in the short term while we figure out how to optimize this.

	-- Chris