Re: [RFC] scheduler: improve SMP fairness in CFS

On Fri, 27 Jul 2007, Chris Snook wrote:

> Tong Li wrote:
> > I'd like to clarify that I'm not trying to push this particular code to
> > the kernel. I'm a researcher. My intent was to point out that we have a
> > problem in the scheduler and my dwrr algorithm can potentially help fix
> > it. The patch itself was merely a proof-of-concept. I'd be thrilled if
> > the algorithm can be proven useful in the real world. I appreciate the
> > people who have given me comments. Since then, I've revised my
> > algorithm/code. Now it doesn't require global locking but retains strong
> > fairness properties (which I was able to prove mathematically).
>
> Thanks for doing this work.  Please don't take the implementation
> criticism as a lack of appreciation for the work.  I'd like to see dwrr
> in the scheduler, but I'm skeptical that re-introducing expired runqueues
> is the most efficient way to do it.
>
> Given the inherently controversial nature of scheduler code, particularly
> that which attempts to enforce fairness, perhaps a concise design
> document would help us come to an agreement about what we think the
> scheduler should do and what tradeoffs we're willing to make to do those
> things.  Do you have a design document we could discuss?
>
> 	-- Chris

Thanks for the interest. Attached is a design doc I wrote several months
ago (with small modifications). It describes the two pieces of my design:
group scheduling and DWRR. The description is based on the original O(1)
scheduler, but, as my CFS patch showed, the algorithm is applicable to
other underlying schedulers as well. Interestingly, I started working on
this in January with the goal of eventually writing a paper about it, so
I knew the related research reasonably well but was totally unaware that
people in the Linux community were working on similar things. This is
good. If you are interested, I'd like to help with the algorithms and
theory side of things.
  tong

-------------------------------------------
Overview:

Trio extends the existing Linux scheduler with support for proportional-share scheduling. It uses a scheduling algorithm, called Distributed Weighted Round-Robin (DWRR), that retains the existing scheduler design as much as possible and extends it to achieve proportional fairness with O(1) time complexity and a constant error bound relative to the ideal fair scheduling algorithm. The goal of Trio is not to improve interactive performance; rather, it relies on the existing scheduler for interactivity and extends it to support MP proportional fairness.
Trio has two unique features: (1) it enables users to control shares of 
CPU time for any thread or group of threads (e.g., a process, an 
application, etc.), and (2) it enables fair sharing of CPU time across 
multiple CPUs. For example, with ten tasks running on eight CPUs, Trio 
allows each task to take an equal fraction of the total CPU time. These 
features allow Trio to complement the existing Linux scheduler, enabling
greater user flexibility and stronger fairness.
Background:

Over the years, there has been a lot of criticism that conventional Unix priorities and the nice interface provide insufficient support for users to accurately control CPU shares of different threads or applications. Many have studied scheduling algorithms that achieve proportional fairness. Assuming that each thread has a weight that expresses its desired CPU share, informally, a scheduler is proportionally fair if (1) it is work-conserving, and (2) it allocates CPU time to threads in exact proportion to their weights in any time interval. Ideal proportional fairness is impractical since it requires that all runnable threads be running simultaneously and scheduled with infinitesimally small quanta. In practice, every proportional-share scheduling algorithm approximates the ideal algorithm with the goal of achieving a constant error bound. For more theoretical background, please refer to the following papers:
[1] A. K. Parekh and R. G. Gallager. A generalized processor sharing
approach to flow control in integrated services networks: The single-node
case. IEEE/ACM Transactions on Networking, 1(3):344-357, June 1993.

[2] J. C. R. Bennett and H. Zhang. WF2Q: Worst-case fair weighted fair queueing. In Proceedings of IEEE INFOCOM '96, pages 120-128, March 1996.
Previous proportional-share scheduling algorithms, however, suffer from
one or more of the following problems:
(1) Inaccurate fairness with non-constant error bounds;
(2) High run-time overhead (e.g., logarithmic);
(3) Poor scalability due to the use of a global thread queue;
(4) Inefficient support for latency-sensitive applications.

Since the Linux scheduler has been successful at avoiding problems 2 to 4, this design attempts to extend it with support for accurate proportional fairness while retaining all of its existing benefits.
User Interface:

By default, each thread is assigned a weight proportional to its static priority. A set of system calls also allows users to specify a weight or reservation for any thread. Weights are relative. For example, for two threads with weights 3 and 1, the scheduler ensures that the ratio of their CPU time is 3:1. Reservations are absolute, in the form of X% of the total CPU time. For example, a reservation of 80% for a thread means that the thread always receives at least 80% of the total CPU time, regardless of other threads.
The system calls also support specifying weights or reservations for 
groups of threads. For example, one can specify an 80% reservation for a 
group of threads (e.g., a process) to control the total CPU share to which 
the member threads are collectively entitled.  Within the group, the user 
can further specify local weights to different threads to control their 
relative shares.
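
To make the interface concrete, here is a minimal sketch of what such a
set of system calls might look like. The design doc does not name the
actual calls, so every prototype below is hypothetical:

/* Hypothetical prototypes -- illustrative only; the real patch may use
 * different names and argument conventions. */
#include <sys/types.h>

/* Relative weight: two threads with weights 3 and 1 receive CPU time
 * in a 3:1 ratio. */
int trio_set_weight(pid_t pid, int weight);

/* Absolute reservation, as a percentage of total CPU time: the thread
 * always receives at least this share, regardless of other threads. */
int trio_set_reservation(pid_t pid, int percent);

/* Group variants: give a group (e.g., a process) a weight or
 * reservation, and give each member a local weight that controls its
 * relative share within the group. */
int trio_group_set_weight(int group_id, int weight);
int trio_group_set_reservation(int group_id, int percent);
int trio_set_local_weight(pid_t pid, int weight);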
Scheduling Algorithm:

The scheduler keeps a set of data structures, called Trio groups, to maintain the weight or reservation of each thread group (consisting of one or more threads) and the local weight of each member thread. When scheduling a thread, it consults these data structures and computes (in constant time) a system-wide weight for the thread that represents an equivalent CPU share. Consequently, the scheduling algorithm, DWRR, operates solely on the system-wide weight (or weight for short, hereafter) of each thread. Having a flat space of system-wide weights for individual threads avoids performing separate scheduling at each level of the group hierarchy and thus greatly simplifies the implementation of group scheduling.
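
As a rough illustration of the constant-time flattening (a user-space
sketch under my own naming, not the actual patch code): if each group
caches the sum of its members' local weights, a thread's system-wide
weight is one multiplication and one division away:

/* Simplified model of a Trio group; all names are mine, not the patch's. */
struct trio_group {
    int weight;              /* group's share relative to other groups  */
    int total_local_weight;  /* cached sum of members' local weights    */
};

struct trio_thread {
    struct trio_group *group;
    int local_weight;        /* share relative to siblings in the group */
};

/* O(1) flattening: the group's weight is split among its members in
 * proportion to their local weights.  Assumes a non-empty group, so
 * total_local_weight > 0. */
static int system_wide_weight(const struct trio_thread *t)
{
    return t->group->weight * t->local_weight
           / t->group->total_local_weight;
}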
For each processor, besides the existing active and expired arrays, DWRR 
keeps one more array, called round-expired. It also keeps a round number 
for each processor, initially all zero. A thread is said to be in round R 
if it is in the active or expired array of a round-R processor. For each 
thread, DWRR associates it with a round slice, equal to its weight 
multiplied by a system constant, called base round slice, which controls 
the total time that the thread can run in any round.  When a thread 
exhausts its time slice, as in the existing scheduler, DWRR moves it to 
the expired array. However, when it exhausts its round slice, DWRR moves 
it to the round-expired array, indicating that the thread has finished 
round R. In this way, all threads in the active and expired arrays of a
round-R processor are running in round R, while the threads in the
round-expired array have finished round R and are waiting to start round
R+1. Threads in the active and expired arrays are scheduled the same way
as in the existing scheduler.
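
The tick-time bookkeeping this implies might look like the following
user-space sketch (the struct names and helpers are stand-ins for their
O(1)-scheduler counterparts, not the actual patch code):

struct prio_array;                    /* per-priority runqueues, as in O(1) */

struct task {
    int weight;                       /* system-wide weight                 */
    int time_slice_left;              /* ticks left in the time slice       */
    int round_slice_left;             /* ticks left in the round slice      */
};

struct dwrr_rq {                      /* simplified per-CPU state           */
    struct prio_array *active;        /* threads running in round R         */
    struct prio_array *expired;       /* time slice used up, still round R  */
    struct prio_array *round_expired; /* finished round R, awaiting R+1     */
    unsigned long round;              /* this CPU's current round R         */
};

extern void enqueue(struct prio_array *a, struct task *t);
extern void dequeue(struct prio_array *a, struct task *t);
extern int  task_timeslice(struct task *t);

/* Timer-tick accounting: round_slice = weight * base_round_slice caps
 * the total CPU time a thread may consume within one round. */
void dwrr_tick(struct dwrr_rq *rq, struct task *t)
{
    t->time_slice_left--;
    t->round_slice_left--;

    if (t->round_slice_left <= 0) {
        /* Round slice exhausted: the thread has finished round R and
         * waits in round-expired until this CPU advances to R+1. */
        dequeue(rq->active, t);
        enqueue(rq->round_expired, t);
    } else if (t->time_slice_left <= 0) {
        /* Ordinary time-slice expiry, as in the existing scheduler:
         * recharge and move to the expired array. */
        t->time_slice_left = task_timeslice(t);
        dequeue(rq->active, t);
        enqueue(rq->expired, t);
    }
}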
When a processor's active array is empty, as usual, the active and expired 
arrays are switched. When both active and expired are empty, DWRR 
eventually wants to switch the active and round-expired arrays, thus 
advancing the current processor to the next round. However, to guarantee 
fairness, it needs to maintain the invariant that the rounds of all
processors differ by at most a constant, where the smaller this constant
is, the stronger the fairness it can guarantee (the following assumes the
constant is 1). With this invariant, it can be shown that, during any
time interval, the numbers of rounds that any two threads go through
differ by at most that constant, which is key to ensuring DWRR's constant
error bound relative to the ideal algorithm.
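
A back-of-envelope way to see why this yields a constant bound (my
notation, not the formal proof): let B be the base round slice, so a
thread of weight w_i receives w_i B units of CPU time for each round it
completes. Over an interval in which thread i completes R_i rounds,

    T_i = R_i w_i B + \epsilon_i,   |\epsilon_i| \le c\, w_i B,

where \epsilon_i accounts for partial rounds at the interval boundaries
and c is a small constant. If |R_i - R_j| \le 1 for any two threads i
and j, then

    |T_i / w_i - T_j / w_j| \le (1 + 2c) B,

a constant independent of the interval length, which is exactly the
claimed error bound.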
To enforce the above invariant, DWRR keeps track of the highest round 
(referred to as highest) among all processors at any time and ensures that 
no processor in round highest can advance to round highest+1 (thus 
updating highest), if there exists at least one thread in the system that 
is still in round highest. There are at least two approaches to
maintaining a global highest-round variable. One is to protect it with a
global lock to keep its value consistent; however, this may not be
scalable. The second is to use no locking, which can lead to stale reads
of the value. Such inconsistencies do not affect the correctness of the
kernel; the only impact is that the fairness error of the scheduler can
be twice as large as with the locking approach, but it is still bounded
by a constant and thus sufficient in most cases.
The following describes the operations of DWRR, assuming the locking 
approach, while the non-locking approach requires only simple changes.
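
For the non-locking variant, a plain atomic maximum is enough. A sketch
using C11 atomics (kernel code would use its own primitives; this only
shows the idea):

#include <stdatomic.h>

static atomic_ulong dwrr_highest;   /* highest round among all CPUs */

/* Readers may see a slightly stale value; per the argument above this
 * at most doubles the fairness error but never breaks correctness. */
static unsigned long read_highest(void)
{
    return atomic_load_explicit(&dwrr_highest, memory_order_relaxed);
}

/* A CPU that advances its round raises the global maximum; the CAS
 * loop retries if another CPU raced past us in the meantime. */
static void raise_highest(unsigned long new_round)
{
    unsigned long cur = atomic_load_explicit(&dwrr_highest,
                                             memory_order_relaxed);
    while (cur < new_round &&
           !atomic_compare_exchange_weak(&dwrr_highest, &cur, new_round))
        ;  /* cur is refreshed by the failed CAS; retry if still behind */
}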
On any processor p, whenever both the active and expired arrays become 
empty, DWRR compares the round of p with highest. If equal, it performs 
idle load balancing in two steps: (1) it identifies runnable threads that
are in round highest but not currently running; such threads can be in
the active or expired array of a round-highest processor, or in the
round-expired array of a round (highest - 1) processor. (2) Among the
threads from step 1, it moves X of them to the active array of p, where X
is a design choice that does not impact the fairness properties of DWRR.
If step 1 finds no suitable threads, DWRR proceeds as if the round of
processor p were less than highest, in which case it switches p's active
and round-expired arrays and increments p's round by one, thus allowing
all threads in its round-expired array to advance to the next round.
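
Continuing the earlier sketch, the idle-time decision could be written as
follows (migrate_round_threads() is an assumed helper implementing step 1
above, and X is the design-choice constant from the text):

#define X 1   /* how many threads to pull; any constant preserves fairness */

extern int migrate_round_threads(struct dwrr_rq *dst,
                                 unsigned long round, int nr);

static void swap_arrays(struct prio_array **a, struct prio_array **b)
{
    struct prio_array *tmp = *a;
    *a = *b;
    *b = tmp;
}

/* Called on processor p when both its active and expired arrays are
 * empty. */
void dwrr_idle_balance(struct dwrr_rq *p)
{
    unsigned long highest = read_highest();

    /* Case 1: p is in round highest and some runnable thread in the
     * system is still in that round -- pull up to X of them and stay. */
    if (p->round == highest &&
        migrate_round_threads(p, highest, X) > 0)
        return;

    /* Case 2: p is behind, or round highest has no threads left --
     * promote p's round-expired threads and advance to the next round. */
    swap_arrays(&p->active, &p->round_expired);
    p->round++;
    raise_highest(p->round);
}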
Whenever the system creates a new thread or awakens an existing one, DWRR 
inserts the thread into the active array of an idle processor and sets the 
processor's round to the current value of highest. If no idle processor 
exists, it starts the thread on the least loaded processor among those in 
round highest.
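
In the same sketch, placement of a new or newly woken thread would be
(find_idle_cpu() and least_loaded_cpu_in_round() are assumed helpers):

extern struct dwrr_rq *find_idle_cpu(void);
extern struct dwrr_rq *least_loaded_cpu_in_round(unsigned long round);

void dwrr_place(struct task *t)
{
    unsigned long highest = read_highest();
    struct dwrr_rq *cpu = find_idle_cpu();

    if (cpu)
        cpu->round = highest;   /* an idle CPU joins the current round */
    else
        cpu = least_loaded_cpu_in_round(highest);

    enqueue(cpu->active, t);
}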
Whenever a processor goes idle (i.e., all of its three arrays are empty), 
DWRR resets its round to zero. Similar to the existing scheduler, DWRR 
also performs periodic load balancing but only among processors in round 
highest. Unlike idle load balancing, periodic load balancing only improves 
performance and is not necessary for fairness.