Tong Li wrote:
This patch extends CFS to achieve better fairness for SMPs. For example,
with 10 tasks (same priority) on 8 CPUs, it enables each task to receive
equal CPU time (80% of a CPU). The code works on top of CFS and
provides SMP fairness at a coarser time granularity; locally on each
CPU, it relies on CFS to provide fine-grained fairness and good
interactivity.
The code is based on the distributed weighted round-robin (DWRR)
algorithm. It keeps two RB trees on each CPU: one is the original
cfs_rq, referred to as active, and one is a new cfs_rq, called
round-expired. Each CPU keeps a round number, initially zero. The
scheduler works exactly the same way as in CFS, but only runs tasks from
the active tree. Each task is assigned a round slice, equal to its
weight times a system constant (e.g., 100ms), controlled by
sysctl_base_round_slice. When a task uses up its round slice, it moves
to the round-expired tree on the same CPU and stops running. Thus, at
any time on each CPU, the active tree contains all tasks that are
running in the current round, while tasks in round-expired have all
finished the current round and await the start of the next round. When
a CPU's active tree becomes empty, the scheduler calls idle_balance()
to grab tasks of the same round from other CPUs. If none can be moved
over, it switches the CPU's
active and round-expired trees, thus unleashing round-expired tasks and
advancing the local round number by one. An invariant it maintains is
that the round numbers of any two CPUs in the system differ by at most
one. This property ensures fairness across CPUs. The variable
sysctl_base_round_slice controls the fairness-performance tradeoff: a
smaller value yields better cross-CPU fairness at a potential cost in
performance, while a larger value brings the system's behavior closer
to that of stock CFS without the patch.
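To make the mechanics concrete, here is a minimal userspace sketch of
the per-CPU round logic described above. The names (struct cpu_rq,
account(), advance_round()) are simplified stand-ins rather than the
actual patch code, and a linked list stands in for the two RB trees
(cfs_rq's) the patch keeps per CPU:

#include <stdio.h>

#define BASE_ROUND_SLICE 100 /* ms; stands in for sysctl_base_round_slice */

struct task {
	int weight;             /* CFS-style load weight */
	long slice_left;        /* remaining round slice, in ms */
	struct task *next;
};

struct cpu_rq {
	struct task *active;        /* tasks still running this round */
	struct task *round_expired; /* tasks done with this round */
	unsigned long round;        /* local round number */
};

/* Round slice = task weight times a system constant. */
static void refresh_slice(struct task *t)
{
	t->slice_left = (long)t->weight * BASE_ROUND_SLICE;
}

/* Charge runtime to the running task; once its round slice is used
 * up, move it to round-expired so it stops running this round. */
static void account(struct cpu_rq *rq, long ran)
{
	struct task *t = rq->active;

	if (!t)
		return;
	t->slice_left -= ran;
	if (t->slice_left <= 0) {
		rq->active = t->next;
		t->next = rq->round_expired;
		rq->round_expired = t;
	}
}

/* Called when active drains and idle_balance() found no same-round
 * tasks to pull: swap the trees and start the next round. Cross-CPU,
 * the invariant is that any two local round numbers differ by at
 * most one. */
static void advance_round(struct cpu_rq *rq)
{
	rq->active = rq->round_expired;
	rq->round_expired = NULL;
	rq->round++;
	for (struct task *t = rq->active; t; t = t->next)
		refresh_slice(t);
}

int main(void)
{
	struct task a = { .weight = 1 }, b = { .weight = 2 };
	struct cpu_rq rq = { .active = &a };

	a.next = &b;
	refresh_slice(&a);
	refresh_slice(&b);

	while (rq.active)        /* run in 50ms ticks until the round ends */
		account(&rq, 50);
	advance_round(&rq);
	printf("advanced to round %lu\n", rq.round);
	return 0;
}

Note that ending a round is just the pointer exchange in
advance_round(), so the only real cost is the idle_balance() attempt
that precedes it.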
Any comments and suggestions would be highly appreciated.
This patch is massive overkill. Maybe you're not seeing the overhead on your
8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT
workload. Do you have any data justifying the need for this patch?
Doing anything globally is expensive, and should be avoided at all costs. The
scheduler already rebalances when a CPU is idle, so you're really just
rebalancing the overload here. On a server workload, we don't necessarily want
to do that, since the overload may be multiple threads spawned to service a
single request, and could be sharing a lot of data.
Instead of an explicit system-wide fairness invariant (which will get very hard
to enforce when you throw SCHED_FIFO processes into the mix and the scheduler
isn't running on some CPUs), try a simpler invariant. If we guarantee that the
load on CPU X does not differ from the load on CPU (X+1)%N by more than some
small constant, then we know that the system is fairly balanced. We can achieve
global fairness with local balancing, and avoid all this overhead. This has the
added advantage of keeping most of the migrations core/socket/node-local on
SMT/multicore/NUMA systems.
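A minimal sketch of that ring-neighbor scheme, to show the idea; NCPUS,
GAP, and balance_pass() are made up for the example, not kernel code.
Each CPU compares its load only with CPU (X+1)%N and sheds one unit of
work when the gap exceeds a small constant, and repeated purely local
passes drive the whole ring toward balance:

#include <stdio.h>

#define NCPUS 8
#define GAP 1 /* max tolerated load gap between ring neighbors */

static int load[NCPUS] = { 10, 0, 0, 0, 5, 0, 0, 0 };

/* One purely local pass: each CPU looks only at its ring neighbor
 * and moves one unit of load across when the gap exceeds GAP. */
static int balance_pass(void)
{
	int moved = 0;

	for (int x = 0; x < NCPUS; x++) {
		int n = (x + 1) % NCPUS;

		if (load[x] - load[n] > GAP) {
			load[x]--; load[n]++; moved++;
		} else if (load[n] - load[x] > GAP) {
			load[n]--; load[x]++; moved++;
		}
	}
	return moved;
}

int main(void)
{
	int passes = 0;

	while (balance_pass())
		passes++;
	printf("balanced after %d passes:", passes);
	for (int x = 0; x < NCPUS; x++)
		printf(" %d", load[x]);
	printf("\n");
	return 0;
}

Once no pass moves anything, every neighbor gap is at most GAP, so the
global max-min load is bounded by GAP * N/2 without any CPU ever
looking past its neighbor.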
-- Chris