Chris Snook wrote:
Tong Li wrote:
This patch extends CFS to achieve better fairness for SMPs. For
example, with 10 tasks (same priority) on 8 CPUs, it enables each task
to receive equal CPU time (80%). The code works on top of CFS and
provides SMP fairness at a coarser time grainularity; local on each
CPU, it relies on CFS to provide fine-grained fairness and good
interactivity.
The code is based on the distributed weighted round-robin (DWRR)
algorithm. It keeps two RB trees on each CPU: one is the original
cfs_rq, referred to as active, and one is a new cfs_rq, called
round-expired. Each CPU keeps a round number, initially zero. The
scheduler works exactly the same way as in CFS, but only runs tasks
from the active tree. Each task is assigned a round slice, equal to
its weight times a system constant (e.g., 100ms), controlled by
sysctl_base_round_slice. When a task uses up its round slice, it moves
to the round-expired tree on the same CPU and stops running. Thus, at
any time on each CPU, the active tree contains all tasks that are
running in the current round, while tasks in round-expired have all
finished the current round and await to start the next round. When an
active tree becomes empty, it calls idle_balance() to grab tasks of
the same round from other CPUs. If none can be moved over, it switches
its active and round-expired trees, thus unleashing round-expired
tasks and advancing the local round number by one. An invariant it
maintains is that the round numbers of any two CPUs in the system
differ by at most one. This property ensures fairness across CPUs. The
variable sysctl_base_round_slice controls fairness-performance
tradeoffs: a smaller value leads to better cross-CPU fairness at the
potential cost of performance; on the other hand, the larger the value
is, the closer the system behavior is to the default CFS without the
patch.
Any comments and suggestions would be highly appreciated.
This patch is massive overkill. Maybe you're not seeing the overhead on
your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a
partially-RT workload. Do you have any data justifying the need for
this patch?
Doing anything globally is expensive, and should be avoided at all
costs. The scheduler already rebalances when a CPU is idle, so you're
really just rebalancing the overload here. On a server workload, we
don't necessarily want to do that, since the overload may be multiple
threads spawned to service a single request, and could be sharing a lot
of data.
Instead of an explicit system-wide fairness invariant (which will get
very hard to enforce when you throw SCHED_FIFO processes into the mix
and the scheduler isn't running on some CPUs), try a simpler invariant.
If we guarantee that the load on CPU X does not differ from the load on
CPU (X+1)%N by more than some small constant, then we know that the
system is fairly balanced. We can achieve global fairness with local
balancing, and avoid all this overhead. This has the added advantage of
keeping most of the migrations core/socket/node-local on
SMT/multicore/NUMA systems.
-- Chris
To clarify, I'm not suggesting that the "balance with cpu (x+1)%n only"
algorithm is the only way to do this. Rather, I'm pointing out that
even an extremely simple algorithm can give you fair loading when you
already have CFS managing the runqueues. There are countless more
sophisticated ways we could do this without using global locking, or
possibly without any locking at all, other than the locking we already
use during migration.
-- Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]