Re: [RFC] scheduler: improve SMP fairness in CFS

Tong Li wrote:

On Mon, 23 Jul 2007, Chris Snook wrote:
This patch is massive overkill. Maybe you're not seeing the overhead on your 8-way box, but I bet we'd see it on a 4096-way NUMA box with a partially-RT workload. Do you have any data justifying the need for this patch?
Doing anything globally is expensive, and should be avoided at all costs. The scheduler already rebalances when a CPU is idle, so you're really just rebalancing the overload here. On a server workload, we don't necessarily want to do that, since the overload may be multiple threads spawned to service a single request, and could be sharing a lot of data.
Instead of an explicit system-wide fairness invariant (which will get very hard to enforce when you throw SCHED_FIFO processes into the mix and the scheduler isn't running on some CPUs), try a simpler invariant. If we guarantee that the load on CPU X does not differ from the load on CPU (X+1)%N by more than some small constant, then we know that the system is fairly balanced. We can achieve global fairness with local balancing, and avoid all this overhead. This has the added advantage of keeping most of the migrations core/socket/node-local on SMT/multicore/NUMA systems.
Chris,
These are all good comments. Thanks. I see three concerns and I'll try to address each.
1. Unjustified effort/cost
My view is that fairness (or proportional fairness) is a first-order metric and necessary in many cases even at the cost of performance.

In the cases where it's critical, we have realtime. In the cases where it's important, this implementation won't keep latency low enough to make people happier. If you've got a test case to prove me wrong, I'd like to see it.

A server running multiple client apps certainly doesn't want the clients to see that they are getting different amounts of service, assuming the clients are of equal importance (priority).

A conventional server receives client requests, does a brief amount of work, and then gives a response. This patch doesn't help that workload. This patch helps the case where you've got batch jobs running on a slightly overloaded compute server, and unfairness means you end up waiting for a couple threads to finish at the end while CPUs sit idle. I don't think it's that big of a problem, and if it is, I think we can solve it in a more elegant way than reintroducing expired queues.

When the clients have different priorities, the server also wants to give them service time proportional to their priority/weight. The same is true for desktops, where users want to nice tasks and see an effect that's consistent with what they expect, i.e., task CPU time should be proportional to their nice values. The point is that it's important to enforce fairness because it enables users to control the system in a deterministic way and it helps each task get good response time. CFS achieves this on local CPUs and this patch makes the support stronger for SMPs. It's overkill to enforce an unnecessary degree of fairness, but it is necessary to enforce an error bound, even if large, such that the user can reliably know what kind of CPU time (even performance) he'd get after making a nice value change.


Doesn't CFS already do this?

This patch ensures an error bound of (max task weight currently in system) * sysctl_base_round_slice compared to an idealized fair system.
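
As a rough illustration of the bound stated above, here is a minimal Python sketch. It assumes weights are normalized so that a nice 0 task has weight 1.0 and that the round slice is given in milliseconds; the function name and the sample weights are hypothetical and do not come from the patch.

```python
def fairness_error_bound_ms(task_weights, base_round_slice_ms=30.0):
    """Return (max task weight) * base_round_slice, in milliseconds.

    Assumption: weights are normalized so that nice 0 == 1.0.
    The 30 ms default mirrors the example value mentioned in this thread.
    """
    if not task_weights:
        return 0.0  # no runnable tasks, so nothing can be treated unfairly
    return max(task_weights) * base_round_slice_ms


# Hypothetical mix: one nice 0 task, one heavier task, one lighter task.
bound = fairness_error_bound_ms([1.0, 2.0, 0.5])
print(f"error bound: {bound} ms")
```

Under these assumptions the bound depends only on the heaviest task in the system, not on the number of tasks or CPUs.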

The thing that bugs me about this is the diminishing returns. It looks like it will only give a substantial benefit when system load is somewhere between 1.0 and 2.0. On a heavily-loaded system, CFS will do the right thing within a good margin of error, and on an underloaded system, even a naive scheduler will do the right thing. If you want to optimize SMP fairness in this range, that's great, but there's probably a lighter-weight way to do it.

2. High performance overhead
Two sources of overhead: (1) the global rw_lock, and (2) task migrations. I agree they can be problems on NUMA, but I'd argue they are not on SMPs. Any global lock can cause two performance problems: (1) serialization, and (2) excessive remote cache accesses and traffic. IMO (1) is not a problem since this is a rw_lock and a write_lock occurs infrequently only when all tasks in the system finish the current round. (2) could be a problem as every read/write lock causes an invalidation. It could be improved by using Nick's ticket lock. On the other hand, this is a single cache line and it's invalidated only when a CPU finishes all tasks in its local active RB tree, where each nice 0 task takes sysctl_base_round_slice (e.g., 30ms) to finish, so it looks to me the invalidations would be infrequent enough and could be noise in the whole system.

Task migrations don't bother me all that much. Since we're migrating the *overload*, I expect those processes to be fairly cache-cold whenever we get around to them anyway. It'd be nice to be SMT/multicore/NUMA-smart about the migrations, but that's an implementation detail.

The global lock is what really bothers me. It's not just that it'll suck in certain rare cases (though I think it will), but it's really more the top-down design. If we take a bottom-up approach, first keeping fairness between neighbors, it'll function identically on the low-end systems, but it'll be a huge win on the big systems, instead of a potential bottleneck.
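
To make the neighbor-based idea concrete, here is a toy Python simulation of the invariant quoted earlier in this thread (the load on CPU X and on CPU (X+1)%N differ by at most a small constant). It is only a sketch: loads are plain integers counting runnable tasks, one task moves per step, and the names `neighbor_invariant_holds` and `balance_ring` are made up for illustration. Nothing here reflects actual kernel code.

```python
def neighbor_invariant_holds(loads, max_diff):
    """True if every CPU X is within max_diff of its neighbor (X+1) % N."""
    n = len(loads)
    return all(abs(loads[x] - loads[(x + 1) % n]) <= max_diff for x in range(n))


def balance_ring(loads, max_diff=1, max_passes=10_000):
    """Move one task at a time between ring neighbors until the invariant holds.

    Assumptions: integer loads and max_diff >= 1, so each move strictly
    reduces the imbalance and the loop terminates; max_passes is a guard.
    """
    loads = list(loads)
    n = len(loads)
    for _ in range(max_passes):
        moved = False
        for x in range(n):
            y = (x + 1) % n
            if loads[x] - loads[y] > max_diff:
                loads[x] -= 1  # push one task to the lighter neighbor
                loads[y] += 1
                moved = True
            elif loads[y] - loads[x] > max_diff:
                loads[y] -= 1  # pull one task from the heavier neighbor
                loads[x] += 1
                moved = True
        if not moved:
            break
    return loads


balanced = balance_ring([8, 0, 0, 0], max_diff=1)
print(balanced, neighbor_invariant_holds(balanced, 1))
```

Each step touches only two adjacent CPUs, so no global lock is involved in this model. Note that on a ring the invariant by itself only bounds the difference between any two CPUs by max_diff * floor(N/2), so the constant would need to be chosen with the machine size in mind.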

The patch can introduce more task migrations. I don't think it's a problem in SMPs. For one, CFS already is doing context switches more frequently than before, and thus even if a task doesn't migrate, it may still miss in the local cache because the previous task kicked out its data. In this sense, migration doesn't add more cache misses. Now, in NUMA, the penalty of a cache miss can be much higher if we migrate off-node. Here, I agree that this can be a problem. For NUMA, I'd expect the user to tune sysctl_base_round_slice to a large enough value to avoid frequent migrations (or just affinitize tasks). Benchmarking could also help us determine a reasonable default sysctl_base_round_slice for NUMA.

The kernel should be tuned reasonably by default. The days in which NUMA implied a supercomputer tuned by specialists are over. NUMA is commodity now. As a simple heuristic, I suggest multiplying your default sysctl_base_round_slice by (log2(num_nodes)+1). This scales the migration cost proportionally on most simple topologies. People with gigantic torus-topology supercomputers who still bring in specialists to tune everything might still want to raise it.
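
The heuristic above works out as follows in a short Python sketch. The function name is hypothetical, and the 30 ms base is just the example value used elsewhere in this thread, not a measured default.

```python
import math


def scaled_round_slice_ms(base_round_slice_ms, num_nodes):
    """Scale the base round slice by (log2(num_nodes) + 1)."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return base_round_slice_ms * (math.log2(num_nodes) + 1)


# A single-node (plain SMP) box keeps the base value; larger boxes grow slowly.
for nodes in (1, 2, 4, 8, 64):
    print(nodes, scaled_round_slice_ms(30.0, nodes))
```

With a single node the multiplier is 1, so non-NUMA systems are unaffected, and doubling the node count adds one more base slice rather than doubling the value.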

3. Hard to enforce fairness when there are non-SCHED_FAIR tasks
All we want is to enforce fairness among the SCHED_FAIR tasks. Leveraging CFS, this patch makes sure relative CPU time of two SCHED_FAIR tasks in an SMP equals their weight ratio. In other words, if the entire SCHED_FAIR tasks are given X amount of CPU time, then a weight w task is guaranteed with w/X time in any interval of time.
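
One way to read the weight-ratio guarantee is sketched below in Python. It assumes that the intended share of a task with weight w is X * w / W, where W is the total weight of all SCHED_FAIR tasks; that reading, the function name, and the sample weights are assumptions made for illustration, not something the patch description spells out.

```python
def fair_shares(weights, total_cpu_time):
    """Split total_cpu_time among tasks in proportion to their weights.

    Assumption: share_i = total_cpu_time * w_i / sum(all weights).
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return {name: total_cpu_time * w / total_weight for name, w in weights.items()}


# Hypothetical tasks: "b" has three times the weight of "a".
shares = fair_shares({"a": 1.0, "b": 3.0}, total_cpu_time=100.0)
print(shares)
```

Under this reading, the ratio of CPU time between any two tasks equals the ratio of their weights, regardless of which CPUs they ran on.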

The problem is that there are various reasons why a particular CPU might not get to execute this codepath for a little while, and then either your round invariant is broken, or you have a performance-killing barrier. Relaxing your invariants and your locking would mitigate this.

Concerns aside, I agree that fairness is important, and I'd really like to see a test case that demonstrates the problem.


	-- Chris