Re: [patch] CFS scheduler, v3

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



William Lee Irwin III wrote:
On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
I have a suggestion I'd like to make that addresses both nice and fairness at the same time. As I understand the basic principle behind this scheduler it to work out a time by which a task should make it onto the CPU and then place it into an ordered list (based on this value) of tasks waiting for the CPU. I think that this is a great idea and my suggestion is with regard to a method for working out this time that takes into account both fairness and nice.

Hmm. Let me take a look...


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
First suppose we have the following metrics available in addition to what's already provided.
rq->avg_weight_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on the CPU each scheduling cycle */

I'm suspicious of mean service times not paired with mean inter-arrival
times.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
where a scheduling cycle for a task starts when it is placed on the queue after waking or being preempted and ends when it is taken off the CPU either voluntarily or after being preempted. So p->avg_cpu_per_cycle is just the average amount of time p spends on the CPU each time it gets on to the CPU. Sorry for the long explanation here but I just wanted to make sure there was no chance that "scheduling cycle" would be construed as some mechanism being imposed on the scheduler.)

I went and looked up priority queueing queueing theory garbage and
re-derived various things I needed. The basics check out. Probably no
one cares that I checked.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
We can then define:
effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
If p is just waking (i.e. it's not on the queue and its load_weight is not included in rq->raw_weighted_load) and we need to queue it, we say that the maximum time (in all fairness) that p should have to wait to get onto the CPU is: expected_wait = p->avg_cpu_per_cycle * effective_weighted_load / p->load_weight

You're right. The time that the task spent sleeping before being woken should be subtracted from this value. If the answer is less than or equal to zero pre-emption should occur.


This doesn't look right, probably because the scaling factor of
p->avg_cpu_per_cycle is the reciprocal of its additive contribution to
the ->avg_weight_load as opposed to a direct estimate of its initial
delay or waiting time before completing its current requested service.

p->load_weight/effective_weighted_load more properly represents an
entitlement to CPU bandwidth.

Yes.  But expected_wait isn't entitlement it's its inverse.

	p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load)
would be more like the expected time spent on the runqueue

When I went to school that would be just another way of expressing the equation that I expressed.

(whether
waiting to run or actually running) for a time interval spent in a
runnable state and the expected time runnable and waiting to run in such
an interval would be
	p->avg_cpu_per_cycle*(1-effective_weighted_load/p->load_weight),

Neither represents the initial delay between entering the runqeueue and
first acquiring the CPU, but that's a bit hard to figure out without
deciding the scheduling policy up-front anyway.

This essentially doesn't look correct because while you want to enforce
the CPU bandwidth allocation, this doesn't have much to do with that
apart from the CPU bandwidth appearing as a term. It's more properly
a rate of service as opposed to a time at which anything should happen
or a number useful for predicting such. When service should begin more
properly depends on the other tasks in the system and a number of other
decisions that are part of the scheduling policy.

This model takes all of those into consideration. The idea is not just to predict but to use the calculated time to decide when to boot the current process of the CPU (if it doesn't leave voluntarily) and put this one on. This more or less removes the need to give each task a predetermined chunk of CPU when they go on to the CPU. This should, in general, reduce the number context switches as tasks get to run until they've finished what they're doing or another task becomes higher priority rather than being booted off after an arbitrary time interval. (If this ever gets tried it will be interesting to see if this prediction comes true.)

BTW Even if Ingo doesn't choose to try this model, I'll probably make a patch (way in the future after Ingo's changes are settled) to try it out myself.


If you want to choose a "quasi-inter-arrival time" to achieve the
specified CPU bandwidth allocation, this would be it, but to use that
to actually enforce the CPU bandwidth allocation, you would need to
take into account the genuine inter-arrival time to choose an actual
time for service to begin. In other words, this should be a quota for
the task to have waited. If it's not waited long enough, then it should
be delayed by the difference to achieve the inter-arrival time you're
trying to enforce. If it's waited longer, it should be executed
sooner modulo other constraints, and perhaps even credited for future
scheduling cycles.

The idea isn't to enforce the bandwidth entitlement to the extent of throttling tasks if they exceed their entitlement and there's no other tasks ready to use the CPU. This is mainly because the bandwidth entitlement isn't fixed -- it's changing constantly as the number and type of runnable tasks changes.


In order to partially avoid underallocating CPU bandwidth to p, one
should track the time last spent sleeping and do the following:

Yes I made a mistake in omitting to take into account sleep interval. See another e-mail to Ingo correcting this problem.

	last_sleep = now - p->last_went_to_sleep;
	wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
	wait = wait > last_sleep ? wait - last_sleep : 0;

By and large the scheduler can deterministically choose waits on the
runqueue but it doesn't actually do that. Some additional corrections
for delays beyond those decided up-front while on the runqueue or
getting scheduled early on the runqueue may also help, though I'm not
as interested in them as I am the one for sleep:
	last_wait = now - p->last_taken_off_the_cpu;
	wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
	if (wait > last_wait) /* didn't wait long enough? */
		wait += wait - last_wait; /* wait the difference longer */
	else if (wait < last_wait) /* waited too long? */
		/* wait less to make up the difference */
		wait -= wait > last_wait - wait ? last_wait - wait : wait;
Where ->last_taken_off_the_cpu represents the time the task was last
taken off the CPU for whatever reason (this may need to be done
differently to handle time variables).

In order to do better, longer-term history is required,

The half life of the Kalman filter (roughly equivalent to a running average) used to calculate the averages determines how much history is taken into account. It could be made configurable (at least, until enough real life experience was available to decide on the best value to use).

but it can't be
a truly infinite history.

I agree.

To attempt to maintain an infinite history of
bandwidth underutilization to be credited too far in the future would
enable potentially long-term overutilization when such credits are
cashed en masse for a sustained period of time. At some point you have
to say "use it or lose it;" over a shorter period of time some smoothing
is still admissible and even desirable.

Yes, that's why I suggest a running average over the last few scheduling cycles for the task. But thinking about it some more I'm now not so sure. The lack of apparent "smoothness" when I've done this sort of thing with raw rather than smooth data (in modifications to the current dynamic priority based scheduler model) is usually noticed by running top and seeing wildly fluctuating dynamic priorities. I'm not sure that the actual responsiveness of the system reflects this. So I'm now willing to reserve my judgement on this issue.


One might also consider debits owing to non-preemptibility or other
sorts of cpu bandwidth overutilization with similar caveats. A half-
life to an otherwise infinite history of credits and/or debits
specified in absolute time may be appropriate, though it's not quite as
computationally cheap as the above. The accounting for these credits
and debits would take the place of ->last_taken_off_the_cpu above.

Credits shouldn't be necessary with this model if, instead of using the average "on CPU" time per cycle for a pre-empted task when requeuing it, you use the time period that it was "an CPU" before it was booted off as this should compensate it correctly.


Another attack on the problem of CPU bandwidth misallocation would be
to further credit or debit the task according to the ratio of actual
CPU bandwidth to allocated CPU bandwidth in order to compensate
for prior failures to enforce the allocation.

The fact that the allocated CPU bandwidth is changing continuously makes this not a good idea. On a system where allocations were made directly rather being a side effect of "nice" and the number of runnable processes it might be the way to go.

Unfortunately, the result
of scaling the artificial inter-arrival time in the obvious way is
degenerate (independent of priority) so it can't be done that simply.
Perhaps it could be used solely to adjust the credit and debit history
contribution to the quasi-inter-arrival time or the difference used
somehow, but I don't care to make more complex proposals than the
first-order correction above (if it's even worthy of being called a
proposal or a correction) for either this or a time-decaying credit
and debit history or even debit accounting at all.

I'd be happy enough to see the correction term to subtract out sleeping
time (i.e. the code snippet with last_sleep).

Yes, this is THE important deficiency in my proposal.

The rest and/or stronger
attempts to factor in sleeping time or bandwidth misallocation I don't
consider so significant, at least not without some demonstration there
is CPU bandwidth misallocation happening.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
I appreciate that the notion of basing the expected wait on the task's average cpu use per scheduling cycle is counter intuitive but I believe that (if you think about it) you'll see that it actually makes sense.

I don't know. It sounds rather standard to me.

That's good to hear.

I also think that (as Ingo would probably prefer) that this mechanism might work without using averages i.e. just use the last "on CPU" time. If it does work it would have the advantage of being impervious to the possible bad effects of CPU speed changes on those systems where this is a feature. When using averages, I think that special consideration would need to be made for variable cpu speeds -- probably not that difficult but additional overhead anyway.

I've been giving some thought into a mechanism for pre-empting the current task when another task's expected "on CPU" time arrives. I think that it would be sufficient for scheduler_tick() to compare the current time (via sched_clock() or whatever) with the "on CPU" for the next task on the queue and initiation pre-emption if it's later than that time. This increases the granularity a little but is compensated for by the fact it will help with the situation where more than one task on the queue has the same expected "on CPU" time in that each of them would get at least a tick (if they needed it).

If some form of precise timer was used (instead) to trigger pre-emption then, where there is more than one task with the same expected "on CPU" time, only the last would get any CPU (and that's only the case if some infinite loop doesn't get created).

Peter
--
Peter Williams                                   [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux