Re: [patch] CFS scheduler, v3

William Lee Irwin III wrote:

On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
I have a suggestion I'd like to make that addresses both nice andfairness at the same time. As I understand the basic principle behindthis scheduler it to work out a time by which a task should make it ontothe CPU and then place it into an ordered list (based on this value) oftasks waiting for the CPU. I think that this is a great idea and mysuggestion is with regard to a method for working out this time thattakes into account both fairness and nice.
Hmm. Let me take a look...


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
First suppose we have the following metrics available in addition towhat's already provided.
rq->avg_weight_load /* a running average of the weighted load on the CPU */
p->avg_cpu_per_cycle /* the average time in nsecs that p spends on theCPU each scheduling cycle */
I'm suspicious of mean service times not paired with mean inter-arrival
times.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
where a scheduling cycle for a task starts when it is placed on thequeue after waking or being preempted and ends when it is taken off theCPU either voluntarily or after being preempted. Sop->avg_cpu_per_cycle is just the average amount of time p spends on theCPU each time it gets on to the CPU. Sorry for the long explanationhere but I just wanted to make sure there was no chance that "schedulingcycle" would be construed as some mechanism being imposed on the scheduler.)
I went and looked up priority queueing queueing theory garbage and
re-derived various things I needed. The basics check out. Probably no
one cares that I checked.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
We can then define:
effective_weighted_load = max(rq->raw_weighted_load, rq->avg_weighted_load)
If p is just waking (i.e. it's not on the queue and its load_weight isnot included in rq->raw_weighted_load) and we need to queue it, we saythat the maximum time (in all fairness) that p should have to wait toget onto the CPU is:expected_wait = p->avg_cpu_per_cycle * effective_weighted_load /p->load_weight

You're right. The time that the task spent sleeping before being wokenshould be subtracted from this value. If the answer is less than orequal to zero pre-emption should occur.

This doesn't look right, probably because the scaling factor of
p->avg_cpu_per_cycle is the reciprocal of its additive contribution to
the ->avg_weight_load as opposed to a direct estimate of its initial
delay or waiting time before completing its current requested service.

p->load_weight/effective_weighted_load more properly represents an
entitlement to CPU bandwidth.

Yes.  But expected_wait isn't entitlement it's its inverse.

	p->avg_cpu_per_cycle/(p->load_weight/effective_weighted_load)
would be more like the expected time spent on the runqueue

When I went to school that would be just another way of expressing theequation that I expressed.

(whether
waiting to run or actually running) for a time interval spent in a
runnable state and the expected time runnable and waiting to run in such
an interval would be
	p->avg_cpu_per_cycle*(1-effective_weighted_load/p->load_weight),

Neither represents the initial delay between entering the runqeueue and
first acquiring the CPU, but that's a bit hard to figure out without
deciding the scheduling policy up-front anyway.

This essentially doesn't look correct because while you want to enforce
the CPU bandwidth allocation, this doesn't have much to do with that
apart from the CPU bandwidth appearing as a term. It's more properly
a rate of service as opposed to a time at which anything should happen
or a number useful for predicting such. When service should begin more
properly depends on the other tasks in the system and a number of other
decisions that are part of the scheduling policy.

This model takes all of those into consideration. The idea is not justto predict but to use the calculated time to decide when to boot thecurrent process of the CPU (if it doesn't leave voluntarily) and putthis one on. This more or less removes the need to give each task apredetermined chunk of CPU when they go on to the CPU. This should, ingeneral, reduce the number context switches as tasks get to run untilthey've finished what they're doing or another task becomes higherpriority rather than being booted off after an arbitrary time interval.(If this ever gets tried it will be interesting to see if thisprediction comes true.)

BTW Even if Ingo doesn't choose to try this model, I'll probably make apatch (way in the future after Ingo's changes are settled) to try it outmyself.

If you want to choose a "quasi-inter-arrival time" to achieve the
specified CPU bandwidth allocation, this would be it, but to use that
to actually enforce the CPU bandwidth allocation, you would need to
take into account the genuine inter-arrival time to choose an actual
time for service to begin. In other words, this should be a quota for
the task to have waited. If it's not waited long enough, then it should
be delayed by the difference to achieve the inter-arrival time you're
trying to enforce. If it's waited longer, it should be executed
sooner modulo other constraints, and perhaps even credited for future
scheduling cycles.

The idea isn't to enforce the bandwidth entitlement to the extent ofthrottling tasks if they exceed their entitlement and there's no othertasks ready to use the CPU. This is mainly because the bandwidthentitlement isn't fixed -- it's changing constantly as the number andtype of runnable tasks changes.

In order to partially avoid underallocating CPU bandwidth to p, one
should track the time last spent sleeping and do the following:

Yes I made a mistake in omitting to take into account sleep interval.See another e-mail to Ingo correcting this problem.

	last_sleep = now - p->last_went_to_sleep;
	wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
	wait = wait > last_sleep ? wait - last_sleep : 0;

By and large the scheduler can deterministically choose waits on the
runqueue but it doesn't actually do that. Some additional corrections
for delays beyond those decided up-front while on the runqueue or
getting scheduled early on the runqueue may also help, though I'm not
as interested in them as I am the one for sleep:
	last_wait = now - p->last_taken_off_the_cpu;
	wait = p->avg_cpu_per_cycle*effective_weighted_load/p->load_weight;
	if (wait > last_wait) /* didn't wait long enough? */
		wait += wait - last_wait; /* wait the difference longer */
	else if (wait < last_wait) /* waited too long? */
		/* wait less to make up the difference */
		wait -= wait > last_wait - wait ? last_wait - wait : wait;
Where ->last_taken_off_the_cpu represents the time the task was last
taken off the CPU for whatever reason (this may need to be done
differently to handle time variables).

In order to do better, longer-term history is required,

The half life of the Kalman filter (roughly equivalent to a runningaverage) used to calculate the averages determines how much history istaken into account. It could be made configurable (at least, untilenough real life experience was available to decide on the best value touse).

but it can't be
a truly infinite history.

I agree.

To attempt to maintain an infinite history of
bandwidth underutilization to be credited too far in the future would
enable potentially long-term overutilization when such credits are
cashed en masse for a sustained period of time. At some point you have
to say "use it or lose it;" over a shorter period of time some smoothing
is still admissible and even desirable.

Yes, that's why I suggest a running average over the last few schedulingcycles for the task. But thinking about it some more I'm now not sosure. The lack of apparent "smoothness" when I've done this sort ofthing with raw rather than smooth data (in modifications to the currentdynamic priority based scheduler model) is usually noticed by runningtop and seeing wildly fluctuating dynamic priorities. I'm not sure thatthe actual responsiveness of the system reflects this. So I'm nowwilling to reserve my judgement on this issue.

One might also consider debits owing to non-preemptibility or other
sorts of cpu bandwidth overutilization with similar caveats. A half-
life to an otherwise infinite history of credits and/or debits
specified in absolute time may be appropriate, though it's not quite as
computationally cheap as the above. The accounting for these credits
and debits would take the place of ->last_taken_off_the_cpu above.

Credits shouldn't be necessary with this model if, instead of using theaverage "on CPU" time per cycle for a pre-empted task when requeuing it,you use the time period that it was "an CPU" before it was booted off asthis should compensate it correctly.

Another attack on the problem of CPU bandwidth misallocation would be
to further credit or debit the task according to the ratio of actual
CPU bandwidth to allocated CPU bandwidth in order to compensate
for prior failures to enforce the allocation.

The fact that the allocated CPU bandwidth is changing continuously makesthis not a good idea. On a system where allocations were made directlyrather being a side effect of "nice" and the number of runnableprocesses it might be the way to go.

Unfortunately, the result
of scaling the artificial inter-arrival time in the obvious way is
degenerate (independent of priority) so it can't be done that simply.
Perhaps it could be used solely to adjust the credit and debit history
contribution to the quasi-inter-arrival time or the difference used
somehow, but I don't care to make more complex proposals than the
first-order correction above (if it's even worthy of being called a
proposal or a correction) for either this or a time-decaying credit
and debit history or even debit accounting at all.

I'd be happy enough to see the correction term to subtract out sleeping
time (i.e. the code snippet with last_sleep).

Yes, this is THE important deficiency in my proposal.

The rest and/or stronger
attempts to factor in sleeping time or bandwidth misallocation I don't
consider so significant, at least not without some demonstration there
is CPU bandwidth misallocation happening.


On Fri, Apr 20, 2007 at 10:10:45AM +1000, Peter Williams wrote:
I appreciate that the notion of basing the expected wait on the task'saverage cpu use per scheduling cycle is counter intuitive but I believethat (if you think about it) you'll see that it actually makes sense.
I don't know. It sounds rather standard to me.

That's good to hear.

I also think that (as Ingo would probably prefer) that this mechanismmight work without using averages i.e. just use the last "on CPU" time.If it does work it would have the advantage of being impervious to thepossible bad effects of CPU speed changes on those systems where this isa feature. When using averages, I think that special considerationwould need to be made for variable cpu speeds -- probably not thatdifficult but additional overhead anyway.

I've been giving some thought into a mechanism for pre-empting thecurrent task when another task's expected "on CPU" time arrives. Ithink that it would be sufficient for scheduler_tick() to compare thecurrent time (via sched_clock() or whatever) with the "on CPU" for thenext task on the queue and initiation pre-emption if it's later thanthat time. This increases the granularity a little but is compensatedfor by the fact it will help with the situation where more than one taskon the queue has the same expected "on CPU" time in that each of themwould get at least a tick (if they needed it).

If some form of precise timer was used (instead) to trigger pre-emptionthen, where there is more than one task with the same expected "on CPU"time, only the last would get any CPU (and that's only the case if someinfinite loop doesn't get created).

Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: [patch] CFS scheduler, v3
  - From: William Lee Irwin III <wli@holomorphy.com>

References:
- [patch] CFS scheduler, v3
  - From: Ingo Molnar <mingo@elte.hu>
- Re: [patch] CFS scheduler, v3
  - From: Peter Williams <pwil3058@bigpond.net.au>
- Re: [patch] CFS scheduler, v3
  - From: William Lee Irwin III <wli@holomorphy.com>

Prev by Date: [RFC PATCH - Try #2] Re: BUG in sysfs_remove_group
Next by Date: Re: [PATCH] fix ext2 allocator overflows above 31 bit blocks
Previous by thread: Re: [patch] CFS scheduler, v3
Next by thread: Re: [patch] CFS scheduler, v3
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]