Re: [RFC] (How to) Let idle CPUs sleep

[ This is a continuation of the thread at : http://lkml.org/lkml/2005/5/7/98 ]

On Sun, May 08, 2005 at 02:14:23PM +1000, Nick Piggin wrote:
> Rusty Russell wrote:
> >On Sat, 2005-05-07 at 23:57 +0530, Srivatsa Vaddagiri wrote:
> >
> >>Two solutions have been proposed so far:
> >>
> >>	A. As per Nick's suggestion, impose a max limit (say some 100 ms or
> >>	   say a second, Nick?) on how long a idle CPU can avoid taking
> 
> Yeah probably something around that order of magnitude. I suspect
> there will fast be a point where either you'll get other timers
> going off more frequently, and / or you simply get very quickly
> diminishing returns on the amount of power saving gained from
> increasing the period.

I did some measurements of the average interval that an idle CPU is allowed to
sleep before it wakes up. I also have some numbers from Xen comparing performance
between the two approaches (100 ms limit vs. no limit on the sleep interval).

First, the avg sleep interval. 

Numbers were obtained on this configuration:

	Machine  : 8-way Intel (P3) box
	Distro   : RHEL3 
	Kernel	 : 2.6.12 

Each time a CPU became idle, next_timer_interrupt() was called to find the next
nearest timer (sleep_time_req), and the CPU slept until either that time elapsed
or it was woken by an interrupt (external/device, reschedule, invalidate or
call_function). Whenever either of these happened, the actual time slept, in
jiffies (sleep_time_real), was recorded. To measure the effect that internal
system events (reschedule/invalidate/call-function IPIs) alone have on the
average sleep interval, I redirected all external interrupts to CPU 0.
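
A rough sketch of the instrumentation is below. This is illustrative only: the
sleep_stats structure and the idle_sleep_measure() hook are my own simplification
of what was measured, not the actual patch; next_timer_interrupt(), safe_halt(),
jiffies and the per-cpu helpers are real 2.6.12 interfaces.

/*
 * Sketch of the idle-loop instrumentation described above.
 * idle_sleep_measure() would be called from the idle loop each time
 * this CPU is about to sleep.
 */
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/timer.h>
#include <linux/jiffies.h>
#include <asm/system.h>		/* safe_halt() */

struct sleep_stats {
	unsigned long samples;
	unsigned long total;	/* sum of sleep_time_real, in jiffies */
	unsigned long max;
	unsigned long min;
	unsigned long last_req;	/* most recent sleep_time_req */
};

static DEFINE_PER_CPU(struct sleep_stats, sleep_stats);

static void idle_sleep_measure(void)
{
	struct sleep_stats *s = &__get_cpu_var(sleep_stats);
	unsigned long start, sleep_time_req, sleep_time_real;

	start = jiffies;
	/* nearest pending timer, i.e. how long we could sleep */
	sleep_time_req = next_timer_interrupt() - start;
	s->last_req = sleep_time_req;

	safe_halt();		/* sleep until a timer or an interrupt/IPI */

	sleep_time_real = jiffies - start;
	s->samples++;
	s->total += sleep_time_real;
	if (sleep_time_real > s->max)
		s->max = sleep_time_real;
	if (s->samples == 1 || sleep_time_real < s->min)
		s->min = sleep_time_real;
}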

Also, the numbers were obtained at init runlevel 3 with the system mostly idle.
I basically logged in and ran the test to get the numbers. Apart from this
test, the only other workloads active were the system daemons started at
runlevel 3. The test cleared various per-cpu counters (used to capture the
sleep_time_real numbers), slept for 2 minutes and read the counters at the end.

Below are the numbers I obtained from the test:


CPU#   Samples        Mean     Std Dev          Max         Min    (sleep_time_real, in jiffies)

1:         113    1088.690     761.352     2001.000       2.000
2:          76    1633.197     640.230     2002.000      94.000
3:          80    1552.300     682.050     2003.000       1.000
4:          71    1735.606     531.744     2004.000     165.000
5:        2525      48.669       6.391       50.000       2.000
6:          74    1652.905     601.908     2006.000      97.000
7:        1783      68.915      33.495      100.000       0.000

[ CPU0 is excluded because its avg sleep interval is just 1 jiffy -
  the external timer interrupt (which updates jiffies) keeps waking it every jiffy ]

The mean above is of sleep_time_real, i.e. the real time a CPU
was allowed to sleep before something woke it. The best mean
was for CPU4 (1735.606 jiffies, or ~1.7 sec). Also note that
the maximum time it actually managed to sleep was ~2 sec.

Digging further revealed that this maximum is imposed by various timers
the kernel uses, mostly the slab allocator's reap timer (which requests
a timer every ~2 sec on every CPU) and the machine-check timer (MCE_RATE
in arch/i386/kernel/cpu/mcheck/non-fatal.c).

Just to see the effect of increasing the timeouts of these timers,
I changed MCE_RATE to 60*HZ (from 15*HZ), REAPTIMEOUT_CPUC to 20*HZ (from 2*HZ),
and REAPTIMEOUT_LIST3 to 40*HZ (from 4*HZ).
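
For reference, the experiment amounted to nothing more than bumping three
constants (values as used in my test only, not a proposed change; the originals
live in arch/i386/kernel/cpu/mcheck/non-fatal.c and mm/slab.c):

/* arch/i386/kernel/cpu/mcheck/non-fatal.c */
#define MCE_RATE		(60*HZ)		/* was 15*HZ */

/* mm/slab.c */
#define REAPTIMEOUT_CPUC	(20*HZ)		/* was 2*HZ */
#define REAPTIMEOUT_LIST3	(40*HZ)		/* was 4*HZ */
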
With this change and rerunning my test under the same conditions, the new
numbers obtained for sleep_time_real are:



CPU#   Samples        Mean     Std Dev          Max         Min    (sleep_time_real, in jiffies)

1:          24    5229.833    7341.283    20001.000       4.000
2:        2520      49.797       2.745       50.000       3.000
3:           8   15002.250    8575.149    20003.000    1127.000
4:           8   15002.875    8557.532    20004.000    1156.000
5:           8   15003.750    8537.711    20005.000    1189.000
6:         996     125.923    1209.668    30000.000       2.000
7:         134     940.269     176.835     1060.000     310.000


Note that the best mean sleep_time_real is now ~15000 jiffies (~15 sec,
CPUs 3-5), with a maximum of ~20 sec.

Now, I am not saying that we can blindly increase these timeouts
(REAPTIMEOUT_CPUC/REAPTIMEOUT_LIST3/MCE_RATE). But I do think we can
make these timers idle-cpu aware. For example, if the slab timer notices
that it is running on an idle CPU and that the number of slabs it actually
reaps has dropped to almost nothing, it can request a longer timeout
instead of firing every ~2 sec the way it does now.
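
As a minimal sketch of the idea (the nr_reaped bookkeeping, the
do_slab_reaping() helper and the IDLE_REAPTIMEOUT value are hypothetical;
the real cache_reap() in mm/slab.c does the actual cache walk and rearms
itself in much the same way):

#include <linux/workqueue.h>
#include <linux/percpu.h>
#include <linux/sched.h>	/* idle_cpu() */
#include <linux/smp.h>

#define REAPTIMEOUT_CPUC	(2*HZ)	/* current short period */
#define IDLE_REAPTIMEOUT	(20*HZ)	/* hypothetical long period */

static DEFINE_PER_CPU(struct work_struct, reap_work);

static int do_slab_reaping(void);	/* stand-in for the real reap walk;
					   returns how much was reaped */

static void cache_reap(void *unused)
{
	int cpu = smp_processor_id();
	unsigned long delay = REAPTIMEOUT_CPUC;
	int nr_reaped = do_slab_reaping();

	/*
	 * If this CPU is otherwise idle and the reap did almost no work,
	 * back off and request a much longer timeout so that the CPU can
	 * sleep undisturbed.
	 */
	if (idle_cpu(cpu) && nr_reaped == 0)
		delay = IDLE_REAPTIMEOUT;

	schedule_delayed_work(&__get_cpu_var(reap_work), delay + cpu);
}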

It is probably good for virtualization/power-management environments that we
identify such short timeouts requested in the kernel and make them idle-cpu
aware to the extent possible. In that case, restricting the maximum sleep time
to 100 ms is, IMO, not going to be efficient.

In order to see the benefit of sleeping longer (rather than waking up every
100 ms), I ran some benchmarks on Xen. Basically, I measured the throughput
of a CPU-intensive benchmark (one which finds the first 30 random prime
numbers). The time taken to find the 30 primes was used to compare the two
approaches.
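
For completeness, something along these lines (my own stand-in, not the exact
program used, and the prime count here is arbitrary) is enough to keep one
virtual CPU busy and give a repeatable elapsed time:

/* Stand-in for the CPU-bound benchmark: find the first N primes by
 * trial division and report the elapsed wall-clock time. */
#include <stdio.h>
#include <time.h>

static int is_prime(unsigned long n)
{
	unsigned long d;

	if (n < 2)
		return 0;
	for (d = 2; d * d <= n; d++)
		if (n % d == 0)
			return 0;
	return 1;
}

int main(void)
{
	unsigned long n = 1, found = 0, target = 300000;	/* arbitrary */
	time_t start = time(NULL);

	while (found < target)
		if (is_prime(++n))
			found++;

	printf("found %lu primes in %ld sec\n",
	       found, (long)(time(NULL) - start));
	return 0;
}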

Configuration used was:

	Machine 	: 4-way Intel Xeon box (HT disabled in BIOS)
	Distro 		: RHEL4
	Kernel		: Xen-unstable kernel (based on 2.6.11) for both 
			  Domain 0 and guest domains
	
Xen-unstable was used since it provides SMP guest support. Additional 4-way 
domains were started using the ttylinux distro 
(http://easynews.dl.sourceforge.net/sourceforge/xen/ttylinux-xen.bz2).

The prime-number benchmark was run in one of the domains while the other
domains were idle. The throughput of the benchmark was measured once with
a total of 30 4-way domains started and again with 100 4-way domains started.
In each case, to keep the virtual-cpu/physical-cpu ratio the same, I pinned
each virtual CPU of each domain to the corresponding physical CPU.
For example, virtual CPU 0 of domains 1-30 was pinned to physical CPU 0,
virtual CPU 1 of domains 1-30 was pinned to physical CPU 1, and so on.

Case A : No-limit-on-sleep-interval

				Avg of 10 runs		Std Dev

30 domains started		28.55 sec		0.009
100 domains started		28.835 sec		0.011

Note that in going from 30->100 domains, the performance deteriorated by
only ~1%.


Case B : 100 ms limit on sleep interval. Since the Xen guest domain kernel
	 follows case A by default, it was modified to restrict the sleep
	 interval to 100 ms.


				Avg of 10 runs		Std Dev

30 domains started		28.867 sec		0.010
100 domains started		29.993 sec		0.009

Note that in going from 30->100 domains, there is a deterioration of 3.9%.

Comparing the two approaches,

			No limit	100 ms limit	Difference

	30 domains	28.55 sec	28.867 sec	1.1%
	100 domains	28.835 sec	29.993 sec	4.0%

The difference grows with the degree of virtualization (number of domains).
I have heard that the number of partitions can be much higher (>300?) on
some high-end zSeries machines.


So those are the numbers. To conclude:

	- Currently, in the 2.6.12 kernel, the average real sleep time obtained
	  is ~1.7 sec.
	- We can (and IMO we must) look at the short timers in use by the
	  kernel and potentially increase this average real sleep time
	  further (at least for idle CPUs). This means that waking up every
	  100 ms would be inefficient.
	- Even if we decide to go for the simple approach, we should increase
	  the sleep-time limit to a few seconds. I don't know what that means
	  for load balancing.



-- 


Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
