Re: 2.6.19-rt14 slowdown compared to 2.6.19

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



* Chen, Tim C <[email protected]> wrote:

> Ingo,
>  
> We did some benchmarking on 2.6.19-rt14, compared it with 2.6.19 
> kernel and noticed several slowdowns.  The test machine is a 2 socket 
> woodcrest machine with your default configuration.

cool - thanks for the feedback! Running the 64-bit kernel, right?

> Netperf TCP Streaming was slower by 40% ( 1 server and 1 client each 
> bound to separate cpu cores on different socket, network loopback mode 
> was used).
> 
> Volanomark was slower by 80% (Server and Clients communicate with 
> network loopback mode. Idle time goes from 1% to 60%)
> 
> Re-Aim7 was slower by 40% (idle time goes from 0% to 20%)
> 
> Wonder if you have any suggestions on what could cause the slowdown.  
> We've tried disabling CONFIG_NO_HZ and it didn't help much.

while some slowdown is to be expected, did in each case idle time 
increase significantly? If yes then this is the effect of lock 
contention. Lock contention effects are 'magnified' by PREEMPT_RT. For 
example if you run 128 threads workload that all use the same lock then 
the -rt kernel can act as if it were a 128-way box (!). This way by 
running -rt you'll see scalability problems alot sooner than on real 
hardware. In other words: PREEMPT_RT in essence simulates the 
scalability behavior of up to an infinite amount of CPUs. (with the 
exception of cachemiss emulation ;) [the effect is not this precise, but 
that's the rough trend]

the good news: any fix for such problems in -rt is useful for the 
upstream kernel too, because it improves scalability on real hardware as 
well. The bad news: this work hasnt been finished yet ;-)

We've fixed a couple of contention points discovered by -rt this way, in 
particular Peter's excellent scalability patches at:

  http://programming.kicks-ass.net/kernel-patches/

were done based on lock-contention profiling feedback. I've included the 
lockless pagecache and the s_files scalability patches which were very 
useful already. The lru_cache_add one is the next one for me to try, 
plus on i686 -rt kernels highmem locking hurts, hence the 
highmem_rewrite.patch (not in -rt yet either). Plus there's the really 
exciting concurrent-pagecache patch-queue from Peter, which will be 
interesting to try too.

If you'd like to profile this yourself then the lowest-cost way of 
profiling lock contention on -rt is to use the yum kernel and run the 
attached trace-it-lock-prof.c code on the box while your workload is in 
'steady state' (and is showing those extended idle times):

  ./trace-it-lock-prof > trace.txt

this captures up to 1 second worth of system activity, on the current 
CPU. Then you can construct the histogram via:

  grep -A 1 ' __schedule()<-' trace.txt | cut -d: -f2- | sort |
                                  uniq -c | sort -n > prof.txt

i just did this for "hackbench 20", and it gave me:

     38  kfree()<-skb_release_data()<-kfree_skbmem()<-(-1)()
     40  kmem_cache_free()<-kfree_skbmem()<-__kfree_skb()<-(-1)()
     47  __schedule()<-preempt_schedule_irq()<-retint_kernel()<-__spin_unlock_irqrestore()
    173  rt_read_lock()<-sock_def_readable()<-unix_stream_sendmsg()<-(-1)()
    173  __schedule()<-schedule()<-rt_spin_lock_slowlock()<-__lock_text_start()
    274  __schedule()<-schedule()<-posix_cpu_timers_thread()<-kthread()
    277  __schedule()<-schedule()<-ksoftirqd()<-kthread()
    365  kmem_cache_alloc_node()<-__alloc_skb()<-sock_alloc_send_skb()<-(-1)()
    366  kmem_cache_alloc_node()<-__kmalloc_node()<-__alloc_skb()<-(-1)()
    814  __schedule()<-schedule()<-rt_spin_lock_slowlock()<-rt_spin_lock()
   1819  __schedule()<-schedule()<-int_careful()<-<ffffe405>()
   2346  __schedule()<-schedule()<-schedule_timeout()<-unix_stream_recvmsg()
   2346  sock_aio_read()<-do_sync_read()<-vfs_read()<-(-1)()

(note the output mixes the two lines of symbolic stack backtrace entries 
that are generated by the tracer, so the last two lines for example are 
for the same contention/preemption point.)

this means that the biggest contention points (or natural rescheduling 
points) in this workload are sock_aio_read()->unix_stream_recvmsg() - 
not unexpected. But there's also ~10% of contention in the kmalloc code, 
meaning that for this workload the per-CPU slabs are probably set a bit 
too low. There's also ~5% of contention from sock_def_readable()'s use 
of an rwlock.

with more CPUs and more tasks you'll see these effects in a more drastic 
way.

	Ingo
/*
 * Copyright (C) 2005, Ingo Molnar <[email protected]>
 *
 * user-triggered tracing.
 *
 * The -rt kernel has a built-in kernel tracer, which will trace
 * all kernel function calls (and a couple of special events as well),
 * by using a build-time gcc feature that instruments all kernel
 * functions.
 * 
 * The tracer is highly automated for a number of latency tracing purposes,
 * but it can also be switched into 'user-triggered' mode, which is a
 * half-automatic tracing mode where userspace apps start and stop the
 * tracer. This file shows a dumb example how to turn user-triggered
 * tracing on, and how to start/stop tracing. Note that if you do
 * multiple start/stop sequences, the kernel will do a maximum search
 * over their latencies, and will keep the trace of the largest latency
 * in /proc/latency_trace. The maximums are also reported to the kernel
 * log. (but can also be read from /proc/sys/kernel/preempt_max_latency)
 *
 * For the tracer to be activated, turn on CONFIG_WAKEUP_TIMING and
 * CONFIG_LATENCY_TRACE in the .config, rebuild the kernel and boot
 * into it. Note that the tracer can have significant runtime overhead,
 * so you dont want to use it for performance testing :)
 */

#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <linux/unistd.h>

int main (int argc, char **argv)
{
	int ret;

	if (getuid() != 0) {
		fprintf(stderr, "needs to run as root.\n");
		exit(1);
	}
	ret = system("cat /proc/sys/kernel/mcount_enabled >/dev/null 2>/dev/null");
	if (ret) {
		fprintf(stderr, "CONFIG_LATENCY_TRACING not enabled?\n");
		exit(1);
	}
	system("echo 1 > /proc/sys/kernel/trace_enabled");
	system("echo 0 > /proc/sys/kernel/trace_freerunning");
	system("echo 0 > /proc/sys/kernel/trace_print_at_crash");
	system("echo 1 > /proc/sys/kernel/trace_user_triggered");
	system("echo 1 > /proc/sys/kernel/stackframe_tracing");
	system("echo 0 > /proc/sys/kernel/trace_verbose");
	system("echo 0 > /proc/sys/kernel/preempt_max_latency");
	system("echo 0 > /proc/sys/kernel/preempt_thresh");
	system("[ -e /proc/sys/kernel/wakeup_timing ] && echo 1 > /proc/sys/kernel/wakeup_timing");
	system("echo 0 > /proc/sys/kernel/mcount_enabled");

	// start tracing
	if (prctl(0, 1)) {
		fprintf(stderr, "trace-it: couldnt start tracing!\n");
		return 1;
	}
	usleep(1000000);
	if (prctl(0, 0)) {
		fprintf(stderr, "trace-it: couldnt start tracing!\n");
		return 1;
	}

	system("cat /proc/latency_trace");

	return 0;
}



[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux