Re: [patch 2/3] mutex subsystem: fastpath inlining

Sorry for the delay...

On Thu, 29 Dec 2005, Ingo Molnar wrote:

> 
> * Nicolas Pitre <[email protected]> wrote:
> 
> > This is with all mutex patches applied and CONFIG_DEBUG_MUTEX_FULL=n, 
> > therefore using the current semaphore code:
> > 
> >    text	   data	    bss	    dec	    hex	filename
> > 1821108	 287792	  88264	2197164	 2186ac	vmlinux
> > 
> > Now with CONFIG_DEBUG_MUTEX_FULL=y to substitute semaphores with 
> > mutexes:
> > 
> >    text	   data	    bss	    dec	    hex	filename
> > 1797108	 287568	  88172	2172848	 2127b0	vmlinux
> > 
> > Finally with CONFIG_DEBUG_MUTEX_FULL=y and fast paths inlined:
> > 
> >    text	   data	    bss	    dec	    hex	filename
> > 1807824	 287136	  88172	2183132	 214fdc	vmlinux
> > 
> > This last case is not the smallest, but it is the fastest.
> 
> i.e. 1.3% text savings from going to mutexes, and inlining them again 
> gives up 0.5% of that. We've uninlined stuff for a smaller gain in the 
> past ...
> 
> > > Note that x86 went to a non-inlined fastpath _despite_ 
> > > having a compact CISC semaphore fastpath.
> > 
> > The function call overhead on x86 is less significant than the ARM 
> > one, so always calling out of line code might be sensible in that 
> > case.
> 
> i'm highly doubtful we should do that. The spinlock APIs are 4 times 
> more frequent than mutexes are ever going to be, still they too are 
> mostly out of line. (and we only inline the unlock portions that are a 
> space win!) Can you measure any significant difference in performance?  
> (e.g. lat_pipe triggers the mutex fastpath, in DEBUG_MUTEX_FULL=y mode) 

Here it is.

With the default non-inlined fast path:

Pipe latency: 14.2669 microseconds
Pipe latency: 14.2625 microseconds
Pipe latency: 14.2655 microseconds
Pipe latency: 14.2670 microseconds

Then, with fast paths inlined:

Pipe latency: 13.9483 microseconds
Pipe latency: 13.9409 microseconds
Pipe latency: 13.9468 microseconds
Pipe latency: 13.9529 microseconds

So inlining the mutex fast path makes the whole test more than 2% 
(roughly 2.2%) faster.

Is that worth the 0.5% increase in kernel size in your opinion?
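
For reference, the pattern being compared is roughly the following (a 
sketch using C11 atomics rather than the kernel's primitives, with 
made-up demo_* names, not the actual mutex code): the uncontended case 
is a single atomic decrement, and only contention calls out of line 
into the slowpath.  Inlining the fastpath puts that one atomic op at 
every call site, while the out-of-line variant turns every lock into a 
function call first.

#include <stdatomic.h>

struct demo_mutex {
	atomic_int count;		/* 1 = unlocked, <= 0 = locked */
};

/* contended case, kept out of line in both variants */
void demo_mutex_lock_slowpath(struct demo_mutex *m);

/* inlined fastpath: one atomic decrement at the call site */
static inline void demo_mutex_lock(struct demo_mutex *m)
{
	if (atomic_fetch_sub(&m->count, 1) <= 0)
		demo_mutex_lock_slowpath(m);
}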

Note: I modified lat_pipe to use two threads instead of two processes, 
since on ARM switching the mm requires a full cache flush due to the 
VIVT cache; doing so sends the pipe latency up to about 480 
microseconds, with the mutex difference lost in the noise.
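
In case it helps, the change amounts to something like the sketch below 
(illustrative only, not the actual lmbench lat_pipe code; the names and 
iteration count are arbitrary): two threads of the same process bounce 
a byte back and forth over a pair of pipes, so no mm switch, and hence 
no VIVT cache flush, happens between the two ends.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <time.h>

#define ITERATIONS 100000

static int ping[2], pong[2];		/* one pipe per direction */

static void *echo_thread(void *arg)
{
	char c;

	/* bounce every byte straight back */
	while (read(ping[0], &c, 1) == 1)
		if (write(pong[1], &c, 1) != 1)
			break;
	return NULL;
}

int main(void)
{
	pthread_t tid;
	struct timespec t0, t1;
	char c = 0;
	double us;
	int i;

	if (pipe(ping) || pipe(pong)) {
		perror("pipe");
		return 1;
	}
	pthread_create(&tid, NULL, echo_thread, NULL);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ITERATIONS; i++) {
		if (write(ping[1], &c, 1) != 1 ||
		    read(pong[0], &c, 1) != 1)
			return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* report the average round-trip time per iteration */
	us = (t1.tv_sec - t0.tv_sec) * 1e6 +
	     (t1.tv_nsec - t0.tv_nsec) / 1e3;
	printf("Pipe latency: %.4f microseconds\n", us / ITERATIONS);

	close(ping[1]);			/* let the echo thread exit */
	pthread_join(tid, NULL);
	return 0;
}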


Nicolas
