Re: [patch 00/10] [RFC] SLUB patches for more functionality, performance and maintenance

* Christoph Lameter (clameter@sgi.com) wrote:
> Ok here is a replacement patch for the cmpxchg patch. Problems
> 
> 1. cmpxchg_local is not available on all arches. If we wanted to do
>    this then it needs to be universally available.
> 

cmpxchg_local is not available on all archs, but local_cmpxchg is. It
expects a local_t, which is nothing more than a long. When a local
atomic operation is not more efficient, or is not implemented, on a
given architecture, asm-generic/local.h falls back on atomic_long_t. If
you want, you could work on a local_t and cast its long value to a
pointer where needed, since their sizes are, AFAIK, always the same (and
some VM code even assumes this is always the case).
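
To make the cast concrete, here is a minimal userspace sketch of a
freelist whose head is kept in a long and updated with a
compare-and-exchange. It is only an illustration under stated
assumptions: struct object, freelist_head, freelist_push() and
freelist_pop() are hypothetical names, and C11 atomic_long stands in for
local_t. On x86 the C11 operation still emits a LOCKed cmpxchg, so this
shows the logic and not the cost. It also ignores the case where an
interrupt pops and re-pushes the same object between the load and the
exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: long and pointers have the same size, as discussed above. */
static_assert(sizeof(long) == sizeof(void *), "long must be pointer-sized");

/* Hypothetical free object: the first word links to the next free object. */
struct object {
	struct object *next;
};

/* Hypothetical freelist head, stored as a long (stand-in for local_t). */
static atomic_long freelist_head;

/* Push obj on the list; retry if the head changed under us. */
static void freelist_push(struct object *obj)
{
	long old = atomic_load_explicit(&freelist_head, memory_order_relaxed);

	do {
		obj->next = (struct object *)old;	/* long -> pointer */
	} while (!atomic_compare_exchange_weak_explicit(&freelist_head, &old,
			(long)obj,			/* pointer -> long */
			memory_order_relaxed, memory_order_relaxed));
}

/* Pop the first object, or return NULL when the list is empty. */
static struct object *freelist_pop(void)
{
	long old = atomic_load_explicit(&freelist_head, memory_order_relaxed);
	struct object *obj;

	do {
		obj = (struct object *)old;
		if (!obj)
			return NULL;
	} while (!atomic_compare_exchange_weak_explicit(&freelist_head, &old,
			(long)obj->next,
			memory_order_relaxed, memory_order_relaxed));
	return obj;
}
```

Pushing a and then b, then popping twice, returns b and then a; a third
pop returns NULL.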

> 2. cmpxchg_local does generate the "lock" prefix. It should not do that.
>    Without fixes to cmpxchg_local we cannot expect maximum performance.
> 

Yup, see the patch I just posted for this.

> 3. The approach is x86 centric. It relies on a cmpxchg that does not
>    synchronize with memory used by other cpus and therefore is more
>    lightweight. As far as I know the IA64 cmpxchg cannot do that.
>    Neither several other processors. I am not sure how cmpxchgless
>    platforms would use that. We need a detailed comparison of
>    interrupt enable /disable vs. cmpxchg cycle counts for cachelines in
>    the cpu cache to evaluate the impact that such a change would have.
> 
>    The cmpxchg (or its emulation) does not need any barriers since the
>    accesses can only come from a single processor. 
> 

Yes, the expected improvements are as follows:
x86, x86_64: much faster due to non-LOCKed cmpxchg
alpha: should be faster due to memory barrier removal
mips: memory barriers removed
powerpc 32/64: memory barriers removed

On other architectures, either there is no better implementation than
the standard atomic cmpxchg or it just has not been implemented.

I guess a test series would tell us whether this is an interesting way
to go. It should show how much improvement is seen on the optimized
architectures (local cmpxchg vs interrupt enable/disable), and how the
standard cmpxchg compares to interrupt disable/enable on the
architectures where we can't do better than the standard cmpxchg. I
would be happy to do these tests, but I don't have the hardware handy.
I provide a test module in this email to get these characteristics from
various architectures.

> Mathieu measured a significant performance benefit coming from not using
> interrupt enable / disable.
> 
> Some rough processor cycle counts (anyone have better numbers?)
> 
> 	STI	CLI	CMPXCHG
> IA32	36	26	1 (assume XCHG == CMPXCHG, sti/cli also need stack pushes/pulls)
> IA64	12	12	1 (but ar.ccv needs 11 cycles to set comparator,
> 			need register moves to preserve processors flags)
> 

The measurements I get (in cycles):

              enable interrupts (STI)   disable interrupts (CLI)   local CMPXCHG
IA32 (P4)     112                       82                         26
x86_64 AMD64  125                       102                        19
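
Putting those numbers side by side: assuming a hot path pays for one
interrupt disable plus one interrupt enable, versus a single local
cmpxchg, the pair costs 194 cycles on the P4 and 227 on the AMD64,
against 26 and 19, i.e. roughly 7.5x and 12x. The sketch below only
redoes that arithmetic in userspace; struct cycle_costs and the helper
names are made up for the example.

```c
#include <stdio.h>

/* Hypothetical container for the cycle counts measured above. */
struct cycle_costs {
	const char *cpu;
	unsigned int sti;		/* enable interrupts */
	unsigned int cli;		/* disable interrupts */
	unsigned int local_cmpxchg;	/* non-LOCKed cmpxchg */
};

static const struct cycle_costs measured[] = {
	{ "IA32 (P4)",    112,  82, 26 },
	{ "x86_64 AMD64", 125, 102, 19 },
};

/* Cost of one disable/enable pair, in cycles. */
static unsigned int irq_pair_cost(const struct cycle_costs *c)
{
	return c->sti + c->cli;
}

/* How many times cheaper one local cmpxchg is than the pair. */
static double speedup(const struct cycle_costs *c)
{
	return (double)irq_pair_cost(c) / c->local_cmpxchg;
}

/* Print one comparison line per measured CPU. */
void print_comparison(void)
{
	size_t i;

	for (i = 0; i < sizeof(measured) / sizeof(measured[0]); i++)
		printf("%s: irq pair %u cycles, local cmpxchg %u cycles, ratio %.1f\n",
		       measured[i].cpu, irq_pair_cost(&measured[i]),
		       measured[i].local_cmpxchg, speedup(&measured[i]));
}
```

This says nothing about cache misses or the retry path; it only compares
the raw instruction costs from the table.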

> Looks like STI/CLI is pretty expensive and it seems that we may be able to
> optimize the alloc / free hotpath quite a bit if we could drop the 
> interrupt enable / disable. But we need some measurements.
> 
> 
> Draft of a new patch:
> 
> SLUB: Single atomic instruction alloc/free using cmpxchg_local
> 
> A cmpxchg allows us to avoid disabling and enabling interrupts. The cmpxchg
> is optimal to allow operations on per cpu freelist. We can stay on one
> processor by disabling preemption() and allowing concurrent interrupts
> thus avoiding the overhead of disabling and enabling interrupts.
> 
> Pro:
> 	- No need to disable interrupts.
> 	- Preempt disable /enable vanishes on non preempt kernels
> Con:
>         - Slightly complexer handling.
> 	- Updates to atomic instructions needed
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 

Test local cmpxchg vs int disable/enable. Please run on a 2.6.22 kernel
(or recent 2.6.21-rcX-mmX) (with my cmpxchg local fix patch for x86_64).
Make sure the TSC reads (get_cycles()) are reliable on your platform.

Mathieu

/* test-cmpxchg-nolock.c
 *
 * Compare local cmpxchg with irq disable / enable.
 */

#include <linux/jiffies.h>
#include <linux/compiler.h>
#include <linux/init.h>
#include <linux/module.h>
#include <linux/calc64.h>
#include <asm/timex.h>
#include <asm/system.h>

#define NR_LOOPS 20000

int test_val = 0;

static void do_test_cmpxchg(void)
{
	int ret;
	unsigned long flags;	/* local_irq_save() expects unsigned long */
	unsigned int i;
	cycles_t time1, time2, time;
	long rem;

	local_irq_save(flags);
	preempt_disable();
	time1 = get_cycles();
	for (i = 0; i < NR_LOOPS; i++) {
		ret = cmpxchg_local(&test_val, 0, 0);
	}
	time2 = get_cycles();
	local_irq_restore(flags);
	preempt_enable();
	time = time2 - time1;

	printk(KERN_ALERT "test results: time for non locked cmpxchg\n");
	printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
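	/* cycles_t is not unsigned long long on every arch: cast for %llu */
	printk(KERN_ALERT "total time: %llu\n", (unsigned long long)time);
	time = div_long_long_rem(time, NR_LOOPS, &rem);
	printk(KERN_ALERT "-> non locked cmpxchg takes %llu cycles\n",
					(unsigned long long)time);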
	printk(KERN_ALERT "test end\n");
}

/*
 * This test will have a higher standard deviation due to incoming interrupts.
 */
static void do_test_enable_int(void)
{
	unsigned long flags;	/* local_irq_save() expects unsigned long */
	unsigned int i;
	cycles_t time1, time2, time;
	long rem;

	local_irq_save(flags);
	preempt_disable();
	time1 = get_cycles();
	for (i = 0; i < NR_LOOPS; i++) {
		local_irq_restore(flags);
	}
	time2 = get_cycles();
	local_irq_restore(flags);
	preempt_enable();
	time = time2 - time1;

	printk(KERN_ALERT "test results: time for enabling interrupts (STI)\n");
	printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
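	/* cycles_t is not unsigned long long on every arch: cast for %llu */
	printk(KERN_ALERT "total time: %llu\n", (unsigned long long)time);
	time = div_long_long_rem(time, NR_LOOPS, &rem);
	printk(KERN_ALERT "-> enabling interrupts (STI) takes %llu cycles\n",
					(unsigned long long)time);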
	printk(KERN_ALERT "test end\n");
}

static void do_test_disable_int(void)
{
	unsigned long flags, flags2;
	unsigned int i;
	cycles_t time1, time2, time;
	long rem;

	local_irq_save(flags);
	preempt_disable();
	time1 = get_cycles();
	for (i = 0; i < NR_LOOPS; i++) {
		local_irq_save(flags2);
	}
	time2 = get_cycles();
	local_irq_restore(flags);
	preempt_enable();
	time = time2 - time1;

	printk(KERN_ALERT "test results: time for disabling interrupts (CLI)\n");
	printk(KERN_ALERT "number of loops: %d\n", NR_LOOPS);
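	/* cycles_t is not unsigned long long on every arch: cast for %llu */
	printk(KERN_ALERT "total time: %llu\n", (unsigned long long)time);
	time = div_long_long_rem(time, NR_LOOPS, &rem);
	printk(KERN_ALERT "-> disabling interrupts (CLI) takes %llu cycles\n",
				(unsigned long long)time);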
	printk(KERN_ALERT "test end\n");
}



static int ltt_test_init(void)
{
	printk(KERN_ALERT "test init\n");
	
	do_test_cmpxchg();
	do_test_enable_int();
	do_test_disable_int();
	return -EAGAIN; /* Returning an error unloads the module right away */
}

static void ltt_test_exit(void)
{
	printk(KERN_ALERT "test exit\n");
}

module_init(ltt_test_init);
module_exit(ltt_test_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Mathieu Desnoyers");
MODULE_DESCRIPTION("Cmpxchg local test");

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68