On Thu, Apr 28, 2005 at 02:29:26PM -0400, Benjamin LaHaise wrote:
> Please review the following series of patches for unifying the
> semaphore implementation across all architectures (not posted as
> they're about 350K), as they have only been tested on x86-64. The
> code generated is functionally identical to the earlier i386
> variant, but since gcc has no way of taking condition codes as
> results, there are two additional instructions inserted from the
> use of generic atomic operations. All told the >6000 lines of code
> deleted makes for a much easier job for subsequent patches changing
> semaphore functionality. Cheers,
I'm not sure why we're doing this, apart from a desire to unify stuff.
What happened to efficiency and performance?
It is my understanding that the inline part of the semaphore
implementation was one of the critical areas - critical enough to
warrant coding it in assembly for some people.
So, I set about testing this implementation against the ARM
implementation. For this, I used this code:
void it(struct semaphore *sem, int *val)
{
down(sem);
*val = *val + 1;
up(sem);
}
which is a relatively simple test case.
The ARM assembly implementation for this gave:
str lr, [sp, #-4]!
@ down_op
mrs ip, cpsr @ local_irq_save
orr lr, ip, #128
msr cpsr_c, lr
ldr lr, [r0]
subs lr, lr, #1
str lr, [r0]
msr cpsr_c, ip @ local_irq_restore
movmi ip, r0
blmi __down_failed
ldr r3, [r1, #0]
add r3, r3, #1
str r3, [r1, #0]
@ up_op
mrs ip, cpsr @ local_irq_save
orr lr, ip, #128
msr cpsr_c, lr
ldr lr, [r0]
adds lr, lr, #1
str lr, [r0]
msr cpsr_c, ip @ local_irq_restore
movle ip, r0
blle __up_wakeup
ldr pc, [sp], #4
Stack: 1 location
Registers: 3 on top of the two arguments
Instructions: 23, 23 in the common execution path.
The atomic-op based implementation gave:
stmfd sp!, {r4, r5, lr}
mov r5, r1
mov r4, r0
mrs r2, cpsr @ local_irq_save
orr r1, r2, #128
msr cpsr_c, r1
ldr r3, [r0, #0]
sub r3, r3, #1
str r3, [r0, #0]
msr cpsr_c, r2 @ local_irq_restore
cmp r3, #0
blt .L8
.L4: ldr r3, [r5, #0]
add r3, r3, #1
str r3, [r5, #0]
mrs r2, cpsr @ local_irq_save
orr r1, r2, #128
msr cpsr_c, r1
ldr r3, [r4, #0]
add r3, r3, #1
str r3, [r4, #0]
msr cpsr_c, r2 @ local_irq_restore
cmp r3, #0
mov r0, r4
ldmgtfd sp!, {r4, r5, pc}
ldmfd sp!, {r4, r5, lr}
b up_wakeup
.L8: bl down_failed
b .L4
Stack: 3 locations
Registers: 5 on top of the two arguments
Instructions: 29, 27 in the common execution path.
So, Ben's implementation is more expensive in terms of stack usage,
register usage, and number of instructions.
Why is this? The answer is that gcc is unable to properly optimise
for ARM - I've no idea why. One such simple thing is that the
sequence:
blt .L8
.L4:
...
.L8: bl down_failed
b .L4
can be easily rewritten as:
bllt down_failed
Another reason for the extra code bloat is that GCC can't propagate
PSR flags between the sub/add instructions across a memory barrier
(which local_irq_restore is.)
This makes the ARM assembly code implementation superior to any unified
version, no matter how you look at it from efficiency or performance
points of view. The _only_ win is that of unification.
However, given that the kernel has bloated itself by 300K per stable
kernel series on ARM for the same functionality, I'd prefer to keep
my more efficient higher performance smaller code size semaphores
please.
--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]