Andi Kleen wrote:
>>The SSE clear_page function is almost twice as fast as the kernel's
>>current clear_page, while the copy_page implementation is roughly a
>>third faster. This is likely because SSE instructions can keep the
>>256-bit-wide L2 cache bus at higher utilisation than 64-bit movs can.
>>Comments?
>>
>>
>
>Any use of write combining is wrong here because it forces
>the destination out of cache, which causes performance issues later on.
>Believe me, we went through this years ago.
>
>If you can code up a better function for P4 that does not use
>write combining, I would be happy to add it. I never tuned the functions
>for P4.
>
>One simple experiment would be to just test if P4 likes the
>simple rep ; movsq / rep ; stosq loops and enable them.
>
>
No, it doesn't like this sample here at all; I get a segmentation fault
on that run.
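
For reference, the rep ; stosq / rep ; movsq variants suggested above could
be coded up roughly like this (a minimal user-space sketch with made-up
function names, not the kernel's clear_page/copy_page and not necessarily
what the test program here runs):

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096UL

/* hypothetical rep;stosq page clear: RAX=0 stored RCX times at RDI */
static void rep_stosq_clear_page(void *page)
{
        unsigned long cnt = PAGE_SIZE / 8;      /* 512 quadwords per page */

        asm volatile("rep ; stosq"
                     : "+D" (page), "+c" (cnt)
                     : "a" (0UL)
                     : "memory");
}

/* hypothetical rep;movsq page copy: RCX quadwords from RSI to RDI */
static void rep_movsq_copy_page(void *to, const void *from)
{
        unsigned long cnt = PAGE_SIZE / 8;

        asm volatile("rep ; movsq"
                     : "+D" (to), "+S" (from), "+c" (cnt)
                     :
                     : "memory");
}

int main(void)
{
        void *src, *dst;

        if (posix_memalign(&src, PAGE_SIZE, PAGE_SIZE) ||
            posix_memalign(&dst, PAGE_SIZE, PAGE_SIZE))
                return 1;
        memset(src, 0xaa, PAGE_SIZE);

        rep_stosq_clear_page(dst);
        rep_movsq_copy_page(dst, src);
        return memcmp(dst, src, PAGE_SIZE) != 0;
}
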
RUN 1:
SSE test program $Id: fast.c,v 1.6 2000/09/23 09:05:45 arjan Exp $
buffer = 0x2aaaaade7000
clear_page() tests
clear_page function 'warm up run' took 13516 cycles per page
clear_page function 'kernel clear' took 6539 cycles per page
clear_page function '2.4 non MMX' took 6354 cycles per page
clear_page function '2.4 MMX fallback' took 6205 cycles per page
clear_page function '2.4 MMX version' took 6830 cycles per page
clear_page function 'faster_clear_page' took 6240 cycles per page
clear_page function 'even_faster_clear' took 5746 cycles per page
clear_page function 'xmm_clear ' took 4580 cycles per page
Segmentation fault
xmm64.o[9485] general protection rip:400814 rsp:7fffffc74118 error:0
xmm64.o[9486] general protection rip:400814 rsp:7fffff8b1498 error:0
xmm64.o[9487] general protection rip:400814 rsp:7fffffc31848 error:0
RUN 2:
Telling gcc to use processor-specific flags:
gcc -pipe -march=nocona -O2 -o xmm64.o xmm64.c
SSE test program $Id: fast.c,v 1.6 2000/09/23 09:05:45 arjan Exp $
buffer = 0x2aaaaade7000
clear_page() tests
clear_page function 'warm up run' took 13419 cycles per page
clear_page function 'kernel clear' took 6403 cycles per page
clear_page function '2.4 non MMX' took 6290 cycles per page
clear_page function '2.4 MMX fallback' took 6156 cycles per page
clear_page function '2.4 MMX version' took 6605 cycles per page
clear_page function 'faster_clear_page' took 5607 cycles per page
clear_page function 'even_faster_clear' took 5173 cycles per page
clear_page function 'xmm_clear ' took 4307 cycles per page
clear_page function 'xmma_clear ' took 6230 cycles per page
clear_page function 'xmm2_clear ' took 4908 cycles per page
clear_page function 'xmma2_clear ' took 6256 cycles per page
clear_page function 'kernel clear' took 6506 cycles per page
copy_page() tests
copy_page function 'warm up run' took 10352 cycles per page
copy_page function '2.4 non MMX' took 9440 cycles per page
copy_page function '2.4 MMX fallback' took 9300 cycles per page
copy_page function '2.4 MMX version' took 10238 cycles per page
copy_page function 'faster_copy' took 9497 cycles per page
copy_page function 'even_faster' took 9229 cycles per page
copy_page function 'xmm_copy_page_no' took 7810 cycles per page
copy_page function 'xmm_copy_page' took 7397 cycles per page
copy_page function 'xmma_copy_page' took 9430 cycles per page
copy_page function 'v26_copy_page' took 9234 cycles per page
CPU flags on the Intel Pentium 4 640 (x86_64, Gentoo GNU/Linux):
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm
constant_tsc pni monitor ds_cpl est cid cx16 xtpr
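
For comparison, the xmm_* variants presumably boil down to non-temporal
SSE stores along these lines (only a sketch, assuming a movntps-style
write-combining loop; the actual code in Arjan's test program may differ).
movntps bypasses the cache and requires a 16-byte-aligned destination,
which is the write-combining behaviour Andi objects to above:

#include <xmmintrin.h>

#define PAGE_SIZE 4096UL

/* sketch of a non-temporal (write-combining) SSE page clear */
static void xmm_nt_clear_page(void *page)
{
        float *p = page;                        /* must be 16-byte aligned */
        __m128 zero = _mm_setzero_ps();
        unsigned long i;

        for (i = 0; i < PAGE_SIZE / sizeof(float); i += 4)
                _mm_stream_ps(p + i, zero);     /* movntps: bypasses cache */

        _mm_sfence();                           /* order the NT stores */
}
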
Greets
Michael