On Wednesday 06 April 2005 16:18, Richard B. Johnson wrote:
>
> Attached is inline ix86 memcpy() plus test code that tests its
> corner-cases. The in-line code makes no jumps, but uses longword
> copies, word copies and any spare byte copy. It works at all
> offsets, doesn't require alignment but would work fastest if
> both source and destination were longword aligned.
Yours is:
"shr $1, %%ecx\n" \
"pushf\n" \
"shr $1, %%ecx\n" \
"pushf\n" \ <=== not needed
"rep\n" \
"movsl\n" \
"popf\n" \ <=== not needed
"adcl %%ecx, %%ecx\n" \
"rep\n" \
"movsw\n" \
"popf\n" \
"adcl %%ecx, %%ecx\n" \
"rep\n" \
"movsb\n" \
You struggle too much for that movsw.
-mm one (which happen to be mine) is:
"movl %ecx,%4"
"shr $2,%ecx"
"rep ; movsl"
"movl %4,%%ecx"
"andl $3,%%ecx"
"jz 1ft" /* pay 2 byte penalty for a chance to skip microcoded rep */
"rep ; movsb"
"1:"
and I can still drop that jz. It is there just to have
a chance to skip rep movsb, it was measured to be slow
enough to matter. rep movs are a bit slow to start, on small
blocks it is measurable.
However, maybe it is even better without jz,
need to benchmark 'cold path' (i.e. where branch predictor
have no data to predict it) somehow.
--
vda
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]