Hi Roman!
+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src,
+					    u32 s_pitch, u32 height)
+{
+	int i, j;
+
+	if (likely(s_pitch==1))
+		for(i=0; i < height; i++)
+			dst[d_pitch*i] = src[i];
I added the multiply back because gcc (v. 3.3.4) generates the fastest
code if I write it this way. I compiled, inspected the generated assembly
and benchmarked about a dozen variations of the code (benchmark as
previously described).
The special case for s_pitch == 1 saves about 10 ms system time
(770 ms -> 760 ms).
The special case for s_pitch == 2 saves about 270 ms system time
(2120 ms -> 1850 ms) with a 16x30 font.
The third case is for even bigger fonts ... I believe that it will not
often be used, but something like it must be present.
You now have 3 slightly different variants of the same code, which isn't
really an improvement. In my example I showed you how to generate the
first and last version from the same source.
The first version will only be generated when gcc can be sure that
s_pitch is 1.
Therefore you had to explicitly call __fb_pad_aligned_buffer with that
value:
	if (likely(idx == 1))
		__fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
	else
		fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
With the version I propose it's enough to write
	__fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
instead, and you will get good performance for all cases. If the value of
idx/s_pitch is known at compile time, the compiler can and will ignore the
other cases.
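To illustrate the point about compile-time constants, here is a minimal user-space sketch. It is not the kernel code: the u8/u32 typedefs and the names pad_aligned_sketch(), put_glyph_8wide() and put_glyph_any() are made up for this example. When s_pitch is the literal 1, an optimizing compiler is free to reduce the inner loop to a single byte store after inlining; whether it actually does so depends on the compiler version and flags, so inspect the generated assembly before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins for the kernel types (assumption for this sketch). */
typedef uint8_t u8;
typedef uint32_t u32;

/* Generic copy: s_pitch bytes per row, rows d_pitch bytes apart in dst. */
static inline void pad_aligned_sketch(u8 *dst, u32 d_pitch, const u8 *src,
				      u32 s_pitch, u32 height)
{
	u32 i, j;

	/* Rows would overlap in dst otherwise. */
	assert(d_pitch >= s_pitch);

	for (i = 0; i < height; i++) {
		for (j = 0; j < s_pitch; j++)
			dst[j] = *src++;
		dst += d_pitch;
	}
}

/*
 * Hypothetical caller for fonts up to 8 pixels wide: s_pitch is the
 * constant 1, so the inner loop can be specialized after inlining.
 */
static void put_glyph_8wide(u8 *dst, u32 d_pitch, const u8 *src, u32 height)
{
	pad_aligned_sketch(dst, d_pitch, src, 1, height);
}

/* Hypothetical caller where the source pitch is only known at run time. */
static void put_glyph_any(u8 *dst, u32 d_pitch, const u8 *src,
			  u32 s_pitch, u32 height)
{
	pad_aligned_sketch(dst, d_pitch, src, s_pitch, height);
}
```

Both callers share one source-level copy loop; only the constant-argument one gives the compiler the chance to drop the general case.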
If you also want to optimize for other sizes, you might want to always
inline the function. If the function call overhead is the largest part
anyway, the special case for 2 bytes might not be needed anymore.
fb_pad_aligned_buffer() is useful to save some space in cases like
softcursor.
It's also used by some drivers (nvidia and riva), but the authors of
those drivers have to decide if they prefer the inlined version or the
version fixed in fbmem.
BTW this version saves another condition:
static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src,
					   u32 s_pitch, u32 height)
{
	int i, j;

	d_pitch -= s_pitch;
	i = height;
	do {
		/* s_pitch is a few bytes at the most, memcpy is suboptimal */
		j = s_pitch;
		do
			*dst++ = *src++;
		while (--j > 0);
		dst += d_pitch;
	} while (--i > 0);
}
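One thing worth keeping in mind with the do/while form: both loops run their body once before testing the counter, so the condition is only saved if s_pitch and height are never 0. The user-space sketch below makes that assumption explicit and puts a plain for-loop reference next to it for comparison. The typedefs and the names pad_do_while() and pad_for() are invented for this example; it says nothing about which form is faster.

```c
#include <assert.h>
#include <stdint.h>

/* User-space stand-ins for the kernel types (assumption for this sketch). */
typedef uint8_t u8;
typedef uint32_t u32;

/* The do/while variant, transcribed for user space. */
static void pad_do_while(u8 *dst, u32 d_pitch, const u8 *src,
			 u32 s_pitch, u32 height)
{
	int i, j;

	/* Each loop body runs at least once, so 0 is not a valid input. */
	assert(s_pitch >= 1 && height >= 1 && d_pitch >= s_pitch);

	d_pitch -= s_pitch;	/* remaining gap after each copied row */
	i = (int)height;
	do {
		j = (int)s_pitch;
		do
			*dst++ = *src++;
		while (--j > 0);
		dst += d_pitch;
	} while (--i > 0);
}

/* Straightforward indexed reference; also correct for height == 0. */
static void pad_for(u8 *dst, u32 d_pitch, const u8 *src,
		    u32 s_pitch, u32 height)
{
	u32 i, j;

	for (i = 0; i < height; i++)
		for (j = 0; j < s_pitch; j++)
			dst[i * d_pitch + j] = src[i * s_pitch + j];
}
```

For non-zero s_pitch and height the two functions write the same bytes to the same offsets; they differ only in how the empty case is handled.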
I tested that code, together with the following code in bit_putcs():
	if (idx == 1)
		__fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
	else if (idx == 2)
		__fb_pad_aligned_buffer(dst, pitch, src, 2, image.height);
	else
		fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
	dst += width;
It's as fast/slow as your previous version, the measurements are almost
identical.
Let's summarize:
Your version of __fb_pad_aligned_buffer looks much better, but it needs
some not-so-nice conditionals at the call site. My version looks bad, but
it is easier to use and it is _faster_.
cu,
knut