Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, 28 Jul 2005, Linus Torvalds wrote:

> There may be more upsides on other architectures (*cough*ia64*cough*) that 
> have strange scheduling issues and other complexities, but on x86 in 
> particular, the __builtin_xxx() functions tend to be a lot more pain than 
> they are worth. Not only do they have strange limitations (on selection of 

 They can be buggy, sure, just like any code.  If they indeed are we may 
want to avoid them rather than requiring a GCC upgrade, of course.

> opcodes but also for compiler versions), but they aren't well documented, 
> and semantics aren't clear.

 Hmm, that's what's in the GCC info pages for the relevant functions 
(I've omitted the "l" and "ll" variants):

"-- Built-in Function: int __builtin_ffs (unsigned int x)
     Returns one plus the index of the least significant 1-bit of X, or
     if X is zero, returns zero.

 -- Built-in Function: int __builtin_clz (unsigned int x)
     Returns the number of leading 0-bits in X, starting at the most
     significant bit position.  If X is 0, the result is undefined.

 -- Built-in Function: int __builtin_ctz (unsigned int x)
     Returns the number of trailing 0-bits in X, starting at the least
     significant bit position.  If X is 0, the result is undefined."

If that's not enough, then what would be?  I'm serious -- if you find it 
inadequate, then perhaps it could be improved.

> In contrast, the gcc builtins probably match some standard that is not 
> only harder to find, but also has some _other_ definition for what happens 
> for the zero case, so the builtins automatically end up having problems 
> due to semantic mis-match between the CPU and the standard.

 GCC should know the semantics of underlying CPU instructions used and be 
able to optimize expressions for common cases, e.g. like:

	return x == 0 ? 32 : __builtin_ctz(x);

when the CPU provides a "ctz" operation that returns 32 for 0.

> Basic rule: inline assembly is _better_ than random compiler extensions. 
> It's better to have _one_ well-documented extension that is very generic 
> than it is to have a thousand specialized extensions.

 It depends on how many submodel-specific variants of inline assembly you 
need, how many reloads are required for constraints, possibly defeating 
the gain, etc.

 In this particular case "bsf" and "bsr" are notoriously slow for some 
i386 submodels, so using the generic O(log n) algorithm may result in 
better performance for them.  E.g. the execution time for "bsf" for the 
original i386 is 11 + 3n clock ticks (n refers to the resulting bit 
index), "bsr" for the i486 is 7 - 104 ticks (so again 3n), for Pentium -- 
7 - 72 ticks (2n, then).  It does not immediately mean they are worse, but 
they are slow enough for the pessimistic case checking alternatives is not 
unreasonable.

  Maciej
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]
  Powered by Linux