Re: amd64 bitops fix for -Os

On Sat, 29 Oct 2005 19:56:02 -0200 Alexandre Oliva wrote:

> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=171672
> 
> This patch fixes a bug that comes up when compiling the kernel for
> x86_64 optimizing for size.  It affects 2.6.14 for sure, but I'm
> pretty sure many earlier kernels are affected as well.
> 
> The symptom is that, as soon as some change is made to the root
> filesystem (e.g. dmesg > /var/log/dmesg), the kernel mostly hangs.  It
> was not the first time I'd run into this symptom, but this time I
> could track the problem down to enabling size optimizations in the
> kernel build.  It took some time to narrow down the culprit source
> with a binary search, compiling part of the kernel sources with -Os
> and part with -O2, but eventually it was clear that bitops itself was
> to blame, which should have been clear from the soft lockup oops I
> got.
> 
> The problem is that find_first_zero_bit() fails when called with an
> underflown size, because its inline asm assumes at least one iteration
> of scasq will run.  When this does not hold, the conditional branch
> that follows it uses flags from instructions prior to the asm
> statement.
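> 
> Schematically, the problematic pattern looks like this (a simplified
> sketch, not the verbatim kernel asm):
> 
> 	/* rcx = word count, rdi = addr, rax = -1 */
> 	asm volatile(
> 		"  repe; scasq\n"	/* rcx == 0: executes zero times */
> 		"  je 1f\n"		/* ...so je tests stale flags */
> 		/* ... */
> 		"1:\n" /* ... */);
> 
> With a zero count, repe; scasq retires without touching the flags,
> and the je acts on whatever the compiler's preceding code happened to
> leave in them.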
> 
> When optimizing for speed, the generated code is such that the flags
> will have the correct value, because of the side effects on flags of
> the right shift of the size, which survive through to the asm
> statement.  When optimizing for size, however, the mov instruction
> used to initialize %rax with -1 is replaced with a smaller or
> instruction, which modifies the flags and thus breaks the
> zero-trip-count case.
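> 
> Concretely, the difference boils down to this (hand-written
> illustration, not actual GCC output):
> 
> 	movq $-1, %rax		/* -O2: 7 bytes, leaves flags alone */
> 	orq  $-1, %rax		/* -Os: 4 bytes, clobbers the flags */
> 
> Both leave -1 in %rax (or with all-ones always yields all-ones), but
> only the mov preserves what the earlier shift left in the flags.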
> 
> Obviously the asm statement must not rely on the compiler setting up
> flags by chance, so we have to either force the flags to be set
> properly or make sure we run scasq at least once.  In the
> find_first_zero_bit case, this comes at pretty much no cost, since we
> already test size for non-zero, but we used to do that before
> adjusting it from bits to words; the change should have no visible
> effect on
> performance.
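> 
> Spelling out the arithmetic, in case it isn't obvious why testing
> after the conversion catches the underflown range (standalone
> illustration):
> 
> 	unsigned long size = -1UL;	/* underflown bit count */
> 	size += 63;			/* wraps around to 62 */
> 	size >>= 6;			/* 62 >> 6 == 0 words */
> 	if (!size)
> 		return 0;		/* bail out before the asm */
> 
> Any bit count in the range -63..-1 collapses to zero words here,
> whereas the old test on the raw bit count let those values through to
> the asm statement.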
> 
> As for find_first_bit, it's quite likely that the same bug is present
> when it's called by find_next_bit in the same conditions, but
> find_first_bit doesn't even test for zero.  AFAICT, it has just been
> luckier, so I went ahead and added the same guard code to it.  This
> unfortunately adds a test to the fast path, but I don't see how to
> avoid that without auditing all callers.
> 
> I actually introduce a means to guard against these cases in the public
> wrappers, but the BUG_ONs are disabled by default.  I've left a kernel
> running with them enabled for a bit, and they never hit, which is a
> good sign, but I haven't tested it thoroughly or anything.  We could
> probably do away with these new tests by modifying the find_next*bit
> functions so as to not call the find_first*bit functions if they've
> already exhausted the range implied by the size argument.  I'm not
> sure whether that's worth doing, though, so I didn't.
> 
> While staring at the code and trying to figure out what the problem
> was, I removed some needless casts from find_next_zero_bit, by
> constifying the automatic pointer properly, and also moved the actual
> code from find_first_zero_bit to a separate internal function, such
> that we could add the bug-check to the public interface only.
> 
> I also noticed find_first_zero_bit was less efficient than
> find_first_bit in that the former saved and restored rbx, because GCC
> chose that to hold (addr) within the asm statement, instead of using
> the readily-available and caller-saved rsi.  I've thus changed the
> code to prefer rsi, although in a perfect world the compiler would be
> able to figure that out by itself.
> 
> The compiler could do a bit better in find_first_zero_bit: if the
> initial size turns out to be zero, it could return right away, like
> it does in find_first_bit; instead, it sets rdx to zero and jumps to
> the end of the function, where rdx is copied to rax before the return
> statement.  This is a negative effect of the assignment of variable
> res to rdx instead of rax, which gets the register allocator to map
> the pseudo register representing the return value to rdx, requiring a
> copy at the end and preventing (as far as the dumb compiler can see
> :-) the direct use of a return in the zero-size case.  I've verified
> that this is not caused by the additional inline function that I
> introduced.
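> 
> The shape of the code for the zero-size path is roughly (a
> hand-written sketch, not verified compiler output):
> 
> 	xorl %edx, %edx		/* res = 0, but in rdx */
> 	jmp  .Lout		/* must go through the epilogue */
> 	/* ... */
> .Lout:	movq %rdx, %rax		/* copy res to the return register */
> 	ret
> 
> where a direct xorl %eax, %eax; ret would do.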
> 
> I tried to change the use of registers so as to enable the better code
> for this path, but I couldn't come up with anything that was as
> efficient, so I figured I wouldn't try to optimize the exceptional
> path at the expense of the common fast path and left it alone.  If anyone
> can come up with something better, please go ahead.
> 
> 
> Anyhow, with this patch I could run 2.6.14 as found in the Fedora
> development tree, the only other change being to optimize for size.
> 
> 	Signed-off-by: Alexandre Oliva <[email protected]>
> 
> --- arch/x86_64/lib/bitops.c~	2005-10-27 22:02:08.000000000 -0200
> +++ arch/x86_64/lib/bitops.c	2005-10-29 18:24:27.000000000 -0200
> @@ -1,5 +1,11 @@
>  #include <linux/bitops.h>
>  
> +#define BITOPS_CHECK_UNDERFLOW_RANGE 0
> +
> +#if BITOPS_CHECK_UNDERFLOW_RANGE
> +# include <linux/kernel.h>
> +#endif
> +
>  #undef find_first_zero_bit
>  #undef find_next_zero_bit
>  #undef find_first_bit
> @@ -13,11 +19,21 @@
>   * Returns the bit-number of the first zero bit, not the number of the byte
>   * containing a bit.
>   */
> -inline long find_first_zero_bit(const unsigned long * addr, unsigned long size)
> +static inline long
> +__find_first_zero_bit(const unsigned long * addr, unsigned long size)
>  {
>  	long d0, d1, d2;
>  	long res;
>  
> +	/* We must test the size in words, not in bits, because
> +	   otherwise incoming sizes in the range -63..-1 will not run
> +	   any scasq instructions, and then the flags used by the je
> +	   instruction will have whatever random value was in place
> +	   before.  Nobody should call us like that, but
> +	   find_next_zero_bit() does when offset and size are at the
> +	   same word and it fails to find a zero itself.  */
> +	size += 63;
> +	size >>= 6;
>  	if (!size)
>  		return 0;
>  	asm volatile(
> @@ -30,11 +46,22 @@
>  		"  shlq $3,%%rdi\n"
>  		"  addq %%rdi,%%rdx"
>  		:"=d" (res), "=&c" (d0), "=&D" (d1), "=&a" (d2)
> -		:"0" (0ULL), "1" ((size + 63) >> 6), "2" (addr), "3" (-1ULL),
> -		 [addr] "r" (addr) : "memory");
> +		:"0" (0ULL), "1" (size), "2" (addr), "3" (-1ULL),
> +		 /* Any register here would do, but GCC tends to
> +		    prefer rbx over rsi, even though rsi is readily
> +		    available and doesn't have to be saved.  */
> +		 [addr] "S" (addr) : "memory");
>  	return res;
>  }
>  
> +long find_first_zero_bit(const unsigned long * addr, unsigned long size)
> +{
> +#if BITOPS_CHECK_UNDERFLOW_RANGE
> +	BUG_ON (size + 63 < size);
> +#endif
> +	return __find_first_zero_bit (addr, size);
> +}
> +
>  /**
>   * find_next_zero_bit - find the first zero bit in a memory region
>   * @addr: The address to base the search on
> @@ -43,7 +70,7 @@
>   */
>  long find_next_zero_bit (const unsigned long * addr, long size, long offset)
>  {
> -	unsigned long * p = ((unsigned long *) addr) + (offset >> 6);
> +	const unsigned long * p = addr + (offset >> 6);
>  	unsigned long set = 0;
>  	unsigned long res, bit = offset&63;
>  
> @@ -63,8 +90,8 @@
>  	/*
>  	 * No zero yet, search remaining full words for a zero
>  	 */
> -	res = find_first_zero_bit ((const unsigned long *)p,
> -				   size - 64 * (p - (unsigned long *) addr));
> +	res = __find_first_zero_bit (p, size - 64 * (p - addr));
> +
>  	return (offset + set + res);
>  }
>  
> @@ -74,6 +101,17 @@
>  	long d0, d1;
>  	long res;
>  
> +	/* We must test the size in words, not in bits, because
> +	   otherwise incoming sizes in the range -63..-1 will not run
> +	   any scasq instructions, and then the flags used by the jz
> +	   instruction will have whatever random value was in place
> +	   before.  Nobody should call us like that, but
> +	   find_next_bit() does when offset and size are at the same
> +	   word and it fails to find a one itself.  */
> +	size += 63;
> +	size >>= 6;
> +	if (!size)
> +		return 0;
>  	asm volatile(
>  		"   repe; scasq\n"
>  		"   jz 1f\n"
> @@ -83,8 +121,7 @@
>  		"   shlq $3,%%rdi\n"
>  		"   addq %%rdi,%%rax"
>  		:"=a" (res), "=&c" (d0), "=&D" (d1)
> -		:"0" (0ULL),
> -		 "1" ((size + 63) >> 6), "2" (addr),
> +		:"0" (0ULL), "1" (size), "2" (addr),
>  		 [addr] "r" (addr) : "memory");
>  	return res;
>  }
> @@ -99,6 +136,9 @@
>   */
>  long find_first_bit(const unsigned long * addr, unsigned long size)
>  {
> +#if BITOPS_CHECK_UNDERFLOW_RANGE
> +	BUG_ON (size + 63 < size);
> +#endif
>  	return __find_first_bit(addr,size);
>  }

Thanks.  I'll test this early next week.
I always use -Os and (on x86_64) I always get a SOFT LOCKUP
during boot or when loading X.  For me, it's always in ext3fs,
doing some flavor of bitops, so I have high hopes for this fixing
that problem.


---
~Randy