Re: [PATCH] speed up on find_first_bit for i386 (let compiler do the work)

On Thu, 28 Jul 2005, Steven Rostedt wrote:
> 
> OK, I guess when I get some time, I'll start testing all the i386 bitop
> functions, comparing the asm with the gcc versions.  Now could someone
> explain to me what's wrong with testing hot cache code. Can one
> instruction retrieve from memory better than others?

There are a few issues:

 - trivially: code/data size. Being smaller automatically means faster if
   you're cold-cache. If you do cycle tweaking of something that is 
   possibly commonly in the L2 cache or further away, you might as well
   consider one byte of code-space to be equivalent to one cycle (an L1 I$ 
   miss can easily take 50+ cycles - the L1 fill cost may be just a small 
   part of that, but the pipeline problem it causes can be deadly).

 - branch prediction: cold-cache is _different_ from hot-cache. hot-cache 
   predicts the stuff dynamically, cold-cache has different rules (and it 
   is _usually_ "forward predicts not-taken, backwards predicts taken", 
   although you can add static hints if you want to on most architectures).

   So hot-cache may look very different indeed - the "normal" case might 
   be that you mispredict all the time because the static prediction is 
   wrong, but then a hot-cache benchmark will predict perfectly.

 - access patterns. This only matters if you look at algorithmic changes. 
   Hashes have atrocious locality, but on the other hand, if you know that 
   the access pattern is cold, a hash will often have a minimum number of 
   accesses. 

but no, you don't have "some instructions are better at reading from 
memory" for regular integer code (FP often has other issues, like reading 
directly from L2 without polluting L1, and then there are obviously 
prefetch hints).
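
As a rough illustration of the static hints mentioned in the branch
prediction point, here is a minimal sketch using the GCC/Clang builtin
__builtin_expect. The likely()/unlikely() macros follow the spelling the
kernel uses; the function name first_set_bit_hinted() and the assumption
that most words in the bitmap are zero are made up for this example. With
GCC on x86 the hint mainly affects code layout, so that the expected path
is the fall-through one, which matches the "forward predicts not-taken"
rule when nothing is cached.

```c
#include <limits.h>
#include <stddef.h>

/* Same spelling as the kernel's macros; GCC/Clang only. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/*
 * Hypothetical helper: return the index of the first set bit in an array
 * of words, or nwords * bits-per-word if no bit is set.
 *
 * Assumption for this sketch: a non-zero word is the rare case, so the
 * early return is marked unlikely and the compiler keeps the scanning
 * loop as the straight-line path.
 */
static size_t first_set_bit_hinted(const unsigned long *words, size_t nwords)
{
    const size_t bits = sizeof(unsigned long) * CHAR_BIT;

    for (size_t i = 0; i < nwords; i++) {
        if (unlikely(words[i] != 0))
            return i * bits + (size_t)__builtin_ctzl(words[i]);
    }
    return nwords * bits;
}
```

If the assumption is wrong for the real workload (say, the first word is
almost always non-zero), the hint should be flipped or dropped; a
hot-cache benchmark will not show the difference, since the dynamic
predictor learns the branch either way.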

Now, in the case of your "rep scas" conversion, the reason I applied it
was that it was obviously a clear win (rep scas is known bad, and has
register allocation issues too), so I'm _not_ claiming that the above
issues were true in that case. I just wanted to say that in general it's 
nice (but often quite hard) if you can give cold-cache numbers too (for 
example, using the cycle counter and being clever can actually give that).
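
One crude way to get a cold-cache number in user space is sketched below.
It is not how the numbers in this thread were produced. All names
(read_ticks, evict_caches, time_cold, time_hot, first_nonzero_word) and
the 64 MiB eviction buffer size are assumptions for the example. The idea
is to walk a buffer assumed to be larger than the last-level cache before
timing a single call with the cycle counter. This mostly evicts data;
instruction cache and branch predictor state are not reliably flushed,
and rdtsc is not a serializing instruction, so treat the results as rough.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* x86 cycle counter (TSC). */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Fallback for other architectures: nanoseconds, not cycles. */
static inline uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}
#endif

/* Assumed to be larger than the last-level cache on the test machine. */
#define EVICT_BYTES ((size_t)64 * 1024 * 1024)

/* Touch one byte per 64-byte stride to push older lines out of the caches. */
static void evict_caches(void)
{
    static volatile unsigned char *buf;

    if (!buf) {
        buf = malloc(EVICT_BYTES);
        if (!buf)
            return; /* no eviction possible; the result will look "hot" */
    }
    for (size_t i = 0; i < EVICT_BYTES; i += 64)
        buf[i] = (unsigned char)i;
}

typedef size_t (*scan_fn)(const unsigned long *, size_t);

/* Example workload: index of the first non-zero word, or n if none. */
static size_t first_nonzero_word(const unsigned long *words, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (words[i])
            break;
    return i;
}

/* Time one call right after evicting the caches. */
static uint64_t time_cold(scan_fn fn, const unsigned long *w, size_t n,
                          size_t *result)
{
    evict_caches();
    uint64_t t0 = read_ticks();
    *result = fn(w, n);
    return read_ticks() - t0;
}

/* Time one call after a warm-up call has pulled code and data into cache. */
static uint64_t time_hot(scan_fn fn, const unsigned long *w, size_t n,
                         size_t *result)
{
    (void)fn(w, n);
    uint64_t t0 = read_ticks();
    *result = fn(w, n);
    return read_ticks() - t0;
}
```

A single cold sample is noisy, so in practice one would repeat
evict-then-time many times and look at the distribution, and compare it
against the time_hot() numbers for the same function.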

		Linus