Re: [rfc patch] i386: don't save eflags on task switch

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Sat, 4 Nov 2006, Benjamin LaHaise wrote:

> On Sat, Nov 04, 2006 at 11:09:42AM -0800, Zachary Amsden wrote:
> > Every processor I've ever measured it on, popf is slower.  On P4, for 
> > example, pushf is 6 cycles, and popf is 54.  On Opteron, it is 2 / 12.  
> > On Xeon, it is 7 / 91.
> 
> pushf has to wait until all flag dependancies can be resolved.  On the 
> P4 with >100 instructions in flight, that can take a long time.

That's like saying that "addc has to wait until all flag dependencies can 
be resolved". Not so. It just needs to execute _after_, but it pipelines 
perfectly fine. There's no need for any pipeline flush, nothing like that. 
Yes, an "addc" needs the flags from previous instructions, but that 
doesn't really make an addc any slower - it just adds a data dependency.

Same goes for pushf. Sure - it has a data dependency on the flags. The 
flags are usually there one cycle (ONE CYCLE!) after the actual value has 
been computed, so you should expect it to be fast.

> Popf on the other hand has no dependancies on outstanding instructions 
> as it resets the machine state.

Right. Popf generally has to flush the pipeline, _exactly_ because it 
actually changes machine state that normally isn't carried along, and that 
may even be part of the trace-cache state on P4 for all I know.

So you got the logic totally wrong. pushf is fast, popf is slow. pushf can 
pipeline, popf would quite commonly _flush_ the pipeline. 

Dependencies are _cheap_. What's expensive is state changes.

Anyway, I'm right, you're wrong. Why even bother arguing about it? Why are 
people arguing against REAL HARD NUMBERS that were posted by Zach?

Btw, for the P4 people out there who can't admit that they are wrong, just 
run this small program.

	#define pushfl(value) \
		asm volatile("pushfl ; popl %0":"=r" (value));
	#define popfl(value) \
		asm volatile("push %0 ; popfl": :"r" (value));
	#define rdtsc(value) \
		asm volatile("rdtsc":"=A" (value))
	
	int main(int argc, char ** argv)
	{
		unsigned long long a,b;
		unsigned long tmp;
	
		rdtsc(a);
		rdtsc(b);
		printf("%lld\n", b-a);
		rdtsc(a);
		pushfl(tmp);
		rdtsc(b);
		printf("%lld\n", b-a);
		rdtsc(a);
		popfl(tmp);
		rdtsc(b);
		printf("%lld\n", b-a);
		return 0;
	}

For me, when compiled with -O2, it results in

	84
	88
	132

which basically says: a "rdtsc->rdtsc" is 84 cycles, putting a "pushfl" in 
between is another _4_ cycles, and putting a "popfl" in between is about 
another 48 cycles. 

Now, those numbers aren't scientific, but they tell you something.

Now, tell me how popfl is faster again.

But dammit, back it up with real numbers and real logic this time. 

		Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux