Zachary Amsden wrote:
Benjamin LaHaise wrote:
On Sat, Nov 04, 2006 at 11:09:42AM -0800, Zachary Amsden wrote:
Every processor I've ever measured it on, popf is slower. On P4, for
example, pushf is 6 cycles, and popf is 54. On Opteron, it is 2 /
12. On Xeon, it is 7 / 91.
pushf has to wait until all flag dependencies can be resolved. On the
P4 with >100 instructions in flight, that can take a long time. Popf,
on the other hand, has no dependencies on outstanding instructions, as
it resets the machine state.
Yes, but as Linus points out, popf is most likely microcoded, and thus
much slower. Flag dependency is not unique to pushf; many much more
common instructions (adc, jcc, sbb, cmovcc, movs, stos, ...) have flag
dependencies, which can still be pipeline forwarded. I think the raw
cycle counts speak for themselves, despite the fact that I only measured
instruction latency, not throughput. Using a branch to eliminate a
pushf is thus probably not a win in most cases.
The "sane" decomposition of popf into uops is something like this:
- Memory read
- Mask bits that are immutable in the current mode
- Trap to microcode on changing any bit that alters the pipeline state
The "trap to microcode" can obviously be arbitrarily expensive. So when
timing popf, it's also important to know *which* bits change.
The simplest case, obviously, is when no flags change. I still *fully*
expect that to be more painful than pushf ever is.
-hpa