Re: light weight counters: race free through local_t?

Christoph Lameter wrote:

Good to know. But we would run into trouble if the atomic counters would be contended.


Why do you think an atomic operation is more expensive in the contended case?

Let's make the following assumption:
Every time a CPU wants to increment the counter, the counter's cache line is
in some other CPU's cache.

Case of an atomic operation:

1. Issuing an atomic operation (apparently can be absorbed)
2. Goes down through the OzQ of the L2 (no L1 action)
  (assuming the queue is "sufficiently empty")
3. L2 miss
4. Fetching data, obtaining exclusivity - very long
5. L2 is ready - do the atomic operation to the L2 entry - modified state
  (min 5 clock cycles for normal L2 access + 5 for the atomic operation)

Case of __get_per_cpu(var)++:

1. Issuing an "ld4" (can be absorbed)
2. Goes down through the OzQ of the L2 (L1 miss)
  (assuming the queue is "sufficiently empty")
3. L2 miss
4. Fetching data, shared state - very long
5. L2 is ready - copy to L1 - into rx
  (min 5 clock cycles due to the L2)
6. Increment (1 cycle)
7. Post the store to the OzQ of the L2 (L1 updated)
  (assuming the queue is "sufficiently empty")
8. L2 hit - obtain exclusivity - moderately long
9. L2 is ready - update it - modified state
  (min 5 clock cycles)

Please note:
- the additional bus operation (address only) in step 8
- a third CPU can invalidate our copy of the line between steps 5 and 8
 (very low probability)

(In case of strong contention, efforts have to be made to avoid cache
line sharing with other frequently used data - as usual...)
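
For reference, the two increment styles compared above can be sketched in
portable C11. This is only an illustration, not kernel code: the names
shared_counter, percpu_slot and NR_CPUS_SKETCH are made up for the example,
the 128-byte alignment is an assumption about the L2 line size, and whether
the atomic add compiles to a single fetchadd4 depends on the compiler.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Case 1: one shared counter, one atomic read-modify-write per increment.
 * (Assumption: on IA64 a compiler may emit fetchadd4 for this.) */
static _Atomic uint32_t shared_counter;

static inline void shared_counter_inc(void)
{
    atomic_fetch_add(&shared_counter, 1);
}

/* Case 2: per-CPU style counter, plain load / add / store.
 * Each slot is padded to its own (assumed 128-byte) cache line so that
 * slots do not share a line with each other. */
#define NR_CPUS_SKETCH 4            /* hypothetical CPU count */

struct percpu_slot {
    _Alignas(128) uint32_t count;
};

static struct percpu_slot percpu_counter[NR_CPUS_SKETCH];

static inline void percpu_counter_inc(unsigned int cpu)
{
    uint32_t tmp = percpu_counter[cpu].count;  /* the "ld4" (steps 1-5) */
    tmp += 1;                                  /* the increment (step 6) */
    percpu_counter[cpu].count = tmp;           /* the store (steps 7-9)  */
}

/* Reading the per-CPU counter means summing all slots. */
static inline uint32_t percpu_counter_sum(void)
{
    uint32_t sum = 0;
    for (unsigned int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += percpu_counter[cpu].count;
    return sum;
}
```

The plain load / add / store sequence is only safe if nothing else can touch
the slot between the load and the store, which is why it needs protection
that the atomic version does not.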

Hmm... What about side effects such as pipeline stalls? fetchadd is semaphore operation. Typically we use acquire semantics for volatiles. Here the fetchadd has release semantics.

If we would use release semantics then the fetchadd would require all prior accesses to be complete.


... or ".rel".
Yes, it is a one-way barrier.
However, if there is not too much in the OzQ, it does not have much
impact.
(It is very difficult to fill the OzQ: it takes a SW pipeline or N independent
loads / stores without ";;", neither of which is very frequent in the kernel.)

Acquire semantics may be easier. But the best would be a fetchadd without any serialization that would be like the inc/dec memory on i386, which does not exist in the IA64 instruction set.
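
The ordering variants discussed here can be written down with C11 memory
orders. This is a rough sketch only: the function names are invented for the
example, and the mapping of acquire / release onto fetchadd.acq /
fetchadd.rel is an assumption about what a compiler would generate. The
relaxed form is the "no serialization" increment asked for above; C11 lets
you express it, but since IA64 has no unordered fetchadd the compiler still
has to pick one of the ordered forms there.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t counter;

/* Acquire: accesses after the add may not be moved before it. */
static inline uint32_t counter_inc_acquire(void)
{
    return atomic_fetch_add_explicit(&counter, 1, memory_order_acquire);
}

/* Release: all prior accesses must be complete before the add
 * becomes visible (the one-way barrier discussed above). */
static inline uint32_t counter_inc_release(void)
{
    return atomic_fetch_add_explicit(&counter, 1, memory_order_release);
}

/* Relaxed: atomic, but no ordering against surrounding accesses. */
static inline uint32_t counter_inc_relaxed(void)
{
    return atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}
```

Each function returns the value the counter held before the increment, as
fetchadd does.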


A second thought about the case of __get_per_cpu(var)++:

If it is coded by hand, I can use "ld4.bias" to obtain
exclusivity immediately - no need for step 8.
However, you cannot avoid the cost of the protection around this
increment.
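
To show where that protection cost comes from, here is a user-space mock-up
of a protected per-CPU increment. Everything in it is a stand-in:
sketch_preempt_disable() and sketch_preempt_enable() are hypothetical
placeholders for whatever the kernel would really use (for instance
preempt_disable() / preempt_enable(), or disabling interrupts), and
this_cpu_counter stands for the current CPU's slot.

```c
#include <stdint.h>

/* Stand-in for the kernel's preemption count (hypothetical). */
static int preempt_depth;

/* Stand-in for the current CPU's counter slot (hypothetical). */
static uint32_t this_cpu_counter;

static void sketch_preempt_disable(void) { preempt_depth++; }
static void sketch_preempt_enable(void)  { preempt_depth--; }

/* The load / add / store must not be interrupted by anything that
 * could touch the same slot, so it is bracketed by the protection. */
static void protected_inc(void)
{
    sketch_preempt_disable();
    this_cpu_counter++;
    sketch_preempt_enable();
}
```

Even with "ld4.bias" removing step 8, the two bracketing calls remain on
every increment, whereas the atomic version needs none.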

I still prefer the atomic operations.

Regards,

Zoltan

