[x86_64] Strange oprofile results on access to per_cpu data

Hi Andi

While doing some oprofile analysis, I got this result on ip_route_input() :one particular instruction seems to spend a lot of cycles.


machine is a dual core 285, 2.6 GHz

/*

* Command line: opannotate -a event:CPU_CLK_UNHALTED/usr/src/linux-2.6.18/vmlinux

 *
 * Interpretation of command line:
 * Output annotated assembly listing with samples
 *
 * CPU: AMD64 processors, speed 2600.01 MHz (estimated)

* Counted CPU_CLK_UNHALTED events (Cycles outside of halt state) with a unitmask of 0x00 (No unit mask) count 10000

 */

ffffffff803e9860 <ip_route_input>: /* ip_route_input total: 543098  2.5487 */

/* relevant extract from ip_route_input() */
   600  0.0028 :ffffffff803e98b3:       mov    $0xffffffff806375e0,%rsi
   883  0.0041 :ffffffff803e98ba:       mov    %rax,%rdx
     6 2.8e-05 :ffffffff803e98bd:       mov    %rsi,%rcx
  2281  0.0107 :ffffffff803e98c0:       cmp    0xf0(%rdx),%r12d

9767 0.0458 :ffffffff803e98c7: jne ffffffff803e98f1<ip_route_input+0x91>

   108 5.1e-04 :ffffffff803e98c9:       cmp    0xf4(%rdx),%r14d

41459 0.1946 :ffffffff803e98d0: jne ffffffff803e98f1<ip_route_input+0x91>

   549  0.0026 :ffffffff803e98d2:       cmp    0xec(%rdx),%ebx

88604 0.4158 :ffffffff803e98d8: jne ffffffff803e98f1<ip_route_input+0x91>

   478  0.0022 :ffffffff803e98da:       mov    0xe8(%rdx),%eax
   315  0.0015 :ffffffff803e98e0:       test   %eax,%eax

241 0.0011 :ffffffff803e98e2: jne ffffffff803e98f1<ip_route_input+0x91>

   248  0.0012 :ffffffff803e98e4:       cmp    0xfc(%rdx),%r13b

  2314  0.0109 :ffffffff803e98eb:       je     ffffffff803ea3b3
################ BEGIN
   370  0.0017 :ffffffff803e98f1:       mov    %gs:0x8,%rax
222769  1.0454 :ffffffff803e98fa:       incl   0x38(%rcx,%rax,1)
################ END
     6 2.8e-05 :ffffffff803e98fe:       mov    (%rdx),%rdx
   833  0.0039 :ffffffff803e9901:       test   %rdx,%rdx

__raw_get_cpu_var(rt_cache_stat).field++ appears to be very expensive

(about 18000 RT_CACHE_STAT_INC(in_hlist_search); are done per second, not animpressive count in fact)


Are segment prefixes that expensive ?
Or is it only the first access to %gs:8 that is doing extra checks ?
(because other RT_CACHE_STAT_INC() done in the same function dont have this cost)
Or is it the loading of %rcx (done in ffffffff803e98bd) that is stalling ?

I was wondering if avoiding a dependancy would help :

As we dont have TLS support in kernel yet, I was considering trying (just forexperimentation) to stick a struct rt_cache_stat in pda, since it avoids one step.


#if defined(RT_CACHE_STAT_IN_PDA)
# define RT_CACHE_STAT_INC(field) add_pda(rt_cache_stat.field, 1)
# define addr_of_rt_cache_stat(cpu) &cpu_pda(cpu)->rt_cache_stat
#else
  static DEFINE_PER_CPU(struct rt_cache_stat, rt_cache_stat);
# define RT_CACHE_STAT_INC(field) (__raw_get_cpu_var(rt_cache_stat).field++)
# define addr_of_rt_cache_stat(cpu) &per_cpu(rt_cache_stat, cpu)
#endif

so that RT_CACHE_STAT_INC(field) would map to

   addl $1,%gs:OFFSET  /* no register needed */

Thank you
Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: [x86_64] Strange oprofile results on access to per_cpu data
  - From: Andi Kleen <[email protected]>

References:
- [PATCH 0/7] x86 paravirtualization infrastructure
  - From: Chris Wright <[email protected]>
- Re: [PATCH 6/7] Add APIC accessors to paravirt-ops.
  - From: Andi Kleen <[email protected]>
- [PATCH 1/7] paravirtualization: header and stubs for paravirtualizing critical operations
  - From: Rusty Russell <[email protected]>
- Re: [PATCH 1/7] paravirtualization: header and stubs for paravirtualizing critical operations
  - From: Andi Kleen <[email protected]>

Prev by Date: Re: 2.6.19-rc4 - tifm_7xx1 does not work after suspend-to-disk
Next by Date: Re: [2.6.18] Suspend to ram and SATA
Previous by thread: Re: [PATCH 1/7] paravirtualization: header and stubs for paravirtualizing critical operations
Next by thread: Re: [x86_64] Strange oprofile results on access to per_cpu data
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]