Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

From: Arjan van de Ven <[email protected]>
Subject: Re: [RFC] [PATCH] cache pollution aware __copy_from_user_ll()
Date: Sun, 14 Aug 2005 12:35:43 +0200
Message-ID: <[email protected]>

> On Sun, 2005-08-14 at 19:22 +0900, Hiro Yoshioka wrote:
> > Thanks for your comments.
> > 
> > On 8/14/05, Arjan van de Ven <[email protected]> wrote:
> > > On Sun, 2005-08-14 at 18:16 +0900, Hiro Yoshioka wrote:
> > > > Hi,
> > > >
> > > > The following is a patch to reduce a cache pollution
> > > > of __copy_from_user_ll().
> > > >
> > > > When I run simple iozone benchmark to find a performance bottleneck of
> > > > the linux kernel, I found that __copy_from_user_ll() spent CPU cycle
> > > > most and it did many cache misses.
> > > 
> > > 
> > > however... you copy something from userspace... aren't you going to USE
> > > it? The non-termoral versions actually throw the data out of the
> > > cache... so while this part might be nice, you pay BIG elsewhere....
> > 
> > The oprofile data does not give an evidence that we pay BIG elsewhere.
> 
> 
> the problem is that the pay elsewhere is far more spread out, but not
> less. At least generally....
> 
> I can see the point of a copy_from_user_nocache() or something, for
> those cases where we *know* we are not going to use the copied data in
> the cpu (but say, only do DMA).
> But that should be explicit, not implicit, since the general case will
> be that the kernel WILL use the data. And if that's the case your change
> is a loss.... (just harder to see because the cost is spread out)

I understand the iozone is not good benchmark nor reprsents any useful
application so I did a kernel build as a simple benchmark.

What I did is
    cd /test/f1
    tar xjf ${baseDir}/src/linux-2.6.12.4.tar.bz2
    cd linux-2.6.12.4
    cp -p ${baseDir}/src/config .config
    make oldconfig
    time make -j $CPUS

The following is Top 5 of CPU cycle
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 10
0000
samples  %        app name                 symbol name
7347544  72.8296  cc1                      (no symbols)
532307    5.2763  libbz2.so.1.0.2          (no symbols)
241853    2.3973  vmlinux                  buffered_rmqueue
128552    1.2742  libc-2.3.4.so            _int_malloc
107784    1.0684  vmlinux                  page_fault
...
10749     0.1065  vmlinux                  __copy_from_user_ll
pattern12-0-cpu4-0-08150920/summary.out

Since __copy_from_user_ll is not hot spot, so we didn't see any big
performance difference. (the number is time (sec) of 5 runs)

original 2.6.12.4               real    user    system
No profiling                    532.27	1797.02	194.9
BSQ 0x200+0x3f                  620.15	2094.21	212.38
GLOBAL_POWER_EVENTS:100000:	586.01	1984.92	215.97

cache aware 2.6.12.4            real    user    system
No profiling                    526.65	1792.22	190.05
BSQ 0x200+0x3f                  615.51	2090.74	206.58
GLOBAL_POWER_EVENTS:100000:     587.69	1978.66	209.18

Now Top 5 of Memory Access (2.6.12.4)
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x3f (multiple flags) count 3000
Counted BSQ_CACHE_REFERENCE events (cache references seen by the bus
unit) with a unit mask of 0x200 (read 3rd level cache miss) count 3000
samples  %        samples  %        app name             symbol name
11439689 82.2135  33906    27.9328  cc1                  (no symbols)
277177    1.9920  347       0.2859  libc-2.3.4.so        _int_malloc
229593    1.6500  12946    10.6653  libbz2.so.1.0.2      (no symbols)
84348     0.6062  116       0.0956  libc-2.3.4.so        _int_free
83653     0.6012  438       0.3608  libc-2.3.4.so        calloc
...
8527      0.0613  1648      1.3577  vmlinux              __copy_from_user_ll

Top 5 of Cache miss
33906   27.9328 cc1                     (no symbols)
30849   25.4144 vmlinux                 buffered_rmqueue
12946   10.6653 libbz2.so.1.0.2         (no symbols)
9178    7.5611  vmlinux                 __copy_to_user_ll
2934    2.4171  oprofiled               (no symbols)
...
1648    1.3577  vmlinux                 __copy_from_user_ll
pattern12-0-cpu4-0-08150917

Cache aware 2.6.12.4, Top 5 of Memory Access
samples  %        samples  %        app name             symbol name
11448487 82.8100  32786    28.1051  cc1                  (no symbols)
276812    2.0023  256       0.2195  libc-2.3.4.so        _int_malloc
230177    1.6649  12371    10.6048  libbz2.so.1.0.2      (no symbols)
84485     0.6111  120       0.1029  libc-2.3.4.so        _int_free
84043     0.6079  473       0.4055  libc-2.3.4.so        calloc
...
18282     0.1322  9060      7.7665  vmlinux              __copy_from_user_ll

Top 5 of Cache miss
32786   28.1051 cc1                     (no symbols)
31175   26.7241 vmlinux                 buffered_rmqueue
12371   10.6048 libbz2.so.1.0.2         (no symbols)
9060    7.7665  vmlinux                 __copy_from_user_ll
2801    2.4011  oprofiled               (no symbols)
...
0            0  vmlinux                 __copy_to_user_ll
pattern12-0-cpu4-0-08151048

Cache miss of __copy_from_user_ll has been increased but
__copy_to_user_ll has been decreased to 0. (oprofile could not get a
sample.)

I don't know the reason why __copy_to_user_ll has been decreased.

Anyway we could not find the cache aware version of __copy_from_user_ll
has a big regression yet.

What do you think?
  Hiro
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]
  Powered by Linux