On Mon, 19 Jun 2006 13:40:12 -0700
Stephane Eranian <[email protected]> wrote:
> Hello,
>
> The current perfmon2 API allows applications to pass vectors of arguments
> to certain calls, in particular to the 3 functions that read and write PMU
> registers. This approach was chosen because it is very flexible: an
> application can modify a single register or multiple registers in one call.
> It is also extensible because the interface embeds no knowledge of the
> actual number of registers supported by the underlying hardware.
>
> Before the actual system call runs, the argument vector must be copied
> into a kernel buffer. This is required for security reasons and so that
> user-level faults can be handled safely. The familiar copy_from_user()
> and copy_to_user() are invoked, and this must be done before interrupts
> are masked.
>
> Vectors can have different sizes depending on the measurement and on the
> PMU model, yet each vector must be copied into a kernel-level buffer.
> Today, we allocate the kernel memory on demand, sized to fit the vector,
> using kmalloc/kfree. To prevent abuse, we limit the size of the allocated
> region via a perfmon2 tunable in sysfs; by default, the limit is one page.
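>
> Schematically, the current path looks roughly like this (a sketch only;
> the tunable name pfm_arg_size_max is made up for illustration):
>
> 	#include <linux/slab.h>
> 	#include <asm/uaccess.h>
>
> 	extern size_t pfm_arg_size_max;	/* hypothetical sysfs tunable, 1 page by default */
>
> 	static long pfm_get_arg_vector(void __user *ureq, size_t sz, void **kbuf)
> 	{
> 		void *buf;
>
> 		if (sz > pfm_arg_size_max)
> 			return -EINVAL;
>
> 		buf = kmalloc(sz, GFP_KERNEL);
> 		if (!buf)
> 			return -ENOMEM;
>
> 		/* must run with interrupts enabled and no locks held */
> 		if (copy_from_user(buf, ureq, sz)) {
> 			kfree(buf);
> 			return -EFAULT;
> 		}
> 		*kbuf = buf;
> 		return 0;
> 	}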
>
> This implementation has worked fairly well, yet it costs some performance
> because kmalloc/kfree are expensive (especially kfree), and a heap
> allocation is overkill for small vectors.
>
> I have run some experiments lately and they confirmed that kmalloc/kfree
> and the user-level copies account for a very large portion of the cost of
> calls with multiple registers (I tried with 4). The copies themselves are
> hard to avoid. One thing we could do is try to shrink the structs: today,
> both struct pfarg_pmd and struct pfarg_pmc carry reserved fields so that
> we can extend them without breaking the ABI, and it may be possible to
> reduce those fields a little.
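>
> For illustration only (this is not the real layout; field names and sizes
> are invented), the idea behind the reserved fields is:
>
> 	#include <linux/types.h>
>
> 	struct pfarg_pmd {
> 		__u16	reg_num;		/* which PMD register */
> 		__u16	reg_set;		/* event set it belongs to */
> 		__u32	reg_flags;		/* input/output flags */
> 		__u64	reg_value;		/* register value */
> 		__u64	reg_reserved[16];	/* room for future extensions */
> 	};
>
> Every reserved byte is copied across the user/kernel boundary on each
> call, so trimming them shrinks the copies directly.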
>
> There are several ways to amortize or eliminate the kmalloc/kfree. First,
> it is important to understand that multiple threads may call into a
> particular context at any time; all they need is access to the file
> descriptor.
>
> An alternative I have explored starts from the hypothesis that most
> vectors are small. If a vector is small enough, we can avoid kmalloc/kfree
> by using a buffer allocated on the stack: if the vector has at most 8
> elements, say, use the stack buffer; otherwise fall back to the expensive
> kmalloc/kfree path. I tried this experiment with a threshold of 8 and got
> over 20% improvement for pfm_read_pmds(). The downside of this approach is
> that kernel stack space is limited and we should avoid placing large
> buffers on it. The pfarg_pmd struct is about 176 bytes whereas pfarg_pmc_t
> is about 48 bytes, so with 8 pfarg_pmd elements the buffer already reaches
> 1408 bytes, and that holds on all architectures, including i386 where the
> default kernel stack is 2 pages (8KB). Of course, the threshold could be
> adjusted per object type and per architecture, but even when kmalloc ends
> up being used, the stack space is still consumed.
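>
> A minimal sketch of that pattern (the threshold macro is hypothetical,
> and the locking/PMU work is elided):
>
> 	#define PFM_STK_NELEMS	8	/* stack-buffer threshold */
>
> 	static long pfm_read_pmds_sketch(struct pfarg_pmd __user *ureq, int count)
> 	{
> 		struct pfarg_pmd stk_buf[PFM_STK_NELEMS];
> 		struct pfarg_pmd *buf = stk_buf;
> 		size_t sz = count * sizeof(*buf);
> 		long ret = -EFAULT;
>
> 		if (count > PFM_STK_NELEMS) {
> 			buf = kmalloc(sz, GFP_KERNEL);	/* slow path for large vectors */
> 			if (!buf)
> 				return -ENOMEM;
> 		}
>
> 		if (copy_from_user(buf, ureq, sz))
> 			goto out;
>
> 		/* ... take the context lock, mask interrupts, read the PMDs ... */
>
> 		ret = copy_to_user(ureq, buf, sz) ? -EFAULT : 0;
> 	out:
> 		if (buf != stk_buf)	/* note: stk_buf stays consumed either way */
> 			kfree(buf);
> 		return ret;
> 	}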
>
> It is important to note that we cannot simply use a kernel buffer of one
> element and loop over the vector, because the copy_from/copy_to must be
> done with no locks held and interrupts unmasked. One would have to copy,
> lock, do the perfmon call, unlock, copy the result back, and loop to the
> next element.
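>
> Concretely, a one-element buffer would force something like the following
> (names invented), paying a lock/unlock round-trip per element and losing
> atomicity across the vector:
>
> 	struct pfarg_pmd tmp;
> 	unsigned long flags;
> 	int i;
>
> 	for (i = 0; i < count; i++) {
> 		if (copy_from_user(&tmp, ureq + i, sizeof(tmp)))
> 			return -EFAULT;
>
> 		spin_lock_irqsave(&ctx->ctx_lock, flags);	/* per element! */
> 		/* read or write the one PMD described by tmp */
> 		spin_unlock_irqrestore(&ctx->ctx_lock, flags);
>
> 		if (copy_to_user(ureq + i, &tmp, sizeof(tmp)))
> 			return -EFAULT;	/* vector is now partially applied */
> 	}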
>
> Another approach that was suggested to me is to allocate on demand but not
> to kfree systematically when the call terminates. In other words, we
> amortize the cost of the allocation by keeping the buffer around for the
> next caller. To make this work, we would have to decompose spin_lock_irq*()
> into spin_*lock() and local_irq_*able() to avoid a race condition. For the
> first caller, the buffer would be allocated to fit the vector (up to a
> certain limit, as today). When the call terminates, the buffer would be
> kept via a pointer in the perfmon context. The next caller would check the
> pointer and the size: if the buffer is big enough, the copy_user could
> proceed directly; otherwise a new buffer would be allocated. This works
> only if it is OKAY to do a copy_user with some locks held.
>
> I can see one issue with this approach: a malicious user could create lots
> of contexts and make one call on each to max out the per-context argument
> vector limit. With 1024 descriptors and a limit of 1 page per context,
> that pins down 1024 kernel pages (non-pageable) for nothing. Today, we do
> not have a global tunable for the total argument vector size; adding one
> would be costly because multiple threads could contend for it, and we
> would need yet another lock to protect it.
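>
> A sketch of that cached-buffer scheme (the context field names are
> invented, and it leans on the big assumption that copy_user under the
> context lock is acceptable):
>
> 	spin_lock(&ctx->ctx_lock);		/* lock only; irqs stay enabled */
> 	while (!ctx->arg_buf || ctx->arg_buf_size < sz) {
> 		void *nbuf;
>
> 		spin_unlock(&ctx->ctx_lock);
> 		nbuf = kmalloc(sz, GFP_KERNEL);	/* may sleep, so not under the lock */
> 		if (!nbuf)
> 			return -ENOMEM;
> 		spin_lock(&ctx->ctx_lock);
> 		if (ctx->arg_buf_size < sz) {	/* recheck: someone may have grown it */
> 			kfree(ctx->arg_buf);	/* kfree(NULL) is a no-op */
> 			ctx->arg_buf = nbuf;
> 			ctx->arg_buf_size = sz;
> 		} else {
> 			kfree(nbuf);
> 		}
> 	}
>
> 	/* the assumption: copying with the context lock held */
> 	if (copy_from_user(ctx->arg_buf, ureq, sz)) {
> 		spin_unlock(&ctx->ctx_lock);
> 		return -EFAULT;
> 	}
>
> 	local_irq_disable();			/* mask interrupts only for the PMU work */
> 	/* ... operate on ctx->arg_buf ... */
> 	local_irq_enable();
> 	spin_unlock(&ctx->ctx_lock);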
>
> I do not see another approach at this point.
>
> Does someone have something else to propose?
>
> If not, what is your opinion of the two approaches above?
>
The first approach should be fine - we do that in lots of places, such as
in core_sys_select().
Applications must be calling this thing at a heck of a rate for kfree()
overhead to matter. I trust CONFIG_DEBUG_SLAB wasn't turned on...