Re: [00/17] [RFC] Virtual Compound Page Support

Hi Christoph,

On 19 Sep 2007, at 04:36, Christoph Lameter wrote:
Currently there is a strong tendency to avoid larger page allocations in
the kernel because of past fragmentation issues and the current
defragmentation methods are still evolving. It is not clear to what extent
they can provide reliable allocations for higher order pages (plus the
definition of "reliable" seems to be in the eye of the beholder).

Currently we use vmalloc allocations in many locations to provide a safe
way to allocate larger arrays. That is due to the danger of higher order
allocations failing. Virtual Compound pages allow the use of regular
page allocator allocations that will fall back only if there is an actual
problem with acquiring a higher order page.

This patch set provides a way for a higher order page allocation to fall
back. Instead of a physically contiguous page, a virtually contiguous
page is provided. The functionality of the vmalloc layer is used to
provide the necessary page tables and control structures to establish a
virtually contiguous area.

I like this a lot. It will get rid of all the silly games we have to
play when needing both large allocations and efficient allocations where
possible. In NTFS I can then just allocate higher order pages instead of
having to mess about with the allocation size, allocating a single page
if the requested size is <= PAGE_SIZE or using vmalloc() if the size is
bigger. And it will be faster, too, because a lot of the time a higher
order page allocation will succeed with your patchset without resorting
to vmalloc().

So where I currently have the below mess in fs/ntfs/malloc.h I could get
rid of it completely and just use the normal page allocator/deallocator
instead...

static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask)
{
        if (likely(size <= PAGE_SIZE)) {
                BUG_ON(!size);
                /* kmalloc() has per-CPU caches so is faster for now. */
                return kmalloc(PAGE_SIZE, gfp_mask & ~__GFP_HIGHMEM);
                /* return (void *)__get_free_page(gfp_mask); */
        }
        /* Refuse insane sizes; otherwise fall back to vmalloc(). */
        if (likely(size >> PAGE_SHIFT < num_physpages))
                return __vmalloc(size, gfp_mask, PAGE_KERNEL);
        return NULL;
}
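
With GFP_VFALLBACK the whole thing should collapse to something like
the below. (A rough sketch only: I am assuming that GFP_VFALLBACK can
simply be OR-ed into the gfp mask, and that vmalloc_address() from your
patch set also copes with physically contiguous pages; if not, a page
flag test would be needed.)

static inline void *__ntfs_malloc(unsigned long size, gfp_t gfp_mask)
{
        struct page *page;

        BUG_ON(!size);
        /*
         * Try physically contiguous first; the page allocator itself
         * falls back to a virtually contiguous area only when
         * fragmentation prevents that.
         */
        page = alloc_pages(gfp_mask | GFP_VFALLBACK, get_order(size));
        if (!page)
                return NULL;
        /* Assumed to work for both physical and virtual backing. */
        return vmalloc_address(page);
}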

And other places in the kernel can make use of the same. I think XFS
does very similar things to NTFS in terms of larger allocations, at
least, and there are probably more places that I cannot think of off
the top of my head...

I am looking forward to your patchset going into mainline.  (-:

Best regards,

	Anton

Advantages:

- If higher order allocations are failing then virtual compound pages
  consisting of a series of order-0 pages can stand in for those
  allocations.

- "Reliability" as long as the vmalloc layer can provide virtual mappings.

- Ability to reduce the use of the vmalloc layer significantly by using
  physically contiguous memory instead of virtually contiguous memory.
  Most uses of vmalloc() can be converted to page allocator calls.

- The use of physically contiguous memory instead of vmalloc may allow
  the use of larger TLB entries, thus reducing TLB pressure. It also
  reduces the need for page table walks.

Disadvantages:

- In order to use the fallback, the logic accessing the memory must be
  aware that the memory could be backed by a virtual mapping and take
  precautions. virt_to_page() and page_address() may not work and
  vmalloc_to_page() and vmalloc_address() (introduced through this
  patch set) may have to be called (see the sketch after this list).

- Virtual mappings are less efficient than physical mappings.
  Performance will drop once virtual fall back occurs.

- Virtual mappings have more memory overhead. vm_area control
  structures, page tables, page arrays etc. need to be allocated and
  managed to provide virtual mappings.
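
For example, address translation for memory that may be virtually
backed could be wrapped as follows (a sketch; maybe_virtual_to_page()
is a hypothetical helper, and the VMALLOC_START/VMALLOC_END range test
may need more care on some architectures):

static struct page *maybe_virtual_to_page(void *addr)
{
        unsigned long a = (unsigned long)addr;

        /* Virtually mapped fallback area: walk the page tables. */
        if (a >= VMALLOC_START && a < VMALLOC_END)
                return vmalloc_to_page(addr);
        /* Physically contiguous: direct-mapping arithmetic works. */
        return virt_to_page(addr);
}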

The patchset provides this functionality in stages. Stage 1 introduces
the basic fallback mechanism necessary to replace vmalloc allocations
with

	alloc_pages(GFP_VFALLBACK, order)

which signifies to the page allocator that a higher order allocation is
wanted, but that a virtual mapping may stand in if there is an issue
with fragmentation.

Stage 1 functionality does not allow allocation and freeing of virtual
mappings from interrupt contexts.

The stage 1 series ends with the conversion of a few key uses of vmalloc
in the VM to alloc_pages(): the allocation of sparsemem's memmap table
and the wait table in each zone. Other uses of vmalloc could be
converted in the same way.
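
Such a conversion might look like the sketch below (illustrative only:
wait_table_bytes is a made-up name, and the alloc_pages() plus
vmalloc_address() pairing assumes vmalloc_address() copes with the
physically contiguous case as well):

	struct page *page;

	/* Before: always pay the vmalloc overhead. */
	zone->wait_table = vmalloc(wait_table_bytes);

	/* After: physically contiguous when possible; the allocator
	 * falls back to a virtually contiguous area on its own. */
	page = alloc_pages(GFP_VFALLBACK, get_order(wait_table_bytes));
	zone->wait_table = page ? vmalloc_address(page) : NULL;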


Stage 2 functionality enhances the fallback even more, allowing
allocations and frees in interrupt context.

SLUB is then modified to use the virtual mappings for slab caches
that are marked with SLAB_VFALLBACK. If a slab cache is marked this way
then we drop all the restraints regarding page order and allocate
large memory areas that fit lots of objects, so that we rarely
have to use the slow paths.

Two slab caches, the dentry cache and the buffer_heads, are then flagged
that way. Others could be converted in the same way.
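
Flagging a cache would then presumably be just a matter of OR-ing in
the new flag at creation time, along these lines (a sketch with made-up
cache parameters; SLAB_VFALLBACK as introduced by this patch set):

	struct kmem_cache *cache;

	/* SLAB_VFALLBACK: let SLUB pick large, object-dense orders
	 * and fall back to virtual mappings under fragmentation. */
	cache = kmem_cache_create("my_cache", sizeof(struct my_object),
				  0, SLAB_VFALLBACK, NULL);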

The patch set also provides a debugging aid through setting

	CONFIG_VFALLBACK_ALWAYS

If set then all GFP_VFALLBACK allocations fall back to the virtual
mappings. This is useful for verification tests. The testing of this
patch set was done by enabling that option and compiling a kernel.


Stage 3 functionality could be adding support for the large buffer
size patchset. That is not done yet, and I am not sure if it would be
useful to do.

Much of this patchset may only be needed for special cases in which the
existing defragmentation methods fail for some reason. It may be better
to have the system operate without such a safety net and make sure that
the page allocator can return large orders in a reliable way.

The initial idea for this patchset came from Nick Piggin's fsblock
and from his arguments about reliability and guarantees. Since his
fsblock uses the virtual mappings I think it is legitimate to
generalize the use of virtual mappings to support higher order
allocations in this way. The application of these ideas to the large
block size patchset etc. is straightforward. If wanted, I can base
the next rev of the largebuffer patchset on this one and implement
fallback.

Contrary to Nick, I still doubt that any of this provides a "guarantee".
Having said that, I have to deal with various failure scenarios in the
VM daily and I'd certainly like to see it work in a more reliable manner.

IMHO getting rid of the various workarounds to deal with the small 4k
pages and avoiding additional layers that group these pages in
subsystem-specific ways is something that can simplify the kernel and
make the kernel more reliable overall.

If people feel that a virtual fall back is needed then so be it. Maybe
we can shed our security blanket later when the approaches to deal
with fragmentation have matured.

The patch set is also available via git from the largeblock git tree via

git pull git://git.kernel.org/pub/scm/linux/kernel/git/christoph/largeblocksize.git vcompound


Best regards,

	Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
