Re: [00/41] Large Blocksize Support V7 (adds memmap support)

On Tuesday 11 September 2007 05:26:05 Nick Piggin wrote:
> On Wednesday 12 September 2007 04:31, Mel Gorman wrote:
> > On Tue, 2007-09-11 at 18:47 +0200, Andrea Arcangeli wrote:
> > > Hi Mel,
> >
> > Hi,
> >
> > > On Tue, Sep 11, 2007 at 04:36:07PM +0100, Mel Gorman wrote:
> > > > that increasing the pagesize like what Andrea suggested would lead to
> > > > internal fragmentation problems. Regrettably we didn't discuss Andrea's
> > >
> > > The config_page_shift guarantees that kernel stacks, or any other
> > > non-defragmentable allocations, go into the same 64k "not
> > > defragmentable" page. This is unlike the SGI design, where an 8k
> > > kernel stack could be allocated in the first 64k page, and then
> > > another 8k stack could be allocated in the next 64k page, effectively
> > > pinning all 64k pages until Nick's worst-case scenario triggers.
> >
> > In practice, it's pretty difficult to trigger. Buddy allocators always
> > try and use the smallest possible sized buddy to split. Once a 64K is
> > split for a 4K or 8K allocation, the remainder of that block will be
> > used for other 4K, 8K, 16K, 32K allocations. The situation where
> > multiple 64K blocks get split does not occur.
> >
> > Now, the worst case scenario for your patch is that a hostile process
> > allocates a large amount of memory and mlocks() one 4K page per 64K chunk
> > (this is unlikely in practice I know). The end result is you have many
> > 64KB regions that are now unusable because 4K is pinned in each of them.
> > Your approach is not immune from problems either. To me, only Nick's
> > approach is bullet-proof in the long run.
> 
> One important thing I think in Andrea's case, the memory will be accounted
> for (eg. we can limit mlock, or work within various memory accounting things).
> 
> With fragmentation, I suspect it will be much more difficult to do this. It
> would be another layer of heuristics that will also inevitably go wrong
> at times if you try to limit how much "fragmentation" a process can do.
> Quite likely it is hard to make something even work reasonably well in
> most cases.
> 
> 
> > >  We can still try to save some memory by
> > > defragging the slab a bit, but it's by far *not* required with
> > > config_page_shift. No defrag at all is required in fact.
> >
> > You will need some sort of defragmentation to deal with internal
> > fragmentation. It's a very similar problem to blasting away at slab
> > pages and still not being able to free them because objects are in use.
> > Replace "slab" with "large page" and "object" with "4k page" and the
> > issues are similar.
> 
> Well yes, and slab has issues today too with internal fragmentation,
> targeted reclaim and some (small) higher order allocations.
> But at least with config_page_shift, you don't introduce _new_ sources
> of problems (eg. coming from pagecache or other allocs).
> 
> Sure, there are some other things -- like pagecache can actually use
> up more memory instead -- but there are a number of other positives
> that Andrea's has as well. It is using order-0 pages, which are first class
> throughout the VM; they have per-cpu queues, and do not require any
> special reclaim code. They also *actually do* reduce the page
> management overhead in the general case, unlike higher order pcache.
> 
> So combined with the accounting issues, I think it is unfair to say that
> Andrea's is just moving the fragmentation to internal. It has a number
> of upsides. I have no idea how it will actually behave and perform, mind
> you ;)
> 
> 
> > > Plus there's a cost in defragging and freeing cache... the more you
> > > need defrag, the slower the kernel will be.
> > >
> > > > approach in depth.
> > >
> > > Well it wasn't my fault if we didn't discuss it in depth though.
> >
> > If it's my fault, sorry about that. It wasn't my intention.
> 
> I think it did get brushed aside a little quickly too (not blaming anyone).
> Maybe because Linus was hostile. But *if* the idea is that page
> management overhead has or will become a problem that needs fixing,
> then neither higher order pagecache, nor (obviously) fsblock, fixes this
> properly. Andrea's most definitely has the potential to.
> 

Hi,

I think the fundamental problem is not fragmentation, large pages, and so on.

The problem is the VM itself.
The VM doesn't use virtual memory for the kernel's own memory; that is the
whole problem.
Although this will probably have to wait for Linux 3.0, I think the right
way to solve all these problems is to make all kernel memory vmalloced
(except for a few areas like the kernel .text).

It would remove the need for the buddy allocator and for highmem, and it
would allow allocating any amount of memory (for example, 4k stacks would
become obsolete).
It would even allow kernel memory to be swapped to disk.

This is the solution, but it is very, very hard.

Best regards,
	Maxim Levitsky

