Re: [00/17] Large Blocksize Support V3

On (27/04/07 20:05), Nick Piggin didst pronounce:
> Christoph Hellwig wrote:
> >On Thu, Apr 26, 2007 at 05:48:12PM +1000, Nick Piggin wrote:
> >
> >>>Well maybe you could explain what you want. Preferably without 
> >>>redefining the established terms?
> >>
> >>Support for larger buffers than page cache pages.
> >
> >
> >I don't think you really want this :)  The whole non-pagecache I/O
> >path before 2.3 was a total pain just because it used buffers to drive
> >I/O.  Add to that buffers bigger than a page and you add another
> >two magnitudes of complexity.  If you want to see a mess like that
> >download one of the early XFS/Linux releases that had an I/O path
> >like that.  I _really_ _really_ don't want to go there.
> 
> I'm not actually suggesting to add anything like that. But I think
> larger blocks can be doable while retaining the "buffer" layer as a
> relatively simple pagecache to block translation.
> 
> Anyway, I'm working on patches... they might crash and burn, but we
> might have something to talk about later.
> 
> 
> >Linux has a long tradition of trading a tiny bit of efficiency for
> >much cleaner code, and I'd for 100% go down Christoph's route here.
> >Then again I'd actually be rather surprised if > page buffers
> >were more efficient - you'd run into shitloads of overhead due to
> >them being non-contiguous like calling vmap all over the place,
> >reprogramming iommus to at least make them look virtually contiguous [1],
> >etc..
> 
> I still think hardware should work reasonably well with 4K pages. The
> SGI io controllers and/or the Linux block layer that doesn't allow more
> than 128 sg entries is clearly suboptimal if the hardware runs twice as
> fast with 2MB submissions.
> 
> 
> >I also don't quite get what your problem with higher order allocations
> >is.  order 1 allocations are generally just fine, and in fact
> >thread stacks are >= order 1 on most architectures.  And if the pagecache
> >uses higher order allocations that means we'll finally fix our problems
> >with them, which we have to do anyway.  Workloads continue to grow and
> >with them the kernel overhead to manage them, while the pagesize for
> >many architectures is fixed.  So we'll have to deal with order 1
> >and order 2 allocations better just for backing kmalloc and co.
> 
> The pagecache is much bigger and often sees a lot more activity than these
> other things though. Also, the more things you add to higher order
> allocations, the more pressure you have.
> 
> I like PAGE_SIZE pagecache, because it is reliable and really fast, if
> you need to reclaim a page it should be almost O(1).
> 
> 
> >Or think jumboframes for that matter.
> 
> They can actually run into problems if the hardware wants contiguous
> memory.
> 
> I don't know why you think the fragmentation issues are just magically
> fixed. It is hard and inefficient to reclaim larger order blocks (even
> with lumpy reclaim), and Mel's patches aren't perfect. Actually, last
> time I looked, they needed to keep at least 16MB of pages free to be
> reasonably effective (or do we just say that people with less than XMB
> of memory shouldn't be accessing these filesystems anyway?)

It'll work without adjusting min_free_kbytes at all. Keeping 16MB free gave
better results after the fragmentation stress tests, but the difference was a
few percent of memory when allocating as huge pages, not a case of it falling
apart without the setting. The success rates were still far higher than with
the vanilla kernel.
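
To put the small-memory concern quoted above in numbers, here is a rough
sketch of what a 16MB free reserve costs as a share of RAM. The RAM sizes
are hypothetical examples picked for illustration, not the machines used in
the stress tests, and the helper name reserve_fraction is made up for this
sketch.

```python
def reserve_fraction(min_free_kbytes: int, total_ram_kbytes: int) -> float:
    """Return the share of RAM kept free by a given min_free_kbytes value."""
    if total_ram_kbytes <= 0:
        raise ValueError("total_ram_kbytes must be positive")
    return min_free_kbytes / total_ram_kbytes


# 16MB expressed in the unit that vm.min_free_kbytes uses (kilobytes).
SIXTEEN_MB_IN_KB = 16 * 1024

# Hypothetical machine sizes, chosen only to show how the share shrinks.
for ram_mb in (256, 512, 1024, 4096):
    share = reserve_fraction(SIXTEEN_MB_IN_KB, ram_mb * 1024)
    print(f"{ram_mb:5d} MB RAM: {share * 100:.2f}% held free")
```

On a running system the current value can be read from
/proc/sys/vm/min_free_kbytes and passed in as the first argument.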

>, and I'm
> not sure if they have been tested for long term stability in the
> presence of a reasonable amount of higher order allocations.
> 

I don't have a sample workload that has a reasonable amount of higher order
allocations over a longer period of time. When the next -mm comes out, SLUB
will be able to use high-order pages, so I'll boot my machine with less memory
to pressure it more. Assuming the kernel boots on my desktop machine, I should
get some idea of what its long-term behaviour looks like.
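
One way to watch long-term behaviour during a run like that is to sample
/proc/buddyinfo periodically and track how many free pages still sit in
high-order blocks. The following is only a sketch of that idea, not a tool
used for the results above. The function names and the sample line are made
up for illustration, and it assumes the usual layout of one line per zone
with one free-block count per order, starting at order 0.

```python
import os


def parse_buddyinfo(text):
    """Map (node, zone) to the list of free-block counts, indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Assumed shape: Node <n>, zone <name> <order 0 count> <order 1 count> ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(count) for count in parts[4:]]
    return zones


def free_pages_at_or_above(counts, min_order):
    """Count free base pages held in blocks of at least min_order."""
    # A free block of a given order covers 2**order base pages.
    return sum(n << order for order, n in enumerate(counts) if order >= min_order)


# Illustrative sample line, not captured from a real machine.
SAMPLE = "Node 0, zone   Normal      1      1      1      0      2      1      1      0      1      1      3"

if __name__ == "__main__":
    text = SAMPLE
    if os.path.exists("/proc/buddyinfo"):
        with open("/proc/buddyinfo") as f:
            text = f.read()
    for (node, zone), counts in sorted(parse_buddyinfo(text).items()):
        high = free_pages_at_or_above(counts, 1)
        total = free_pages_at_or_above(counts, 0)
        print(f"node {node} zone {zone}: {high} of {total} free pages in order >= 1 blocks")
```

Logging that output at intervals while the machine is under memory pressure
gives a crude picture of whether high-order blocks stay available over time.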

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab