Re: [RFC 0/8] Cpuset aware writeback

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



> On Tue, 16 Jan 2007 14:15:56 -0800 (PST) Christoph Lameter <[email protected]> wrote:
>
> ...
>
> > > This may result in a large percentage of a cpuset
> > > to become dirty without writeout being triggered. Under NFS
> > > this can lead to OOM conditions.
> > 
> > OK, a big question: is this patchset a performance improvement or a
> > correctness fix?  Given the above, and the lack of benchmark results I'm
> > assuming it's for correctness.
> 
> It is a correctness fix both for NFS OOM and doing proper cpuset writeout.

It's a workaround for a still-unfixed NFS problem.

> > - Why does NFS go oom?  Because it allocates potentially-unbounded
> >   numbers of requests in the writeback path?
> > 
> >   It was able to go oom on non-numa machines before dirty-page-tracking
> >   went in.  So a general problem has now become specific to some NUMA
> >   setups.
> 
> 
> Right. The issue is that large portions of memory become dirty / 
> writeback since no writeback occurs because dirty limits are not checked 
> for a cpuset. Then NFS attempt to writeout when doing LRU scans but is 
> unable to allocate memory.
>  
> >   So an obvious, equivalent and vastly simpler "fix" would be to teach
> >   the NFS client to go off-cpuset when trying to allocate these requests.
> 
> Yes we can fix these allocations by allowing processes to allocate from 
> other nodes. But then the container function of cpusets is no longer 
> there.

But that's what your patch already does!

It asks pdflush to write the pages instead of the direct-reclaim caller. 
The only reason pdflush doesn't go oom is that pdflush lives outside the
direct-reclaim caller's cpuset and is hence able to obtain those nfs
requests from off-cpuset zones.

> > (But is it really bad? What actual problems will it cause once NFS is fixed?)
> 
> NFS is okay as far as I can tell. dirty throttling works fine in non 
> cpuset environments because we throttle if 40% of memory becomes dirty or 
> under writeback.

Repeat: NFS shouldn't go oom.  It should fail the allocation, recover, wait
for existing IO to complete.  Back that up with a mempool for NFS requests
and the problem is solved, I think?

> > I don't understand why the proposed patches are cpuset-aware at all.  This
> > is a per-zone problem, and a per-zone fix would seem to be appropriate, and
> > more general.  For example, i386 machines can presumably get into trouble
> > if all of ZONE_DMA or ZONE_NORMAL get dirty.  A good implementation would
> > address that problem as well.  So I think it should all be per-zone?
> 
> No. A zone can be completely dirty as long as we are allowed to allocate 
> from other zones.

But we also can get into trouble if a *zone* is all-dirty.  Any solution to
the cpuset problem should solve that problem too, no?

> > Do we really need those per-inode cpumasks?  When page reclaim encounters a
> > dirty page on the zone LRU, we automatically know that page->mapping->host
> > has at least one dirty page in this zone, yes?  We could immediately ask
> 
> Yes, but when we enter reclaim most of the pages of a zone may already be 
> dirty/writeback so we fail.

No.  If the dirty limits become per-zone then no zone will ever have >40%
dirty.

The obvious fix here is: when a zone hits 40% dirty, perform dirty-memory
reduction in that zone, throttling the dirtying process.  I suspect this
would work very badly in common situations with, say, typical i386 boxes.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux