Re: [RFC 0/8] Cpuset aware writeback

On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> becoming dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
> 
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
> 
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
> 
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
> 
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
> 
> After we have the cpuset throttling in place we can then make
> further fixups:
> 
> A. We can do inode based writeout from direct reclaim
>    avoiding single page writes to the filesystem.
> 
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

What about mlock'ed pages?
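
To make the question concrete, here is a rough userspace sketch of the
ratio calculation as I read it from the description above. It is not
taken from the patches: the struct, the function names and the mlocked
field are all made up for illustration, and whether mlocked pages
should be subtracted at all is exactly what I am asking.

```c
#include <assert.h>

/* Hypothetical per-node page counts; not a kernel structure. */
struct node_pages {
	unsigned long total;		/* all pages on the node */
	unsigned long unreclaimable;	/* huge pages, slab, ... (NR_UNRECLAIMABLE) */
	unsigned long mlocked;		/* assumption: tracked separately */
	unsigned long dirty;		/* currently dirty pages */
};

/* Pages that could become dirty page cache at all. */
static unsigned long dirtyable_pages(const struct node_pages *n)
{
	unsigned long pinned = n->unreclaimable + n->mlocked;

	return pinned < n->total ? n->total - pinned : 0;
}

/* Nonzero if the node exceeds ratio_pct percent of its dirtyable pages. */
static int over_dirty_ratio(const struct node_pages *n, unsigned int ratio_pct)
{
	unsigned long long limit;

	assert(ratio_pct <= 100);
	limit = (unsigned long long)dirtyable_pages(n) * ratio_pct / 100;
	return n->dirty > limit;
}
```

With 1000 pages, 600 unreclaimable, 200 mlocked and a 40% ratio, the
limit is 80 pages if mlocked pages are subtracted and 160 if they are
not, so a node with 100 dirty pages is throttled only in the first case.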

> There are a couple of points where some better ideas could be used:
> 
> 1. The nodemask expands the inode structure significantly if the
> architecture allows a high number of nodes. This is only an issue
> for IA64. For that platform we expand the inode structure by 128 bytes
> (to support 1024 nodes). The last patch attempts to address the issue
> by using the knowledge about the maximum possible number of nodes
> determined on bootup to shrink the nodemask.

Not the prettiest, indeed; I have no better ideas, though.
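
For reference, the 128 bytes follow directly from one bit per node. A
small userspace sketch of the arithmetic (nodemask_bytes is a made-up
helper, not the kernel's nodemask_t code):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/*
 * Bytes needed for a bitmap with one bit per node, rounded up to a
 * whole number of unsigned longs. Hypothetical helper for illustration.
 */
static size_t nodemask_bytes(unsigned int nr_nodes)
{
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;

	assert(nr_nodes > 0);
	return (nr_nodes + bits_per_long - 1) / bits_per_long
		* sizeof(unsigned long);
}
```

nodemask_bytes(1024) gives the 128 bytes quoted above, while a machine
that boots with at most 64 possible nodes would need only 8, which is
the saving the last patch is after.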

> 2. The calculation of the per cpuset limits can require looping
> over a number of nodes which may bring the performance of get_dirty_limits
> near pre 2.6.18 performance (before the introduction of the ZVC counters)
> (only for cpuset based limit calculation). There is no way of keeping these
> counters per cpuset since cpusets may overlap.

Well, you gain functionality and lose some runtime; sad, but probably
worth it.
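
The cost in question is a walk over the cpuset's nodes on every limit
calculation. A userspace sketch of that loop, under my own assumptions
(a 64-bit mask standing in for the nodemask, invented struct and
function names, nothing copied from the patches):

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NODES 64	/* assumption for the sketch: one mask bit per node */

/* Hypothetical per-node counters. */
struct node_stat {
	unsigned long dirtyable;
	unsigned long dirty;
};

struct dirty_totals {
	unsigned long dirtyable;
	unsigned long dirty;
};

/* Sum the counters of every node whose bit is set in cpuset_nodes. */
static struct dirty_totals cpuset_dirty_totals(const struct node_stat *nodes,
					       unsigned int nr_nodes,
					       uint64_t cpuset_nodes)
{
	struct dirty_totals t = { 0, 0 };
	unsigned int nid;

	assert(nr_nodes <= MAX_NODES);
	for (nid = 0; nid < nr_nodes; nid++) {
		if (!(cpuset_nodes & (UINT64_C(1) << nid)))
			continue;	/* node is outside the cpuset */
		t.dirtyable += nodes[nid].dirtyable;
		t.dirty += nodes[nid].dirty;
	}
	return t;
}
```

The work grows with the number of nodes in the cpuset, and because two
cpusets may share a node, the sums cannot simply be cached per cpuset.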

Otherwise it all looks good.
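
For the record, this is how I understand the inode selection from the
cover letter: each mapping remembers which nodes hold its dirty pages,
and writeback from within a cpuset skips mappings whose dirty nodes do
not intersect the cpuset's nodes. A userspace sketch with made-up names
and a 64-bit mask in place of the real nodemask:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-mapping map of dirty nodes. */
struct mapping {
	uint64_t dirty_nodes;	/* bit n set: node n holds a dirty page */
};

/* Record that a page of this mapping was dirtied on the given node. */
static void mapping_mark_dirty(struct mapping *m, unsigned int node)
{
	assert(node < 64);
	m->dirty_nodes |= UINT64_C(1) << node;
}

/* Nonzero if the mapping has dirty pages on any node of the cpuset. */
static int mapping_dirty_in_cpuset(const struct mapping *m,
				   uint64_t cpuset_nodes)
{
	return (m->dirty_nodes & cpuset_nodes) != 0;
}
```

A mapping dirtied only on node 3 would be picked up by a cpuset that
contains node 3 and skipped by one covering nodes 0-2.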

Acked-by: Peter Zijlstra <[email protected]>
