On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> becoming dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
>
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
>
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
>
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
>
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
>
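
To make the two steps above concrete, here is a minimal userspace sketch of
the per-inode dirty node map and the cpuset filter. It is not taken from the
patches: struct inode_sketch, inode_mark_dirty_node() and
inode_needs_cpuset_writeback() are hypothetical names, and a single
unsigned long stands in for a real nodemask.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-inode map of nodes holding dirty pages. */
struct inode_sketch {
    unsigned long dirty_nodes;   /* bit n set: node n has dirty pages of this inode */
};

/* Record that a page of this inode was dirtied on the given node. */
static void inode_mark_dirty_node(struct inode_sketch *inode, unsigned int node)
{
    assert(node < sizeof(inode->dirty_nodes) * CHAR_BIT);
    inode->dirty_nodes |= 1UL << node;
}

/* An inode qualifies for cpuset writeback if its dirty nodes overlap the cpuset. */
static bool inode_needs_cpuset_writeback(const struct inode_sketch *inode,
                                         unsigned long cpuset_nodes)
{
    return (inode->dirty_nodes & cpuset_nodes) != 0;
}
```

The check is a single mask intersection per inode, so the cost sits in keeping
the map up to date, not in the selection itself.
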
> After we have the cpuset throttling in place we can then make
> further fixups:
>
> A. We can do inode based writeout from direct reclaim
> avoiding single page writes to the filesystem.
>
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
> from the available pages in a node. This allows us to
> accurately calculate the dirty ratio even if large portions
> of the node have been allocated for huge pages or for
> slab pages.
What about mlock'ed pages?
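
To illustrate the question, a rough userspace sketch of the limit calculation
follows. It assumes the limit is a percentage of the pages left on the cpuset's
nodes after subtracting the unreclaimable ones. The struct, the function names
and the mlocked field are hypothetical and not taken from the patches;
subtracting mlock'ed pages is only the suggestion made here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical per-node page counters; field names are illustrative. */
struct node_pages {
    unsigned long total;          /* all pages on the node */
    unsigned long unreclaimable;  /* huge pages and slab, as NR_UNRECLAIMABLE would count */
    unsigned long mlocked;        /* mlock'ed pages: an assumption, not in the proposal */
};

/*
 * Sum the pages that can hold dirty data over the nodes of a cpuset.
 * Bit n of cpuset_nodes set means node n belongs to the cpuset.
 */
static unsigned long cpuset_dirtyable_pages(const struct node_pages *nodes,
                                            size_t nr_nodes,
                                            unsigned long cpuset_nodes,
                                            int subtract_mlocked)
{
    unsigned long sum = 0;

    assert(nodes != NULL);
    for (size_t n = 0; n < nr_nodes && n < sizeof(cpuset_nodes) * CHAR_BIT; n++) {
        if (!(cpuset_nodes & (1UL << n)))
            continue;

        unsigned long pinned = nodes[n].unreclaimable;
        if (subtract_mlocked)
            pinned += nodes[n].mlocked;

        /* Skip the node if the counters are inconsistent, to avoid underflow. */
        if (pinned < nodes[n].total)
            sum += nodes[n].total - pinned;
    }
    return sum;
}

/* Dirty limit in pages for a ratio given in percent. */
static unsigned long cpuset_dirty_limit(unsigned long dirtyable,
                                        unsigned int ratio_percent)
{
    return dirtyable * ratio_percent / 100;
}
```

With 1000 pages on a node, 400 of them unreclaimable and 100 mlock'ed, a 40%
ratio gives a limit of 240 pages if mlock'ed pages are ignored and 200 pages
if they are subtracted.
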
> There are a couple of points where some better ideas could be used:
>
> 1. The nodemask expands the inode structure significantly if the
> architecture allows a high number of nodes. This is only an issue
> for IA64. For that platform we expand the inode structure by 128 bytes
> (to support 1024 nodes). The last patch attempts to address the issue
> by using the knowledge about the maximum possible number of nodes
> determined on bootup to shrink the nodemask.
Not the prettiest indeed, no ideas though.
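
For reference, the arithmetic behind the 128 bytes, as a small sketch. It
assumes the mask is a bitmap with one bit per node, rounded up to whole
unsigned longs; nodemask_bytes() is a hypothetical helper, not a kernel
function.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Bytes needed for a bitmap with one bit per node, rounded up to whole words. */
static size_t nodemask_bytes(unsigned int nr_nodes)
{
    assert(nr_nodes > 0);

    size_t words = (nr_nodes + BITS_PER_WORD - 1) / BITS_PER_WORD;
    return words * sizeof(unsigned long);
}
```

A compile-time maximum of 1024 nodes costs 128 bytes per inode, while a
machine that boots with 4 possible nodes would need only a single word if the
mask were sized at boot.
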
> 2. The calculation of the per cpuset limits can require looping
> over a number of nodes which may bring the performance of get_dirty_limits
> near pre 2.6.18 performance (before the introduction of the ZVC counters)
> (only for cpuset based limit calculation). There is no way of keeping these
> counters per cpuset since cpusets may overlap.
Well, you gain functionality, you lose some runtime, sad but probably
worth it.
Otherwise it all looks good.
Acked-by: Peter Zijlstra <[email protected]>