Both a friend of mine (who works on a DBMS derived from PostgreSQL) and I have hit unexpected OOMs when using mlock/mlockall. After careful code reading and testing, I found that the cause is the VM's LRU algorithm treating mlocked pages as Active/Inactive, even though mlocked pages cannot be reclaimed. Mlocking many pages easily causes an imbalance between the LRU lists and the slab: the VM tends to reclaim from the Active/Inactive lists, most of whose pages are mlocked, so an OOM may be triggered even though there are in fact enough reclaimable pages in the slab. (Setting a large "vfs_cache_pressure" may help to avoid the OOM in this situation, but I think it is better to "do things right" than to depend on the "vfs_cache_pressure" tunable.)

We think it is semantically wrong to treat mlocked pages as Active/Inactive. Mlocked pages should not be counted by the page-reclaiming algorithm, because page reclaim can never affect them.

The following patch tries to fix this, with some additions. The patch brings Linux:

1. POSIX mlock/munlock/mlockall/munlockall.
   Bring mlock/munlock/mlockall/munlockall in line with the POSIX definition: transaction-like, just as described in the section 2 man pages of mlock/munlock/mlockall/munlockall. Users of the mlock family of system calls will thus always have a clear map of the mlocked areas.

2. More consistent LRU semantics in memory management.
   Mlocked pages are placed on a separate LRU list, the Wired list. These pages take no part in the LRU algorithms until they are munlocked, because they can never be swapped out.

3. The Wired (mlocked) page count is exported through /proc/meminfo.
   One line is added to /proc/meminfo: "Wired: N kB", so Linux system administrators and programmers get a clearer map of physical memory usage.

Test of the patch:

Test environment: RHEL4. Total physical memory: 256 MB, no swap. One ext3 directory ("/mnt/test") with about 256 thousand small files (2 kB each).

Step 1. Run a task mlocking 220 MB.
Step 2. Run: "find /mnt/test -size 100"

Case A.
Standard kernel.org kernel 2.6.15

Linux soon ran into OOM. Memory info at OOM time:

[root@Linux ~]# cat /proc/meminfo
MemTotal:       254248 kB
MemFree:          3144 kB
Buffers:           124 kB
Cached:           1584 kB
SwapCached:          0 kB
Active:         229308 kB
Inactive:          596 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       254248 kB
LowFree:          3144 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:         228556 kB
Slab:            20076 kB
CommitLimit:    127124 kB
Committed_AS:   238424 kB
PageTables:        584 kB
VmallocTotal:   770040 kB
VmallocUsed:       180 kB
VmallocChunk:   769844 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

Case B.
Patched 2.6.15

No OOM happened.

[root@Linux ~]# cat /proc/meminfo
MemTotal:       254344 kB
MemFree:          3508 kB
Buffers:          6352 kB
Cached:           2684 kB
SwapCached:          0 kB
Active:           7140 kB
Inactive:         4732 kB
Wired:          225284 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       254344 kB
LowFree:          3508 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:              72 kB
Writeback:           0 kB
Mapped:         229208 kB
Slab:            12552 kB
CommitLimit:    127172 kB
Committed_AS:   238168 kB
PageTables:        572 kB
VmallocTotal:   770040 kB
VmallocUsed:       180 kB
VmallocChunk:   769844 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     4096 kB

Many thanks to Mel Gorman for his book "Understanding the Linux Virtual Memory Manager". Thanks also to two other great Linux kernel books, ULK3 and LDD3. FreeBSD's VM implementation enlightened me; thanks to the FreeBSD guys.

The attachment is the full patch; the following mails are the pieces it is split into.

Shaoping Wang
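For readers who want to compare the two /proc/meminfo dumps above, here is a minimal Python sketch (not part of the patch). It parses meminfo-style text and sums the Active and Inactive sizes, so the shift of mlocked memory from the LRU lists to the Wired line is easy to see. The helper names `parse_meminfo` and `lru_summary` are made up for this illustration, the sample strings are excerpts of the dumps above, and the "Wired" field is assumed to exist only on a kernel carrying this patch.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Values are in kB for most fields; the HugePages_* counters are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line (e.g. a shell prompt)
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def lru_summary(fields):
    """Summarise the fields relevant to the LRU/slab balance discussed above."""
    return {
        "lru_kb": fields.get("Active", 0) + fields.get("Inactive", 0),
        # "Wired" is only reported by the patched kernel (assumption of this sketch).
        "wired_kb": fields.get("Wired"),
        "slab_kb": fields.get("Slab", 0),
    }


# Excerpts of the two dumps quoted in this mail.
CASE_A = """\
Active: 229308 kB
Inactive: 596 kB
Mapped: 228556 kB
Slab: 20076 kB
"""

CASE_B = """\
Active: 7140 kB
Inactive: 4732 kB
Wired: 225284 kB
Mapped: 229208 kB
Slab: 12552 kB
"""

if __name__ == "__main__":
    for label, dump in (("Case A", CASE_A), ("Case B", CASE_B)):
        print(label, lru_summary(parse_meminfo(dump)))
```

On a live system the same functions could be fed the contents of /proc/meminfo read from disk; fields that the running kernel does not report simply come back as missing.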
Attachment:
patch-2.6.15-memlock
Description: Binary data