There is a fundamental deadlock associated with paging: writing out a
page to free memory can itself require free memory to complete. The
usual solution is to keep a small amount of memory available at all
times so we can overcome this problem. This however assumes the amount
of memory needed for writeout is (constant and) smaller than the
provided reserve.

It is this latter assumption that breaks when doing writeout over the
network. The network can take up an unspecified amount of memory while
we wait for a reply to our write request. This re-introduces the
deadlock; we might never complete the writeout, for we might not have
enough memory to receive the completion message.

The proposed solution is simple: only allow traffic servicing the VM to
make use of the reserves.

This however implies you know what packets are for whom, which
generally speaking you don't. Hence we need to receive all packets, but
discard a packet as soon as we find that it was allocated from the
reserves and is not bound for the VM.

Knowing that a packet is headed towards the VM also needs a little
help, hence we introduce the socket flag SOCK_VMIO to mark sockets
with.

Of course, since we are paging, all this has to happen in kernel-space,
since user-space might just not be there.

Since packet processing might also require memory, this also implies
that those auxiliary allocations may use the reserves when an emergency
packet is processed. This is accomplished by using PF_MEMALLOC.

How much memory to reserve is also an issue: enough memory to saturate
both the route cache and IP fragment reassembly, along with various
constants.

This patch-set comes in 6 parts:

 1) introduce the memory reserve and make the SLAB allocator play nice
    with it.
    patches 01-10

 2) add some needed infrastructure to the network code
    patches 11-13

 3) implement the idea outlined above
    patches 14-20

 4) teach the swap machinery to use generic address_spaces
    patches 21-24

 5) implement swap over NFS using all the new stuff
    patches 25-31

 6) implement swap over iSCSI
    patches 32-40

Patches can also be found here:
  http://programming.kicks-ass.net/kernel-patches/vm_deadlock/v12/

If I receive no feedback, I will assume the various maintainers do not
object, and I will respin the series against -mm and submit it for
inclusion.

There is interest in this feature from the stateless Linux world; that
is, both the virtualization world and the cluster world.

I have been contacted by various groups; some have just expressed their
interest, others have been testing this work in their environments.

Various hardware vendors have also expressed interest and, of course,
my employer finds it important enough to have me work on it.

Also, while it doesn't present a full-fledged reserve-based allocator
API yet, it does lay most of the groundwork for it. There is a
GFP_NOFAIL elimination project wanting to use this as a foundation.
Elimination of GFP_NOFAIL will greatly improve the basic soundness and
stability of the code that currently uses that construct, which is most
disk based filesystems.
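
To make the receive-side rule above concrete, here is a minimal userspace sketch in C. It is not code from the patches and does not use any kernel API: pool_model, sock_model, pkt_model and the pkt_* helpers are hypothetical names made up for illustration, and the vmio field only stands in for a socket marked with SOCK_VMIO. It assumes a buffer is taken from the reserve only once normal memory is exhausted, and that the destination socket is known only after the packet has been received.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model types; these are NOT the kernel's structures. */
struct sock_model { bool vmio; };          /* true ~ socket marked SOCK_VMIO */
struct pool_model { size_t normal_free; size_t reserve_free; };
struct pkt_model  { bool from_reserve; };  /* remembers where the buffer came from */

/* Receive side: always try to take a buffer, falling back to the reserve. */
static bool pkt_alloc(struct pool_model *pool, struct pkt_model *pkt)
{
    if (pool->normal_free > 0) {
        pool->normal_free--;
        pkt->from_reserve = false;
        return true;
    }
    if (pool->reserve_free > 0) {
        pool->reserve_free--;
        pkt->from_reserve = true;
        return true;
    }
    return false;                          /* nothing left at all */
}

/* Give the buffer back to the pool it was taken from. */
static void pkt_free(struct pool_model *pool, const struct pkt_model *pkt)
{
    if (pkt->from_reserve)
        pool->reserve_free++;
    else
        pool->normal_free++;
}

/*
 * Called once the destination socket is known. A reserve-backed packet
 * is only kept if it is bound for a VM socket; otherwise it is dropped
 * immediately so the reserve stays available for writeout traffic.
 * Returns true if the packet was delivered, false if it was dropped.
 */
static bool pkt_deliver(struct pool_model *pool, const struct pkt_model *pkt,
                        const struct sock_model *dst)
{
    assert(dst != NULL);
    if (pkt->from_reserve && !dst->vmio) {
        pkt_free(pool, pkt);
        return false;
    }
    return true;
}
```

With one normal and one reserve buffer in the pool, a packet for an ordinary socket is delivered while normal memory lasts. Once only the reserve is left, the same packet is received, identified, and dropped, and its buffer goes back to the reserve. A packet for a vmio socket is kept in that state.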