Re: [RFC][PATCH] VM deadlock prevention core -v3

On Thu, Aug 10, 2006 at 04:46:31PM +0200, Peter Zijlstra ([email protected]) wrote:
> On Thu, 2006-08-10 at 18:02 +0400, Evgeniy Polyakov wrote:
> > On Thu, Aug 10, 2006 at 03:32:49PM +0200, Peter Zijlstra ([email protected]) wrote:
> > > Hi,
> > 
> > Hello, Peter.
> > 
> > > So I try again, please tell me if I'm still on crack and should go detox.
> > > However if you do so, I kindly request some words on the how and why of it.
> > 
> > I think you should talk with doctor in that case, but not with kernel
> > hackers :)
> > 
> > I have some comments about implementation, not overall design, since we
> > have slightly diametral points of view there.
> > 
> > ....
> > 
> > > --- linux-2.6.orig/net/core/skbuff.c
> > > +++ linux-2.6/net/core/skbuff.c
> > > @@ -43,6 +43,7 @@
> > >  #include <linux/kernel.h>
> > >  #include <linux/sched.h>
> > >  #include <linux/mm.h>
> > > +#include <linux/pagemap.h>
> > >  #include <linux/interrupt.h>
> > >  #include <linux/in.h>
> > >  #include <linux/inet.h>
> > > @@ -125,6 +126,8 @@ EXPORT_SYMBOL(skb_truesize_bug);
> > >   *
> > >   */
> > >  
> > > +#define ceiling_log2(x)	fls((x) - 1)
> > > +
> > >  /**
> > >   *	__alloc_skb	-	allocate a network buffer
> > >   *	@size: size to allocate
> > > @@ -147,6 +150,59 @@ struct sk_buff *__alloc_skb(unsigned int
> > >  	struct sk_buff *skb;
> > >  	u8 *data;
> > >  
> > > +	size = SKB_DATA_ALIGN(size);
> 
> I moved it here.

Yep.

> > > +
> > > +	if (gfp_mask & __GFP_MEMALLOC) {
> > > +		/*
> > > +		 * Fallback allocation for memalloc reserves.
> > > +
> 
> 		 * This allocator is build on alloc_pages() so that freed
> 		 * skbuffs return to the memalloc reserve imediately. SLAB
> 		 * memory might not ever be returned.
> 
> This was missing,... 
> 
> > > +		 * the page is populated like so:
> > > +		 *
> > > +		 *   struct sk_buff
> > > +		 *   [ struct sk_buff ]
> > > +		 *   [ atomic_t ]
> > > +		 *   unsigned int
> > > +		 *   struct skb_shared_info
> > > +		 *   char []
> > > +		 *
> > > +		 * We have to do higher order allocations for icky jumbo
> > > +		 * frame drivers :-(. They really should be migrated to
> > > +		 * scather/gather DMA and use skb fragments.
> > > +		 */
> > > +		unsigned int data_offset =
> > > +			sizeof(struct sk_buff) + sizeof(unsigned int);
> > > +		unsigned long length = size + data_offset +
> > > +			sizeof(struct skb_shared_info);
> > > +		unsigned int pages;
> > > +		unsigned int order;
> > > +		struct page *page;
> > > +		void *kaddr;
> > > +
> > > +		/*
> > > +		 * Force fclone alloc in order to fudge a lacking in skb_clone().
> > > +		 */
> > > +		fclone = 1;
> > > +		if (fclone) {
> > > +			data_offset += sizeof(struct sk_buff) + sizeof(atomic_t);
> > > +			length += sizeof(struct sk_buff) + sizeof(atomic_t);
> > > +		}
> > > +		pages = (length + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > +		order = ceiling_log2(pages);
> > > +		skb = NULL;
> > > +		if (!(page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order)))
> > > +			goto out;
> > > +
> > > +		kaddr = pfn_to_kaddr(page_to_pfn(page));
> > > +		skb = (struct sk_buff *)kaddr;
> > > +
> > > +		*((unsigned int *)(kaddr + data_offset -
> > > +					sizeof(unsigned int))) = order;
> > > +		data = (u8 *)(kaddr + data_offset);
> > > +
> > 
> > Tricky, but since you are using own allocator here, you could change it to
> > be not so aggressive - i.e. do not round size to number of pages.
> 
> I'm not sure I follow you, I'm explicitly using
> alloc_pages()/free_page(), if
> I were to go smart here, I'd loose the whole reason for doing so.

You can use page to put there several skbs for example or at least add
there a fclone (fast clone).

> > 
> > > +		goto allocated;
> > > +	}
> > > +
> > >  	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;
> > >  
> > >  	/* Get the HEAD */
> > > @@ -155,12 +211,13 @@ struct sk_buff *__alloc_skb(unsigned int
> > >  		goto out;
> > >  
> > >  	/* Get the DATA. Size must match skb_add_mtu(). */
> > > -	size = SKB_DATA_ALIGN(size);
> > 
> > Bad sign.
> 
> See above.

Yep, I've found.

> > >  	data = ____kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> > >  	if (!data)
> > >  		goto nodata;
> > >  
> > > +struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
> > > +		unsigned length, gfp_t gfp_mask)
> > > +{
> > > +	struct sk_buff *skb;
> > > +
> > > +	WARN_ON(gfp_mask & (__GFP_NOMEMALLOC | __GFP_MEMALLOC));
> > > +	gfp_mask &= ~(__GFP_NOMEMALLOC | __GFP_MEMALLOC);
> > > +
> > > +	skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_NOMEMALLOC);
> > > +	if (skb)
> > > +		goto done;
> > > +
> > > +	if (atomic_read(&dev->rx_reserve_used) >=
> > > +			dev->rx_reserve * dev->memalloc_socks)
> > > +		goto out;
> > > +
> > > +	/*
> > > +	 * pre-inc guards against a race with netdev_wait_memalloc()
> > > +	 */
> > > +	atomic_inc(&dev->rx_reserve_used);
> > > +	skb = ___netdev_alloc_skb(dev, length, gfp_mask | __GFP_MEMALLOC);
> > > +	if (unlikely(!skb)) {
> > > +		atomic_dec(&dev->rx_reserve_used);
> > > +		goto out;
> > > +	}
> > 
> > Since you have added atomic operation in that path, you can use device's
> > reference counter instead and do not care that it can dissapear.
> 
> Is that the sole reason taking a reference on the device is bad?

Taking a reference is bad due to performance reasons, since atomic
increment is not that cheap. If you do it for one variable for the
purpose of reference counting you can use device's refcnt istead, which
will solve some races.

> > > @@ -434,6 +567,12 @@ struct sk_buff *skb_clone(struct sk_buff
> > >  		n->fclone = SKB_FCLONE_CLONE;
> > >  		atomic_inc(fclone_ref);
> > >  	} else {
> > > +		/*
> > > +		 * should we special-case skb->memalloc cloning?
> > > +		 * for now fudge it by forcing fast-clone alloc.
> > > +		 */
> > > +		BUG_ON(skb->memalloc);
> > > +
> > >  		n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
> > >  		if (!n)
> > >  			return NULL;
> > 
> > Ugh... cloning is a one of the shoulders of giant where Linux network
> > stack is staying...
> 
> Yes, I'm aware of that, I have a plan to fix this, however I haven't had
> time
> to implement it. My immediate concern is the point wrt. the net_device
> mapping.
> 
> My idea was: instead of the order, store the size, and allocate clone 
> skbuffs in the available room at the end of the page(s), allocating
> extra pages
> if needed.

You can check if requested skb with fclone fits allocated pages, and if
so use fclone magic, otherwise postpone clone allocation until it is
required.

> > > @@ -686,6 +825,8 @@ int pskb_expand_head(struct sk_buff *skb
> > >  	if (skb_shared(skb))
> > >  		BUG();
> > >  
> > > +	BUG_ON(skb->memalloc);
> > > +
> > >  	size = SKB_DATA_ALIGN(size);
> > >  
> > >  	data = kmalloc(size + sizeof(struct skb_shared_info), gfp_mask);
> > 
> > And that is a bug.
> > That operation can happen even with usual receiving processing.
> 
> Yes, more allocator work..
> 
> > > Index: linux-2.6/net/ipv4/af_inet.c
> > > ===================================================================
> > > --- linux-2.6.orig/net/ipv4/af_inet.c
> > > +++ linux-2.6/net/ipv4/af_inet.c
> > > @@ -132,6 +132,14 @@ void inet_sock_destruct(struct sock *sk)
> > >  {
> > >  	struct inet_sock *inet = inet_sk(sk);
> > >  
> > > +	if (sk_is_memalloc(sk)) {
> > > +		struct net_device *dev = ip_dev_find(inet->rcv_saddr);
> > > +		if (dev) {
> > > +			dev_adjust_memalloc(dev, -1);
> > > +			dev_put(dev);
> > > +		}
> > > +	}
> > > +
> > 
> > This looks very strange - you decrement reference counter both in socket
> > destruction code and in netdevice destruction code.
> 
> This is the right place; however I was not sure net_device destruction
> implied
> destruction of all related sockets; in case that is not so, you will
> have to
> clean up there, because this destructor will not find the device any
> longer.

Sockets can live without network devices at all, I expect it is enough
to clean up in socket destructor, since packets can come from
different devices into the same socket.

-- 
	Evgeniy Polyakov
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: [RFC][PATCH] VM deadlock prevention core -v3
  - From: Peter Zijlstra <[email protected]>

References:
- [RFC][PATCH] VM deadlock prevention core -v3
  - From: Peter Zijlstra <[email protected]>
- Re: [RFC][PATCH] VM deadlock prevention core -v3
  - From: Evgeniy Polyakov <[email protected]>
- Re: [RFC][PATCH] VM deadlock prevention core -v3
  - From: Peter Zijlstra <[email protected]>

Prev by Date: Re: [PATCH 3/14] Generic ioremap_page_range: alpha conversion
Next by Date: Re: 2.6.18-rc3-git3 - XFS - BUG: unable to handle kernel NULL pointer dereference at virtual address 00000078
Previous by thread: Re: [RFC][PATCH] VM deadlock prevention core -v3
Next by thread: Re: [RFC][PATCH] VM deadlock prevention core -v3
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]