Re: Temporary lockup on loopback block device

On Sat, 10 Nov 2007, Andrew Morton wrote:

> On Sat, 10 Nov 2007 20:51:31 +0100 (CET) Mikulas Patocka <[email protected]> wrote:
> 
> > Hi
> > 
> > I am experiencing a transient lockup in 'D' state with a loopback device. It 
> > happens when a process writes to a filesystem on the loopback device with a 
> > command like
> > dd if=/dev/zero of=/s/fill bs=4k 
> > 
> > The CPU is idle and the disk is idle too, yet the dd process is waiting in 
> > 'D' state in congestion_wait, called from balance_dirty_pages.
> > 
> > After about 30 seconds the lockup is gone and dd resumes, but it soon locks 
> > up again.
> > 
> > I added a printk to balance_dirty_pages:
> > 
> > printk("wait: nr_reclaimable %d, nr_writeback %d, dirty_thresh %d, "
> >        "pages_written %d, write_chunk %d\n",
> >        nr_reclaimable, global_page_state(NR_WRITEBACK), dirty_thresh,
> >        pages_written, write_chunk);
> > 
> > and it shows this during the lockup:
> > 
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > wait: nr_reclaimable 3099, nr_writeback 0, dirty_thresh 2985, pages_written 1021, write_chunk 1522
> > 
> > What apparently happens:
> > 
> > writeback_inodes syncs inodes only on the given wbc->bdi, but 
> > balance_dirty_pages checks against the global counts of dirty pages. So if 
> > there is nothing to sync on the given device, but there are enough other 
> > dirty pages that the counts are over the limit, it loops without doing any 
> > work.
> > 
> > To reproduce it, you need a totally idle machine (no GUI, etc.) -- if 
> > something else writes to the backing device, it flushes the dirty pages 
> > generated by the loopback and the lockup is gone. If you add the printk, 
> > don't forget to stop klogd; otherwise the logging itself ends the lockup.
> 
> erk.
> 
> > The hotfix (which I verified to work) is to not set wbc->bdi, so that all 
> > devices are flushed ... but the code probably needs some redesign (i.e. 
> > either account per-device and flush per-device, or account globally and 
> > flush globally).
> > 
> > Mikulas
> > 
> > 
> > diff -u -r ../x/linux-2.6.23.1/mm/page-writeback.c mm/page-writeback.c
> > --- ../x/linux-2.6.23.1/mm/page-writeback.c     2007-10-12 18:43:44.000000000 +0200
> > +++ mm/page-writeback.c 2007-11-10 20:32:43.000000000 +0100
> > @@ -214,7 +214,6 @@
> > 
> > 	for (;;) {
> > 		struct writeback_control wbc = {
> > -			.bdi            = bdi,
> > 			.sync_mode      = WB_SYNC_NONE,
> > 			.older_than_this = NULL,
> > 			.nr_to_write    = write_chunk,
> 
> Arguably we just have the wrong backing-device here, and what we should do
> is propagate the real backing device's pointer up into the filesystem.
> There's machinery for this which things like DM stacks use.

If you change the loopback backing-device, you just turn this nicely 
reproducible example into a subtle race condition that can happen whether 
you use loopback or not. Think about what happens when a different process 
dirties memory:

You have process "A" that has dirtied a lot of pages on device "1" but has 
not started writing them.
You have process "B" that is trying to write to device "2"; it sees the 
dirty page count over the limit, but can't do anything about it, because it 
is only allowed to flush pages on device "2" --- so it loops endlessly.
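
To make that loop shape concrete, here is a small user-space model (an 
illustration only, not kernel code; the device numbers, page counts, 
threshold and chunk size are made up). Like balance_dirty_pages, it exits 
on the global dirty count but may only flush pages belonging to one device, 
so a process in "B"'s position makes no progress on its own:

/* toy_livelock.c -- model of the per-device flush vs. global limit mismatch */
#include <stdio.h>

#define DIRTY_THRESH	100	/* made-up global dirty limit (pages) */
#define WRITE_CHUNK	32	/* pages we try to flush per iteration */

static int dirty[2] = { 150, 0 };	/* device "1" is dirty, device "2" is clean */

static int global_dirty(void)
{
	return dirty[0] + dirty[1];
}

/* Flush at most nr pages, but only from one device (like wbc->bdi). */
static int flush_device(int dev, int nr)
{
	int done = dirty[dev] < nr ? dirty[dev] : nr;
	dirty[dev] -= done;
	return done;
}

int main(void)
{
	int dev = 1;	/* process "B" writes to device "2" (index 1 here) */
	int i;

	for (i = 1; global_dirty() > DIRTY_THRESH; i++) {
		int written = flush_device(dev, WRITE_CHUNK);

		printf("iteration %d: global dirty %d, wrote %d pages\n",
		       i, global_dirty(), written);

		if (i >= 5) {
			printf("still over the limit, no progress -> livelock\n");
			return 1;
		}
		/* the real code would congestion_wait() here and retry forever */
	}
	printf("below the limit after %d iterations\n", i - 1);
	return 0;
}

Change dev to 0 and the loop terminates after a couple of chunks, which is 
what dropping wbc->bdi in the patch above achieves: a process over the 
limit is allowed to flush whatever is actually dirty, on any device.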

If you want to keep the current flushing semantics, you have to audit the 
whole kernel to make sure that whenever some process sees an over-limit 
dirty page count, there is another process that is flushing the pages. 
Currently that is not true: the "dd" process sees the over-limit count, but 
no one is writing.

> I wonder if the post-2.6.23 changes happened to make this problem go away.

I will try 2.6.24-rc2, but I don't think the root cause of this went away. 
Maybe you just reduced the probability.

Mikulas
