Re: [PATCH] Remove softlockup from invalidate_mapping_pages.

NeilBrown <[email protected]> wrote:
>
> The following patch fixes a problem with invalidate_mapping_pages.
> Please look at the patch description and then come back here, because
> there are some things I don't understand which you might be able to
> help me with.
> ...
> 
> Thanks. 
> I have had 2 reports of softlockups in this code apparently related to md. 
> md calls invalidate_bdev on all the component block devices that it is 
> building into an array.  These block devices are very likely to have just one
> page in their mapping, right near the end, as mdadm will have read the
> superblock, which lives near the end of the device.
> 
> However, I cannot see why the page would be locked.
> Being locked for read is very unlikely because mdadm would have already
> read the superblock.  I guess locked for read-ahead might be possible
> (I assume readahead does lock the pages), but as only one or maybe two reads
> are performed by mdadm, not much readahead should be generated.
> 
> Being locked for write also seems unlikely: if mdadm were to write
> (which is fairly unlikely but not impossible) it would fsync() straight
> away, so by the time it comes to assemble the array, all IO
> should have finished.
> 
> So that is (1) - I don't see why the page would be locked.

Dunno.  Memory reclaim could lock the page, but that'd be pretty rare.
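(Reclaim only trylocks, though, and unlocks again before moving on - roughly
the 2.6-era shrink_list() pattern in mm/vmscan.c, from memory:

	if (TestSetPageLocked(page))
		goto keep;	/* already locked: leave it for next time */
	/* ... decide whether to unmap / write back / free the page ... */
	unlock_page(page);

so it shouldn't hold the lock anywhere near 76 seconds.)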

> And (2) - I have a report (on linux-raid) of a soft-lockup which
> lasted 76 seconds! 
> Now if the device was 100 Gig, that is 25 million page addresses, or
> 3 microseconds per loop.  Is that at all likely for this loop?  It
> does take and drop a spinlock, but could that come close to a
> few thousand cycles?

Maybe this CPU is holding some lock which prevents the page-unlocking thread
from unlocking the page?

You could perhaps do

	page->who_locked_me = __builtin_return_address(<something>);

in lock_page().  And in TestSetPageLocked().  It'd take some poking. 
Alternatively a well-timed sysrq-T might tell you.
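Fleshing the first idea out - purely illustrative, since who_locked_me is not
a real struct page field (you'd have to add one, and chase every locking site):

	/* debug wrappers recording who took the page lock (hypothetical) */
	static inline void debug_lock_page(struct page *page)
	{
		lock_page(page);
		page->who_locked_me = __builtin_return_address(0);
	}

	static inline int debug_test_set_page_locked(struct page *page)
	{
		int failed = TestSetPageLocked(page);

		if (!failed)	/* we took the lock: record the culprit */
			page->who_locked_me = __builtin_return_address(0);
		return failed;
	}

Then the softlockup path could print_symbol() the recorded address.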


As you point out, the loop will count up to page->index one-at-a-time.  3us
sounds rather heavy, but the radix-tree gang-lookup code isn't terribly
efficient, and the tree will be deep.
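(Checking the report's arithmetic: 100G of 4k pages is ~26 million indices,
and 76s / 26M is indeed ~2.9us per trip around the loop - one radix-tree gang
lookup per iteration, each returning the same locked page.)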

> And the processor in this case was a dual-core amd64 - with SMP enabled.
> I can imagine a long lockup on a uniprocessor, but if a second processor
> core is free to unlock the page when the IO (whatever it is) completes,
> a 76 second delay would be unexpected.

yup.

> The original bug report can be found at
>   http://marc.theaimsgroup.com/?l=linux-raid&m=114550096908177&w=2
> 
> Finally (3) - I think that invalidate_mapping_pages should probably
> have a cond_resched() call in it, except that drop_pagecache_sb in
> fs/drop_caches.c calls it with the "inode_lock" spinlock held, which
> would be bad.

Yes, we used to have a cond_resched() in there, but I took it out, for that
reason.  I'm bad.

> Would it be safe (or could it be made safe) to drop and
> regain the lock around that call?

Hard.  We're walking a long list-head chain which is pinned by that lock.
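For reference, that caller is roughly (paraphrasing fs/drop_caches.c from
memory):

	static void drop_pagecache_sb(struct super_block *sb)
	{
		struct inode *inode;

		spin_lock(&inode_lock);
		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
			if (inode->i_state & (I_FREEING|I_WILL_FREE))
				continue;
			/* -> invalidate_mapping_pages() */
			invalidate_inode_pages(inode->i_mapping);
		}
		spin_unlock(&inode_lock);
	}

Drop inode_lock mid-walk and the inode under the cursor can be freed; you'd
have to __iget() each inode first and iput() it afterwards, and iput() must
not be called under inode_lock.  Can of worms.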

> If invalidate_mapping_pages is called to invalidate a very large
> mapping (e.g. a very large block device) and if the only active page
> in that device is near the end (or at least, at a very large index),
> such as, say, the superblock of an md array, and if that page
> happens to be locked when invalidate_mapping_pages is called,
> then
>   pagevec_lookup will return this page and
>   as it is locked, 'next' will be incremented and pagevec_lookup
>   will be called again. and again. and again.
>   while we count from 0 up to a very large number.
> 
> We should really always set 'next' to 'page->index+1' before going
> around the loop again, not just if the page isn't locked.
> 

OK.
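To spell the failure mode out, with a single locked page at index N:

	next = 0
	pagevec_lookup(next=0)  -> returns the page at index N
	page is locked          -> next++ only, so next = 1
	pagevec_lookup(next=1)  -> the same page again
	...                     -> N round trips before next gets past N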

> Cc: "Steinar H. Gunderson" <[email protected]>
> Signed-off-by: Neil Brown <[email protected]>
> 
> ### Diffstat output
>  ./mm/truncate.c |   10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff ./mm/truncate.c~current~ ./mm/truncate.c
> --- ./mm/truncate.c~current~	2006-04-20 15:27:22.000000000 +1000
> +++ ./mm/truncate.c	2006-04-20 15:38:20.000000000 +1000
> @@ -238,13 +238,11 @@ unsigned long invalidate_mapping_pages(s
>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>  			struct page *page = pvec.pages[i];
>  
> -			if (TestSetPageLocked(page)) {
> -				next++;
> +			next = page->index+1;
> +
> +			if (TestSetPageLocked(page))
>  				continue;
> -			}
> -			if (page->index > next)
> -				next = page->index;
> -			next++;
> +
>  			if (PageDirty(page) || PageWriteback(page))
>  				goto unlock;
>  			if (page_mapped(page))

We're not supposed to look at page->index of an unlocked page.

In practice, I think it's OK - there's no _reason_ why anyone would want to
trash the ->index of a just-truncated page.  However, I think it'd be saner
to a) only look at ->index after we've tried to lock the page, and b) make
sure that ->index really is "to the right" of where we're currently at.

How's this look?

--- devel/mm/truncate.c~remove-softlockup-from-invalidate_mapping_pages	2006-04-20 00:20:49.000000000 -0700
+++ devel-akpm/mm/truncate.c	2006-04-20 00:28:18.000000000 -0700
@@ -230,14 +230,24 @@ unsigned long invalidate_mapping_pages(s
 			pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
 		for (i = 0; i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];
+			pgoff_t index;
+			int lock_failed;
 
-			if (TestSetPageLocked(page)) {
-				next++;
-				continue;
-			}
-			if (page->index > next)
-				next = page->index;
+			lock_failed = TestSetPageLocked(page);
+
+			/*
+			 * We really shouldn't be looking at the ->index of an
+			 * unlocked page.  But we're not allowed to lock these
+			 * pages.  So we rely upon nobody altering the ->index
+			 * of this (pinned-by-us) page.
+			 */
+			index = page->index;
+			if (index > next)
+				next = index;
 			next++;
+			if (lock_failed)
+				continue;
+
 			if (PageDirty(page) || PageWriteback(page))
 				goto unlock;
 			if (page_mapped(page))
_
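(The behavioural change is the same as in Neil's version: a locked page now
costs a single trip around the loop, with next jumping straight past its
index, rather than up to page->index trips.)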


