Zach Brown <[email protected]> wrote:
>
> > So where is the lock inversion?
> >
> > Perhaps if you were to cook up one of those little threadA/threadB ascii
> > diagrams we could see where the inversion occurs?
>
> Yeah, let me give that a try. I'll try to trim it down to the relevant
> bits. First let's start with a totally fresh node and have a read get a
> read DLM lock and populate the page cache on this node:
>
> sys_read
>  generic_file_aio_read
>   ocfs2_readpage
>    ocfs2_data_lock
>    block_read_full_page
>    ocfs2_data_unlock
>
> So it was only allowed to proceed past ocfs2_data_lock() towards
> block_read_full_page() once the DLM granted it a read lock. When it
> calls ocfs2_data_unlock() it is only dropping this caller's local
> reference on the lock. The lock still exists on that node and is still
> valid and holding data in the page cache until the node gets a network
> message saying that another node, which is probably going to be
> writing, would like the lock dropped.
>
> DLM kernel threads respond to the network messages and truncate the page
> cache. While the thread is busy with this inode's lock, other paths on
> that node won't be able to get locks. Say one of those messages arrives.
> While a local DLM thread is invalidating the page cache another user
> thread tries to read:
>
> user thread                           dlm thread
>
>                                       kthread
>                                        ...
>                                         ocfs2_data_convert_worker
I assume there's an ocfs2_data_lock
hereabouts?
>                                          truncate_inode_pages
> sys_read
>  generic_file_aio_read
>   * gets page lock
>   ocfs2_readpage
>    ocfs2_data_lock
>     (stuck waiting for dlm)
>                                            lock_page
>                                             (stuck waiting for page)
>
Why does ocfs2_readpage() need to take ocfs2_data_lock? (Is
ocfs2_data_lock a range-based read-lock thing, or what?)
> The user task holds a page lock while waiting for the DLM to allow it to
> proceed. The DLM thread is preventing lock granting progress while
> waiting for the page lock that the user task holds.
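Stripped of the filesystem details this is the classic AB-BA inversion. A toy
userspace model (purely illustrative, with two mutexes standing in for the
page lock and the DLM grant) shows the ordering:

#include <pthread.h>

static pthread_mutex_t page_lock = PTHREAD_MUTEX_INITIALIZER;	/* "A" */
static pthread_mutex_t dlm_grant = PTHREAD_MUTEX_INITIALIZER;	/* "B" */

/* read path: takes the page lock, then waits for the DLM */
static void *user_thread(void *unused)
{
	pthread_mutex_lock(&page_lock);
	pthread_mutex_lock(&dlm_grant);
	pthread_mutex_unlock(&dlm_grant);
	pthread_mutex_unlock(&page_lock);
	return NULL;
}

/* downconvert worker: holds up DLM grants, then needs the page lock */
static void *dlm_thread(void *unused)
{
	pthread_mutex_lock(&dlm_grant);
	pthread_mutex_lock(&page_lock);
	pthread_mutex_unlock(&page_lock);
	pthread_mutex_unlock(&dlm_grant);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, user_thread, NULL);
	pthread_create(&b, NULL, dlm_thread, NULL);

	/* with unlucky timing neither join ever returns -- that's the hang */
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}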
>
> I don't know how far to go in explaining what leads up to laying out the
> locking like this. It is typical (and OCFS2 used to do this) to wait
> for the DLM locks up in file->{read,write} and pin them for the duration
> of the IO. This avoids the page lock and DLM lock inversion problem,
> but it suffers from a host of other problems -- most fatally needing
> that vma walking to govern holding multiple DLM locks during an IO.
Oh.
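For reference, the old scheme presumably looked something like this -- a
sketch only, with the entry point and lock arguments simplified:

#include <linux/fs.h>

static ssize_t ocfs2_file_read_sketch(struct file *file, char __user *buf,
				      size_t count, loff_t *ppos)
{
	struct inode *inode = file->f_mapping->host;
	ssize_t ret;

	/* cluster lock taken before any page lock, held for the whole IO */
	ret = ocfs2_data_lock(inode, 0);
	if (ret)
		return ret;

	ret = generic_file_read(file, buf, count, ppos);

	ocfs2_data_unlock(inode, 0);
	return ret;
}

That keeps the ordering consistent (DLM lock outside page locks), at the cost
of the vma-walking and multiple-lock problems you mention.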
Have you considered using invalidate_inode_pages() instead of
truncate_inode_pages()? If that leaves any pages behind, drop the read
lock, sleep a bit, try again - something klunky like that might get you out
of trouble, dunno.
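i.e. something along these lines in the downconvert worker (very rough, and
the helper name here is made up):

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/delay.h>

static void ocfs2_downconvert_data_sketch(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;

	for (;;) {
		/* skips locked/busy pages instead of sleeping on them */
		invalidate_inode_pages(mapping);

		if (mapping->nrpages == 0)
			break;	/* everything gone, safe to let the lock go */

		/*
		 * Pages left behind: in the real thing you'd drop the read
		 * lock or requeue the worker here rather than sleep inline,
		 * but the idea is the same -- back off and retry.
		 */
		msleep(10);
	}
}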