Re: 2.6.17-rc6-mm1 -- BUG: possible circular locking deadlock detected!

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sat, 10 Jun 2006, Ingo Molnar wrote:
> * Anton Altaparmakov <[email protected]> wrote:
> > The normal access pattern (B) where the runlist lock is always taken 
> > first.  And the mft record lock is taken second and only if the 
> > runlist is incomplete in-memory.
> > 
> > Of course on file modification, this is also the case, the runlist 
> > lock is taken first, then the mft record lock is taken and thus both 
> > the runlist and the inode can be updated with the new data (e.g. on a 
> > file extend).
> 
> thanks for the detailed explanation!

You are welcome.  NTFS is not exactly simple code and its locking is 
certainly not simple at all...

> I have annotated the code for the lock validator as much as i could, by:
> 
> - excluding ntfs_fill_super() from the locking rules,
> 
> - 'splitting' the MFT's mrec_lock and runlist->lock locking rules from 
>   the other inodes's locking rules,
> 
> - splitting the mrec_lock rules of extent inodes. (We map them
>   recursively while having the main inode mft record mapped. The nesting
>   is safe because inode->extent_inode is a noncircular relation.)
 
Sounds good!

> Still there seems to be a case that the validator does not grok: 
> load_attribute_list() seems to take the lock in the opposite order from 
> what you described above. What locking detail am i missing? [let me know 
> if you need all dependency events leading up to this message from the 
> validator]

Ah, yes, I had completely forgotten about that one, sorry.  This is 
another special case I am afraid.  Note how there are two runlists in an 
ntfs_inode.  We have:

ntfs_inode->runlist - This is the runlist for the inode data (for files 
this is the actual file data, for directories this is the directory 
B-tree, for index inodes this is the view index B-tree, for attribute 
inodes this is the attribute data).  This is what we have been dealing 
with before and what you have already solved by the sounds of things.

ntfs_inode->attr_list_rl - This is the runlist for the attribute list 
attribute of a base inode, i.e. it does not exist for index inodes or 
attribute inodes, it only exists for regular files and directories (and 
once we implement symlinks, sockets, fifos and device special files for 
those, too).  Further it only exists for inodes which have extent inodes 
and those are definitely in the minority but that is irrelevant as we have 
to support it none the less.  The attribute list attribute is a linear 
list or better perhaps to describe it as an array of variable size 
records, and each record describes and attribute belonging to the inode, 
which part of the attribute this attribute extent describes and where 
(i.e. in which mft record) this attribute extent belongs.  As such the 
attribute list attribute is only ever used when the mft record is mapped 
and an attribute is being looked up and also when a new attribute is 
created or an attribute is deleted or an attribute is truncated in which 
case the attribute list must be updated.  In all those cases the mft 
record is mapped.  Thus yes in this case the locking is inverted but that 
is because ntfs_inode->attr_list_rl->lock will always be taken with the 
mft record lock held.

I think the lock validator has the problem of not knowing that there are 
two different types of runlist which is why it complains about it.

Another thing to note is that load_attribute_list() is only called at 
iget() time thus we have exclusive access to the inode so it is a special 
case in any case.

And at present the driver in the kernel does not ever modify the attribute 
list attribute (and hence does not modify its runlist either) thus the 
only time the attr_list_rl is used is at load_attribute_list() time.

Something I am noticing now is that given that the attribute list runlist 
is only ever accessed under the mft record lock it is in fact pointless to 
lock it at all as the mft record lock is an exclusive lock.

The reason I never noticed this before is that I started off having the 
mft record lock being a read-write lock so multiple processes could be 
read-accessing the mft record so I wanted to keep the attribute list 
locked as well.

I have to look through the code to make sure but I think we could simply 
not take the runlist lock when accessing the attr_list_rl and then the 
problem would go away.  The only thing is that when we start modifying the 
runlist attr_list_rl we will use functions that are common for the two 
runlist code paths and they may take the lock.  Not a huge problem as they 
can always use the _nolock version.

So I suggest if it is easy, to teach the validator to not complain about 
this as that is the order the locking is meant to happen and it is not 
circular because it is two different types of runlist being locked.  That 
gives me the possibility of later on turning the mft record lock back into 
a read-write lock to improve parallelism when reading data on ntfs...

If that is hard then you or I (or both) need to investigate whether the 
current driver can really simply not take the runlist lock when accessing 
attr_list_rl i.e. we need to make sure it always happens under the 
protection of the mft record lock and then the problem will go away.

Best regards,

	Anton

> =======================================================
> [ INFO: possible circular locking dependency detected ]
> -------------------------------------------------------
> ls/2581 is trying to acquire lock:
>  (&rl->lock){----}, at: [<c01c1f5b>] load_attribute_list+0xfb/0x3c0
> 
> but task is already holding lock:
>  (&ni->mrec_lock){--..}, at: [<c01d50c5>] map_mft_record_type+0x55/0x2d0
> 
> which lock already depends on the new lock,
> which could lead to circular locking dependencies.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&ni->mrec_lock){--..}:
>        [<c01394df>] lock_acquire+0x6f/0x90
>        [<c0346183>] mutex_lock_nested+0x73/0x2a0
>        [<c01d5e43>] map_mft_record+0x53/0x2c0
>        [<c01c54f8>] ntfs_map_runlist_nolock+0x3d8/0x530
>        [<c01c5bc1>] ntfs_map_runlist+0x41/0x70
>        [<c01c1929>] ntfs_readpage+0x8c9/0x9b0
>        [<c0142ffc>] read_cache_page+0xac/0x150
>        [<c01e212d>] ntfs_statfs+0x41d/0x660
>        [<c0163254>] vfs_statfs+0x54/0x70
>        [<c0163288>] vfs_statfs64+0x18/0x30
>        [<c0163384>] sys_statfs64+0x64/0xa0
>        [<c0347dcd>] sysenter_past_esp+0x56/0x8d
> 
> -> #0 (&rl->lock){----}:
>        [<c01394df>] lock_acquire+0x6f/0x90
>        [<c0134c8a>] down_read_nested+0x2a/0x40
>        [<c01c1f5b>] load_attribute_list+0xfb/0x3c0
>        [<c01d323e>] ntfs_read_locked_inode+0xcee/0x15d0
>        [<c01d4735>] ntfs_iget+0x55/0x80
>        [<c01db3da>] ntfs_lookup+0x14a/0x740
>        [<c01736b6>] do_lookup+0x126/0x150
>        [<c0173ef3>] __link_path_walk+0x813/0xe50
>        [<c017457c>] link_path_walk+0x4c/0xf0
>        [<c0174a2d>] do_path_lookup+0xad/0x260
>        [<c0175228>] __user_walk_fd+0x38/0x60
>        [<c016e3be>] vfs_lstat_fd+0x1e/0x50
>        [<c016e401>] vfs_lstat+0x11/0x20
>        [<c016ec04>] sys_lstat64+0x14/0x30
>        [<c0347dcd>] sysenter_past_esp+0x56/0x8d
> 
> other info that might help us debug this:
> 
> 2 locks held by ls/2581:
>  #0:  (&inode->i_mutex){--..}, at: [<c0346108>] mutex_lock+0x8/0x10
>  #1:  (&ni->mrec_lock){--..}, at: [<c01d50c5>] map_mft_record_type+0x55/0x2d0
> 
> stack backtrace:
>  [<c0104bf2>] show_trace+0x12/0x20
>  [<c0104c19>] dump_stack+0x19/0x20
>  [<c0136ef1>] print_circular_bug_tail+0x61/0x70
>  [<c01389ff>] __lock_acquire+0x74f/0xde0
>  [<c01394df>] lock_acquire+0x6f/0x90
>  [<c0134c8a>] down_read_nested+0x2a/0x40
>  [<c01c1f5b>] load_attribute_list+0xfb/0x3c0
>  [<c01d323e>] ntfs_read_locked_inode+0xcee/0x15d0
>  [<c01d4735>] ntfs_iget+0x55/0x80
>  [<c01db3da>] ntfs_lookup+0x14a/0x740
>  [<c01736b6>] do_lookup+0x126/0x150
>  [<c0173ef3>] __link_path_walk+0x813/0xe50
>  [<c017457c>] link_path_walk+0x4c/0xf0
>  [<c0174a2d>] do_path_lookup+0xad/0x260
>  [<c0175228>] __user_walk_fd+0x38/0x60
>  [<c016e3be>] vfs_lstat_fd+0x1e/0x50
>  [<c016e401>] vfs_lstat+0x11/0x20
>  [<c016ec04>] sys_lstat64+0x14/0x30
>  [<c0347dcd>] sysenter_past_esp+0x56/0x8d

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux