Re: [PATCH] Fix a race condition between ->i_mapping and iput()

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



OGAWA Hirofumi <[email protected]> wrote:
>
> This race became a cause of oops, and can reproduce by the following.
> 
>     while true; do
> 	dd if=/dev/zero of=/dev/.static/dev/hdg1 bs=512 count=1000 & sync
>     done
> 
> 
> This race condition was between __sync_single_inode() and iput().
> 
>           cpu0 (fs's inode)                 cpu1 (bdev's inode)
> ------------------------------------------------------------------------
>                                        close("/dev/hda2")
>                                        [...]
> __sync_single_inode()
>    /* copy the bdev's ->i_mapping */
>    mapping = inode->i_mapping;
> 
>                                        generic_forget_inode()
>                                           bdev_clear_inode()
> 					     /* restre the fs's ->i_mapping */
> 				             inode->i_mapping = &inode->i_data;
> 				          /* bdev's inode was freed */
>                                           destroy_inode(inode);
> 
>    if (wait) {
>       /* dereference a freed bdev's mapping->host */
>       filemap_fdatawait(mapping);  /* Oops */
> 
> Since __sync_signle_inode() is only taking a ref-count of fs's inode,
> the another process can be close() and freeing the bdev's inode while
> writing fs's inode.  So, __sync_signle_inode() accesses the freed
> ->i_mapping, oops.
> 
> This patch takes ref-count of bdev's inode for fs's inode before
> setting a ->i_mapping, and the clear_inode() of fs's inode does iput().
> So, if fs's inode is still living, bdev's inode shouldn't be freed.
> 

OK, so to rephrase:

Whenever /dev/sda's inode->i_mapping points at a kernel-internal blockdev
inode's i_data, we will hold a ref on that blockdev inode, to pin its
i_data.  That sounds sane.


> 
> diff -puN fs/block_dev.c~i_mapping-race-fix-2 fs/block_dev.c
> --- linux-2.6/fs/block_dev.c~i_mapping-race-fix-2	2006-03-11 16:19:07.000000000 +0900
> +++ linux-2.6-hirofumi/fs/block_dev.c	2006-03-11 17:52:11.000000000 +0900
> @@ -418,21 +418,31 @@ EXPORT_SYMBOL(bdput);
>  static struct block_device *bd_acquire(struct inode *inode)
>  {
>  	struct block_device *bdev;
> +
>  	spin_lock(&bdev_lock);
>  	bdev = inode->i_bdev;
> -	if (bdev && igrab(bdev->bd_inode)) {
> +	if (bdev) {
> +		atomic_inc(&bdev->bd_inode->i_count);

No longer checking (I_FREEING|I_WILL_FREE).  Why was this change included?

>  		spin_unlock(&bdev_lock);
>  		return bdev;
>  	}
>  	spin_unlock(&bdev_lock);
> +
>  	bdev = bdget(inode->i_rdev);
>  	if (bdev) {
>  		spin_lock(&bdev_lock);
> -		if (inode->i_bdev)
> -			__bd_forget(inode);
> -		inode->i_bdev = bdev;
> -		inode->i_mapping = bdev->bd_inode->i_mapping;
> -		list_add(&inode->i_devices, &bdev->bd_inodes);
> +		if (!inode->i_bdev) {
> +			/*
> +			 * We take an additional bd_inode->i_count for inode,
> +			 * and it's released in clear_inode() of inode.
> +			 * So, we can access it via ->i_mapping always
> +			 * without igrab().
> +			 */
> +			atomic_inc(&bdev->bd_inode->i_count);
> +			inode->i_bdev = bdev;
> +			inode->i_mapping = bdev->bd_inode->i_mapping;
> +			list_add(&inode->i_devices, &bdev->bd_inodes);
> +		}

And why this change?  The removal of the __bd_forget() and the changed
behaviour if (inode->i_bdev != NULL)?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux