Re: [RFC PATCH] Make use of PageMappedToDisk() to improve ext3

Badari Pulavarty <pbadari@us.ibm.com> wrote:
>
> Hi Andrew,
> 
> Here is my first pass at making use of PageMappedToDisk() to avoid
> ext3 prepare_write/commit_write handling + journaling.
> 
> I see significant improvements with my micro benchmarks for over-write
> (pages again & again) case. Does this look right to you ?
> 
> 	2.6.16-rc6	2.6.16-rc6+patch
> real	0m6.606s	0m3.705s
> user	0m0.124s	0m0.108s
> sys	0m6.456s	0m3.600s
> 

Those numbers are suspiciously good.  No I/O involved, so I spose it makes
sense.

I suggest you test this with fsx-linux on a 1k blocksize filesystem with lots
of pageout pressure happening.  With a combination of mmapped writes and
write()s.

> 
> Make use of PageMappedToDisk(page) to find out if we need to
> block allocation and skip the calls to it, if not needed.
> When we are not doing block allocation, also avoid calls
> to journal start and adding buffers to transaction.

I worry about this:

> ===================================================================
> --- linux-2.6.16-rc6.orig/fs/buffer.c	2006-03-11 14:12:55.000000000 -0800
> +++ linux-2.6.16-rc6/fs/buffer.c	2006-03-15 13:05:26.000000000 -0800
> @@ -2051,8 +2051,10 @@ static int __block_commit_write(struct i
>  	 * the next read(). Here we 'discover' whether the page went
>  	 * uptodate as a result of this (potentially partial) write.
>  	 */
> -	if (!partial)
> +	if (!partial) {
>  		SetPageUptodate(page);
> +		SetPageMappedToDisk(page);
> +	}
>  	return 0;
>  }

We're setting SetPageMappedToDisk here if all the buffers are uptodate. 
But that doesn't mean they're mapped to disk!  They could be over file
holes and they were brought uptodate with memset.

Are you really sure that we're clearing PageMappedToDisk in all the right
places?  Truncate...

> Index: linux-2.6.16-rc6/fs/ext3/inode.c
> ===================================================================
> --- linux-2.6.16-rc6.orig/fs/ext3/inode.c	2006-03-11 14:12:55.000000000 -0800
> +++ linux-2.6.16-rc6/fs/ext3/inode.c	2006-03-15 13:09:14.000000000 -0800
> @@ -999,6 +999,13 @@ static int ext3_prepare_write(struct fil
>  	handle_t *handle;
>  	int retries = 0;
>  
> +
> +	/*
> + 	 * If the page is already mapped to disk and we are not
> +	 * journalling the data - there is nothing to do.
> +	 */
> +	if (PageMappedToDisk(page) && !ext3_should_journal_data(inode))
> +		return 0;

Yipes.  So we're not attaching the buffers to the journal at all in this
case.  So if nothing has been physically written to the disk yet, a
crash+recovery will expose unwritten-to disk blocks.

OK, that might not happen with this patch because __block_commit_write() is
presently checking BH_uptodate rather than BH_mapped.  And a file hole will
still be a file hole after the crash+recovery.  But if
__block_commit_write() is changed to test buffer_mapped(), I _think_ we can
leak uninitialised data on recovery.

Or maybe you thought all this through and didn't tell me :(

Either way, this optimisation (not attaching the ordered-mode buffers to
the transaction) worries me.  It's largely unrelated to the
PageMappedToDisk() optimisation and should be coded and thought about
separately.  It's a significant change in ordered-mode ext3 semantics.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: [RFC PATCH] Make use of PageMappedToDisk() to improve ext3
  - From: Badari Pulavarty <pbadari@us.ibm.com>

References:
- [RFC PATCH] Make use of PageMappedToDisk() to improve ext3
  - From: Badari Pulavarty <pbadari@us.ibm.com>

Prev by Date: time_interpolator: add __read_mostly
Next by Date: RE: libata/sata_nv latency on NVIDIA CK804 [was Re: AMD64 X2 lost ticks on PM timer]
Previous by thread: [RFC PATCH] Make use of PageMappedToDisk() to improve ext3
Next by thread: Re: [RFC PATCH] Make use of PageMappedToDisk() to improve ext3
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]