On Wed, May 16, 2007 at 01:37:26PM -0700, Andrew Morton wrote:
> On Wed, 16 May 2007 16:14:14 -0400
> Chris Mason <[email protected]> wrote:
>
> > On Wed, May 16, 2007 at 01:04:13PM -0700, Andrew Morton wrote:
> > > > The good news is that if you let it run long enough, the times
> > > > stabilize. The bad news is:
> > > >
> > > > create dir kernel-86 222MB in 15.85 seconds (14.03 MB/s)
> > > > create dir kernel-87 222MB in 28.67 seconds (7.76 MB/s)
> > > > create dir kernel-88 222MB in 18.12 seconds (12.27 MB/s)
> > > > create dir kernel-89 222MB in 19.77 seconds (11.25 MB/s)
> > >
> > > well hang on. Doesn't this just mean that the first few runs were writing
> > > into pagecache and the later ones were blocking due to dirty-memory limits?
> > >
> > > Or do you have a sync in there?
> > >
> > There's no sync, but if you watch vmstat you can clearly see the log
> > flushes, even when the overall create times are 11MB/s. vmstat goes
> > 30MB/s -> 4MB/s or less, then back up to 30MB/s.
>
> How do you know that it is a log flush rather than, say, pdflush
> hitting the blockdev inode and doing a big seeky write?

Ok, I did some more work to split out the two cases (block device inode
writeback and log flushing).

I patched jbd's log_do_checkpoint to put all the blocks it wanted to
write into a radix tree, then send them all down in sorted order at the
end. The elevator should be helping here, but jbd is sending down 2,000
to 3,000 blocks during the checkpoint, and upping nr_requests alone
didn't seem to do the trick.
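
Roughly, the idea looks like this (a sketch only, not the actual patch;
the helper names are invented, it assumes b_blocknr fits in an unsigned
long, and it uses the old submit_bh(int rw, bh) convention; a real
version would also need radix_tree_preload() under jbd's locking):

#include <linux/kernel.h>
#include <linux/radix-tree.h>
#include <linux/buffer_head.h>

static RADIX_TREE(checkpoint_bufs, GFP_NOFS);

/* queue a buffer instead of writing it immediately */
static int checkpoint_queue_bh(struct buffer_head *bh)
{
	get_bh(bh);
	return radix_tree_insert(&checkpoint_bufs,
				 (unsigned long)bh->b_blocknr, bh);
}

/* once the checkpoint scan is done, submit everything in block order */
static void checkpoint_submit_queued(void)
{
	struct buffer_head *batch[16];
	unsigned long next = 0;
	unsigned int i, nr;

	while ((nr = radix_tree_gang_lookup(&checkpoint_bufs, (void **)batch,
					    next, ARRAY_SIZE(batch))) > 0) {
		for (i = 0; i < nr; i++) {
			struct buffer_head *bh = batch[i];

			next = (unsigned long)bh->b_blocknr + 1;
			radix_tree_delete(&checkpoint_bufs,
					  (unsigned long)bh->b_blocknr);
			lock_buffer(bh);
			if (test_clear_buffer_dirty(bh)) {
				bh->b_end_io = end_buffer_write_sync;
				/* end_buffer_write_sync drops our ref */
				submit_bh(WRITE, bh);
			} else {
				unlock_buffer(bh);
				put_bh(bh);
			}
		}
	}
}

The point is just that the elevator sees one long ascending stream
instead of a few thousand scattered requests.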

Unpatched ext3 breaks down into seeks after 8 kernel trees (222MB each)
have been created. With the radix sorting, the first 15 kernel trees
are created quickly, and then things slow down.

So I waited until around the 25th kernel tree was created, hit ctrl-c
and ran sync. vmstat showed writes going at 2MB/s, and sysrq-w showed
sync spending most of that 2MB/s period writing back the block device
inode.

It looks as though the dirty pages on the block device inode are spread
out far enough that we're not getting good streaming writes. Mark
Fasheh ran the same test on a bigger raid array, where performance was
consistently good for the whole run. I'm assuming the larger write
cache on the array was able to group the data writes with the metadata
on disk, while my poor little sata drive couldn't. Dave Chinner hinted
that xfs probably suffers from a similar problem, which is usually fixed
by backing the FS with stripes and big raid.

My vaporware FS is able to maintain speed through the run because the
allocator tries to keep data and metadata grouped into 256MB chunks, so
they don't end up mingling on disk until things get full.
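
As a toy illustration of that allocation policy (nothing like the real
code, the structures here are made up purely to show the idea): the disk
is carved into 256MB chunks, each chunk is dedicated to either data or
metadata, and the two only mix once the free chunks run out:

#include <stdint.h>
#include <stddef.h>

#define CHUNK_SIZE	(256ULL * 1024 * 1024)
#define BLOCK_SIZE	4096ULL

enum chunk_type { CHUNK_FREE, CHUNK_DATA, CHUNK_METADATA };

struct chunk {
	enum chunk_type type;	/* what this chunk is reserved for */
	uint64_t start;		/* byte offset of the chunk on disk */
	uint64_t next_free;	/* bump pointer inside the chunk */
};

/*
 * Hand out one block of the requested type, preferring a chunk that
 * already holds that type so data and metadata stay grouped.
 */
static int64_t alloc_block(struct chunk *chunks, size_t nr_chunks,
			   enum chunk_type type)
{
	size_t i;

	/* first pass: reuse a partially filled chunk of the right type */
	for (i = 0; i < nr_chunks; i++) {
		struct chunk *c = &chunks[i];

		if (c->type == type &&
		    c->next_free + BLOCK_SIZE <= c->start + CHUNK_SIZE) {
			int64_t blk = c->next_free;

			c->next_free += BLOCK_SIZE;
			return blk;
		}
	}

	/* second pass: claim a free chunk for this type */
	for (i = 0; i < nr_chunks; i++) {
		struct chunk *c = &chunks[i];

		if (c->type == CHUNK_FREE) {
			c->type = type;
			c->next_free = c->start + BLOCK_SIZE;
			return c->start;
		}
	}

	/* full: a real allocator would fall back to mixing types */
	return -1;
}

Once every chunk has been claimed you're back to mixing, which is why
things only start to degrade as the disk fills.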

At any rate, it may be worth putzing with the writeback routines to look
for nearby dirty pages in the block dev inode while doing data
writeback. My guess is that ext3 would go 1.5x to 2x faster for this
particular run, but that's a lot of added complexity, so I'm not
convinced it's a great idea.
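
Something along these lines, very roughly (untested sketch against the
current pagevec APIs; the helper and the window size are invented, and
the caller would have to translate the physical block it just wrote
into a page index in the block device mapping):

#include <linux/pagemap.h>
#include <linux/pagevec.h>
#include <linux/writeback.h>

#define CLUSTER_PAGES	32	/* arbitrary window around the data write */

static void write_nearby_blockdev_pages(struct address_space *bdev_mapping,
					pgoff_t center)
{
	struct pagevec pvec;
	pgoff_t index = (center > CLUSTER_PAGES) ? center - CLUSTER_PAGES : 0;
	pgoff_t end = center + CLUSTER_PAGES;
	unsigned int i, nr;

	pagevec_init(&pvec, 0);
	while (index <= end &&
	       (nr = pagevec_lookup_tag(&pvec, bdev_mapping, &index,
					PAGECACHE_TAG_DIRTY,
					PAGEVEC_SIZE)) != 0) {
		for (i = 0; i < nr; i++) {
			struct page *page = pvec.pages[i];

			if (page->index > end)
				break;
			lock_page(page);
			if (page->mapping == bdev_mapping &&
			    clear_page_dirty_for_io(page)) {
				struct writeback_control wbc = {
					.sync_mode = WB_SYNC_NONE,
					.nr_to_write = 1,
				};

				/* ->writepage unlocks the page for us */
				bdev_mapping->a_ops->writepage(page, &wbc);
			} else {
				unlock_page(page);
			}
		}
		pagevec_release(&pvec);
	}
}

Whether the seeks saved are worth dragging the block device mapping into
data writeback is exactly the complexity question above.
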
-chris