Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

On Fri, Oct 05, 2007 at 05:41:03PM +1000, David Chinner wrote:
> On Fri, Oct 05, 2007 at 11:36:52AM +0800, Fengguang Wu wrote:
> > On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote:
> > > On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:

> > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {    /* all-written or blockade... */
> >         if (wbc.encountered_congestion || wbc.more_io) /* blockade! */
> >                 congestion_wait(WRITE, HZ/10);
> >         else                                           /* all-written! */
> >                 break;
> > }
> 
> >From this, if we have more_io on one superblock and we skip pages on a
> different superblock, the combination of the two will causes us to stop
> writeback for a while. Is this the right thing to do?

No, the two cases will occur at the same time to a super_block.
See below.

> > We can also read the whole background_writeout() logic as
> > 
> > while (!done) {
> >         /* sync _all_ sync-able data */
> >         congestion_wait(100ms);
> > }
> 
> To me it reads as:
> 
> 	while (!done) {
> 		/* sync all data or until one inode skips */
> 		congestion_wait(up to 100ms);
> 	}
> 
> and it ignores that we might have more superblocks with dirty data
> on them that we haven't flushed because we skipped pages on
> an inode on a different block device.

AFAIK, generic_sync_sb_inodes() will simply skip the inode in trouble
and _continue_ to sync other inodes:

                if (wbc->pages_skipped != pages_skipped) {
                        /*
                         * writeback is not making progress due to locked
                         * buffers.  Skip this inode for now.
                         */
                        redirty_tail(inode);
                }

Note that there's no "break" here.

> > Sure, the queues should be filled as fast as possible.
> > How fast can we fill the queue? Let's measure it:
> > 
> > //generated by the patch below
> > 
> > [  871.430700] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 54289 global 29911 0 0 wc _M tw -12 sk 0
> > [  871.444718] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 53253 global 28857 0 0 wc _M tw -12 sk 0
> > [  871.458764] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 52217 global 27834 0 0 wc _M tw -12 sk 0
> > [  871.472797] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 51181 global 26780 0 0 wc _M tw -12 sk 0
> > [  871.486825] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 50145 global 25757 0 0 wc _M tw -12 sk 0
> > [  871.500857] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 49109 global 24734 0 0 wc _M tw -12 sk 0
> > [  871.514864] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 48073 global 23680 0 0 wc _M tw -12 sk 0
> > [  871.528889] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 47037 global 22657 0 0 wc _M tw -12 sk 0
> > [  871.542894] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 46001 global 21603 0 0 wc _M tw -12 sk 0
> > [  871.556927] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 44965 global 20580 0 0 wc _M tw -12 sk 0
> > [  871.570961] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 43929 global 19557 0 0 wc _M tw -12 sk 0
> > [  871.584992] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 42893 global 18503 0 0 wc _M tw -12 sk 0
> > [  871.599005] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 41857 global 17480 0 0 wc _M tw -12 sk 0
> > [  871.613027] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 40821 global 16426 0 0 wc _M tw -12 sk 0
> > [  871.628626] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 39785 global 15403 961 0 wc _M tw -12 sk 0
> > [  871.644439] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 38749 global 14380 1550 0 wc _M tw -12 sk 0
> > [  871.660267] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 37713 global 13326 2573 0 wc _M tw -12 sk 0
> > [  871.676236] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 36677 global 12303 3224 0 wc _M tw -12 sk 0
> > [  871.692021] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 35641 global 11280 4154 0 wc _M tw -12 sk 0
> > [  871.707824] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 34605 global 10226 4929 0 wc _M tw -12 sk 0
> > [  871.723638] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 33569 global 9203 5735 0 wc _M tw -12 sk 0
> > [  871.739708] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 32533 global 8149 6603 0 wc _M tw -12 sk 0
> > [  871.756407] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 31497 global 7126 7409 0 wc _M tw -12 sk 0
> > [  871.772165] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 30461 global 6103 8246 0 wc _M tw -12 sk 0
> > [  871.788035] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 29425 global 5049 9052 0 wc _M tw -12 sk 0
> > [  871.803896] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 28389 global 4026 9982 0 wc _M tw -12 sk 0
> > [  871.820427] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 27353 global 2972 10757 0 wc _M tw -12 sk 0
> > [  871.836728] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 26317 global 1949 11656 0 wc _M tw -12 sk 0
> > [  871.853286] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 25281 global 895 12431 0 wc _M tw -12 sk 0
> > [  871.868762] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 24245 global 58 13051 0 wc __ tw 168 sk 0
> > 
> > It's an Intel Core 2 2.93GHz CPU and a SATA disk.
> > The trace shows that
> > - there's no congestion_wait() called in wb_kupdate()
> > - it takes wb_kupdate() ~15ms to sync every 4MB 
> 
> But it takes a modern SATA disk ~40-50ms to write 4MB (80-100MB/s).
> IOWs, what you've timed above is a burst workload, not a steady
> state behaviour. And it actually shows that the elevator queues
> are growing in constrast to your goal of preventing them from
> growing.

My goal really? ;-)
 
> In more detail, the first half of the trace indicates no pages under
> writeback, that tends to imply that all I/O is complete by the
> time wb_kupdate is woken - it's been sucked into the drive
> cache as fast as possible.

Right.
 
> About half way through we start to see windup of the the number of
> pages under writeback of about 800-900 pages per printk.  That's
> 1024 pages minus 1 or 2 512k I/Os. This implies that the disk cache
> is now full and the disk has reached saturation. I/O is now
> being queued in the elevator. The last trace has 13051 pages under
> writeback, which at 128 pages per I/O is ~100 queued 512k I/Os.
> 
> The default queue depth with cfq is 128 requests, and IIRC it
> congests at 7/8s full, or 112 requests. IOWs, you file that you
> wrote was about 10MB short of what is needed to see congestion on
> your test rig.

Exactly.

wfg ~% cat /sys/block/sda/queue/nr_requests   
128
wfg ~% cat /sys/block/sda/queue/max_sectors_kb
512
 
More exactly, I was writing a huge file. It produces
balance_dirty_pages, background_writeout, and at last wb_kupdate. The
trace messages are collected after the copy completes, when
wb_kupdate() starts to sync the remaining (< background_thresh) data.

> So the trace shows we slept on neither congestion or more_io
> and it points towards congestion being the thing will typically
> block us on large file I/O. Before drawing any conclusions on
> whether wbc.more_io is needed or not, do you have any way of
> producing skipped pages when more_io is set?

No(and not that easy). (pages_skipped && more_io) events are rare indeed.

> > However, wb_kupdate() is syncing the data a bit slow(4*1000/15=266MB/s),
> > could it be because of a lot of cond_resched()?
> 
> You are using ext3? That would be my guess based simply on the write
> rate - ext3 has long been stuck at about that speed for buffered
> writes even on much faster block devices.  If I'm right, try using
> XFS and see how much differently it behaves. I bet you hit
> congestion much sooner than you expect. ;)

Yes, I was running ext3.  It seems that XFS is about the same speed:

[ 1427.278454] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 37974 global 16727 0 0 wc _M tw -4 sk 0
[ 1427.293653] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 36946 global 15704 0 0 wc _M tw -3 sk 0
[ 1427.308891] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 35919 global 14650 0 0 wc _M tw -13 sk 0
[ 1427.322462] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 34882 global 13937 0 0 wc _M tw 300 sk 0
[ 1427.338194] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 34158 global 12914 0 0 wc _M tw -9 sk 0
[ 1427.353473] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 33125 global 11860 0 0 wc _M tw -12 sk 0
[ 1427.362984] mm/page-writeback.c 668 wb_kupdate: pdflush(5606) 32089 global 11860 0 0 wc _M tw 1018 sk 0

That's 14ms per 4MB.  Maybe it's a VFS issue.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

References:
- Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
  - From: David Chinner <[email protected]>
- Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
  - From: David Chinner <[email protected]>
- Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
  - From: David Chinner <[email protected]>
- Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
  - From: David Chinner <[email protected]>

Prev by Date: [code] Unlimited partitions, a try
Next by Date: [PATCH] vc bell config
Previous by thread: Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io
Next by thread: [PATCH 2/5] writeback: fix time ordering of the per superblock inode lists 8
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]