On Tue, 10 Oct 2006 16:11:45 +0200
Jan Kara <[email protected]> wrote:
> > On Mon, Oct 09, 2006 at 02:59:25PM -0700, Badari Pulavarty wrote:
> >
> > > journal_dirty_data() would do submit_bh() ONLY if its part of the older
> > > transaction.
> > >
> > > I need to take a closer look to understand the race.
> > >
> > > BTW, is this 1k or 2k filesystem ?
> >
> > (18:41:11:davej@gelk:~)$ sudo tune2fs -l /dev/md0 | grep size
> > Block size: 1024
> > Fragment size: 1024
> > Inode size: 128
> > (18:41:16:davej@gelk:~)$
> >
> > > How easy is to reproduce the problem ?
> >
> > I can reproduce it within a few hours of stressing, but only
> > on that one box. I've not figured out exactly what's so
> > special about it yet (though the 1k block thing may be it).
> > I had been thinking it was a raid0 only thing, as none of
> > my other boxes have that.
> >
> > I'm not entirely sure how it got set up that way either.
> > The Fedora installer being too smart for its own good perhaps.
> I think it's really the 1KB block size that makes it happen.
> I've looked at journal_dirty_data() code and I think the following can
> happen:
> sync() eventually ends up in journal_dirty_data(bh) as Eric writes.
> There is finds dirty buffer attached to the comitting transaction. So it drops
> all locks and calls sync_dirty_buffer(bh).
> Now in other process, file is truncated so that 'bh' gets just after EOF.
> As we have 1kb buffers, it can happen that bh is in the partially
> truncated page. Buffer is marked unmapped and clean. But in a moment the page
> is marked dirty and msync() is called. That eventually calls
> set_page_dirty() and all buffers in the page are marked dirty.
> The first process now wakes up, locks the buffer, clears the dirty bit
> and does submit_bh() - Oops.
>
> This is essentially the same problem Badari found but in a different
> place. There are two places that are arguably wrong...
> 1) We mark buffer dirty after EOF. But actually that may be needed -
> or what is the expected behaviour when we write into mmapped file after
> EOF, then extend the file and do msync()?
yup.
> 2) We submit a buffer after EOF for IO. This should be clearly avoided
> but getting the needed info from bh is really ugly...
Things like __block_write_full_page() avoid this by checking the block's
offset against i_size. (Not racy against truncate-down because the page is
locked, not racy against truncate-up because the bh is zero and
up-to-date).
But for jbd writeout we don't hold the page lock, so checking against
bh->b_page->host->i_size is a bit racy.
hm. But we do lock the buffers in journal_invalidatepage(), so checking
i_size after locking the buffer in the writeout path might get us there.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[Index of Archives]
[Kernel Newbies]
[Netfilter]
[Bugtraq]
[Photo]
[Stuff]
[Gimp]
[Yosemite News]
[MIPS Linux]
[ARM Linux]
[Linux Security]
[Linux RAID]
[Video 4 Linux]
[Linux for the blind]
[Linux Resources]