Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file content corruption on ext3)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On Sun, 24 Dec 2006, Gordon Farquharson wrote:
> 
> Is there any way to provide any debugging information that may help
> solve the problem ?

I think we have people working on this. I know I'm trying to even come up 
with an idea of what is going on. I don't think we know yet.

> Would it help to know the nature of the corruption e.g. an analysis
> of the corruption in the file ?

I actually think we know that, because Andrei already gave details. The 
corruption seems to be basically a few pages that get zeroes at the end 
rather than the expected contents. That's consistent with the page being 
written out once, but then _not_ getting written out again despite being 
dirtied some more.

But if you see ay other pattern, please holler, because that would be 
interesting.

> BTW, I decided to try Linus's test program [1] on ARM (I don't think
> that anybody had tried it on ARM before).

You get the expected results, and in fact, I'd be very surprised if you 
didn't. It's something subtler than that going on.

I now _suspect_ that we're talking about something like

 - we started a writeout. The IO is still pending, and the page was 
   marked clean and is now in the "writeback" phase.
 - a write happens to the page, and the page gets marked dirty again. 
   Marking the page dirty also marks all the _buffers_ in the page dirty, 
   but they were actually already dirty, because the IO hasn't completed 
   yet.
 - the IO from the _previous_ write completes, and marks the buffers clean 
   again.

And no, thatr's not actually what is going on. The thing is, we actually 
clear the buffer dirty bits when we start the IO, not when we end it, but 
I think it is going to be this _kind_ of situation, where we missed 
something, and marked it clean too late, and thus cleared a dirty bit.

I don't think it's a page table issue any more, it just doesn't look 
likely with the ARM UP corruption. It's also not apparently even on a 
cacheline boundary, so it probably is really a dirty bit that got cleared 
wrogn due to some race with IO.

But right now we're all clueless. I personally suspect it's not even a new 
bug: it's probably an old bug that simply didn't matter before.
	
			Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[Index of Archives]     [Kernel Newbies]     [Netfilter]     [Bugtraq]     [Photo]     [Stuff]     [Gimp]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Video 4 Linux]     [Linux for the blind]     [Linux Resources]
  Powered by Linux