Re: msync() behaviour broken for MS_ASYNC, revert patch?

Linus Torvalds wrote:

On Sat, 11 Feb 2006, Nick Piggin wrote:
What do you mean by overlapping?
I'm just talking about the "same area gets re-dirtied while it's alreadybusy being written". Depending on the _program_, this:


OK, that's what I thought.

- So let's say that you append to the log, and chose (for some reason,never mind) to let the kernel know. So you effectively do somethinglike
		memcpy(logptr, newentry, newentrysize);
		logptr = logptr + newentrysize;
		if (time_to_msync) {
			msync(msyncptr, logptr - msyncptr, MS_ASYNC);
			msyncptr = logptr;
		}
Ok?
Now, the question is, what do we want to happen at the MS_ASYNC.


Well it is all very well to just make up this case but I don't
see what it proves (eg. some application may actually work better
if creat is implemented with unlink to be extreme). But I'll try
to humour you :)

Being a logging program it would appear to have some data integrity
requirements, and as such I would guess it is going to use MS_SYNC
in the very near future before writing another entry to the log
(in case a crash happens while generating the next entry).

However the fact the MS_ASYNC is even called in the first place
indicates to me that there must be some window before the MS_SYNC
point (for whatever reason). So I would really want MS_ASYNC to
actually send the page to backing store asap in order to get some
pipelining going.

In particular, what happens if the _previous_ MS_ASYNC had started the IO(either directly, like in your world, or by bdflush just picking it up,it really doesn't matter) on the _previous_ old end of the log area, sothe partial page at the old "msyncptr" point may actually be under IOstill.


I wouldn't expect any IO there at all because there would be no
"dangling" MS_ASYNC, and there would be no random clowns writing
to our very important log. But just on the off chance that there
was some IO going on:

We have multiple choices:
- we ignore the issue (which is what the current behaviour for MS_ASYNCis, since it just marks things dirty in the page cache)- we mark the page dirty, but we don't start IO on it, since it's busy(and since it's dirty, it will _eventually_ get written out)
 - we actually wait for the old IO, in order to start IO on it again.
Now, I don't think that the third option is sane for MS_ASYNC (ie I don'tthink even you want -that- behaviour), but in general, all these threechoices are actually sane. Notice how none of them actually involvewaiting for the new _result_. It's only a question about whether to waitfor an old write when we start a new one, or leave the new one entirely_unstarted_.


3 is obviously wrong because it blocks. 1 is what we have now which
I'm arguing against (on efficiency grounds). So that leaves us with 2,
which is an acceptable compromise for a situation which isn't likely
to come up much with a well coded app.

fadvise(fd, 100, 200, FADV_ASYNC_WRITE);
fadvise(fd, 300, 400, FADV_ASYNC_WRITE);
fadvise(fd, 100, 200, FADV_WRITE_WAIT);
fadvise(fd, 300, 400, FADV_WRITE_WAIT);
I'm saying that a valid pattern is

	.. dirty offset 100-200 ..
	fadvice(fd, 100, 200, FADV_WRITE_START_TRY);

	.. dirty offset 200-300 ..
	fadvice(fd, 200, 300, FADV_WRITE_START_TRY);

	.. dirty offset 300-400 ..
	fadvice(fd, 300, 400, FADV_WRITE_START_TRY);
is a valid thing to do ("try to start IO, but don't guarantee it") as away to get things going. But that would never pair up with a "wait forIO", because there's no guarantee that the IO got started (for example, wemay have started the IO when only bytes 100-200 were dirty, then wedirtied the other bytes, but we didn't re-start the IO for them becausethe previous IO to the same page was still pending, so the bytes never hitstorage and they aren't even outstanding).
But the "FADV_WRITE_START_TRY" is actually the best thing if what you aretrying to do is to keep changes _minimal_ so that when you later acutallyfinish the whole thing, you can do
	fadvice(fd, 100, 400, FADV_WRITE_WAIT);

which is your "write and wait".


[argh! that was actually Andrew's "wait for writeout", but OK ;)]

So far so good, and we don't actually care. The unconditional "write andwait" at the end means that it's irrelevant whether the "START_TRY" thingactually started the IO or not - the START_TRY thing _can_ be a no-op ifyou want to.
These sound like the semantics you want. No?


Yes (and I'd likewise argue that an efficient FADV_WRITE_START_TRY
implementation should really try to get IO going. Ie. exactly what
I'm arguing for MS_ASYNC).

And yes, I'm perfectly happy with them. I think this is what people woulddo. I just wanted to make sure that we're AWARE of the fact that itimplies that the ASYNC thing wouldn't necessarily always even start IO.


True. I believe our MS_ASYNC is technically within the standards.
I think it is suboptimal for sane users and against the spirit
of the spec.

And the reason I wanted to make sure of that is that the whole threadstarted from you complaining about MS_ASYNC not starting the IO. I'msaying that if you _require_ starting of IO, then the FADV_WRITE_WAITactually sensibly has different semantics, which can be a lot cheaper todo in the presense of other writers (ie then the write-wait would onlyneed to wait for any outstanding IO, not start writing out stuff thatsomebody else had written).
And the reason I wanted to take up the semantic difference is becausethere _are_ semantic differences.
If you only "commit" things when you have nothing dangling, you'll see theabove patterns. But it's a valid thing to commit things after you've made"further" log changes (that you're _not_ ready to commit). For example,


I don't think so in general because userspace can't guarantee something
*is not* sent to backing store, only that it *is*.

say that your log is really dirtying all the time, but you synchronize itat certain points and write the pointer to the synchronized statesomewhere else. What would you do?
Your pattern would actually be

	.. dirty offset 100-200 ..
	fadvice(fd, 100, 200, FADV_WRITE_START);

	.. dirty offset 200-300 ..
	fadvice(fd, 200, 300, FADV_WRITE_START);

	.. dirty offset 300-400 ..
	fadvice(fd, 300, 400, FADV_WRITE_START);

	.. dirty offset 400-415 .. (for the next transaction)


- IOW if the app or OS crashed here it would be possible to see 400-415 on
the disk and none of the previous transactions (assuming we don't know
the page size).

- If you are saying that the app does know the page size, then it is
obvious that it is by no stretch of the imagination hand optimising
IO, because it will have started 3 different IOs for the same page.

- Or (final option) only the first fadvise started IO, then any or all
of the subsequent transactions might not be synched after FADV_JUST_WAIT
(depending on what the DMA to disk saw).

	fadvice(fd, 100, 400, FADV_JUST_WAIT); (for the previous one)
and here is where the semantics differ. The "always start IO, and justwait for IO" won't be waiting for the partial stuff (that doesn't matter).While the "write and wait" would synchronously write stuff that we justdon't care about (just because they happen to be on the same "IOgranularity" block).
This "unconditional write start" + "unconditional wait only" pattern intheory allows you to optimize all the IO patterns by hand, and have lesssynchronous waits, because it wouldn't wait for state that is dirty, butthat doesn't matter.


I'm not convinced. You above example was bogus.

But as long as people are _aware_ of this issue, I don't much care.
			Linus



--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com-

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Follow-Ups:
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Linus Torvalds <[email protected]>

References:
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: [email protected]
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Andrew Morton <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Linus Torvalds <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Linus Torvalds <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Nick Piggin <[email protected]>
- Re: msync() behaviour broken for MS_ASYNC, revert patch?
  - From: Linus Torvalds <[email protected]>

Prev by Date: Re: 2.6.16-rc2-mm1
Next by Date: Re: a couple of oopses with 2.6.14
Previous by thread: Re: msync() behaviour broken for MS_ASYNC, revert patch?
Next by thread: Re: msync() behaviour broken for MS_ASYNC, revert patch?
Index(es):
- Date
- Thread

[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]