Re: [RFC] fsblock — Linux Kernel

On 27 Jun 2007, at 12:50, Chris Mason wrote:

On Wed, Jun 27, 2007 at 07:32:45AM +0200, Nick Piggin wrote:

On Tue, Jun 26, 2007 at 08:34:49AM -0400, Chris Mason wrote:

On Tue, Jun 26, 2007 at 07:23:09PM +1000, David Chinner wrote:

On Tue, Jun 26, 2007 at 01:55:11PM +1000, Nick Piggin wrote:

[ ... fsblocks vs extent range mapping ]

iomaps can double as range locks simply because iomaps are
expressions of ranges within the file.  Seeing as you can only
access a given range exclusively to modify it, inserting an empty
mapping into the tree as a range lock gives an effective method of
allowing safe parallel reads, writes and allocation into the file.

The fsblocks and the vm page cache interface cannot be used to
facilitate this because a radix tree is the wrong type of tree to
store this information in. A sparse, range based tree (e.g. btree)
is the right way to do this and it matches very well with
a range based API.
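
To make the "empty mapping as a range lock" idea concrete, here is a minimal userspace sketch. It is not code from fsblock, XFS, or any kernel tree: the names range_set, range_trylock and range_unlock are hypothetical, and a sorted linked list stands in for the sparse range tree (e.g. btree) argued for above, purely to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an iomap: a byte range [start, end) of a file. */
struct range_node {
    uint64_t start;
    uint64_t end;               /* exclusive */
    struct range_node *next;    /* list is kept sorted by start */
};

struct range_set {
    struct range_node *head;
};

/*
 * Try to lock [start, end) by inserting an empty mapping for it.
 * Returns true on success, false if the range overlaps a held one
 * (or if memory allocation fails).
 */
static bool range_trylock(struct range_set *set, uint64_t start, uint64_t end)
{
    struct range_node **link = &set->head;
    struct range_node *node;

    assert(start < end);

    /* Skip held ranges that lie entirely before the requested one. */
    while (*link && (*link)->end <= start)
        link = &(*link)->next;

    /* The next held range overlaps if it begins before our end. */
    if (*link && (*link)->start < end)
        return false;

    node = malloc(sizeof(*node));
    if (!node)
        return false;
    node->start = start;
    node->end = end;
    node->next = *link;
    *link = node;
    return true;
}

/* Release a range that was locked with exactly these bounds. */
static bool range_unlock(struct range_set *set, uint64_t start, uint64_t end)
{
    struct range_node **link = &set->head;

    while (*link) {
        struct range_node *node = *link;

        if (node->start == start && node->end == end) {
            *link = node->next;
            free(node);
            return true;
        }
        link = &node->next;
    }
    return false;
}
```

Two writers asking for disjoint ranges both succeed, while a request that overlaps a held range is refused until the holder unlocks. A real implementation would sleep instead of failing and would use a tree so that lookups stay cheap on large sparse files.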

I'm really not against the extent based page cache idea, but I kind of
assumed it would be too big a change for this kind of generic setup. At
any rate, if we'd like to do it, it may be best to ditch the idea of
"attach mapping information to a page", and switch to "lookup mapping
information and range locking for a page".

Well the get_block equivalent API is extent based one now, and I'll
look at what is required in making map_fsblock a more generic call
that could be used for an extent-based scheme.

An extent based thing IMO really isn't appropriate as the main generic
layer here though. If it is really useful and popular, then it could
be turned into generic code and sit along side fsblock or underneath
fsblock...

Let's look at a typical example of how IO actually gets done today,
starting with sys_write():

Yes, this is very inefficient, which is one of the reasons I don't use
the generic file write helpers in NTFS.

Another reason is that supporting logical block sizes larger than
PAGE_CACHE_SIZE becomes a pain if it is not done this way. When the
write targets a hole, all pages in the hole have to be locked
simultaneously, which would mean dropping the page lock to acquire the
others that are of lower page index and then re-taking the page lock,
which is horrible. It is much better to lock them all at once from the
outset.

The other reason is that in NTFS there is such a thing as the
initialized size of an attribute, which basically states "anything past
this byte offset must be returned as 0 on read", i.e. it does not have
to be read from disk at all. On a write beyond the initialized_size you
have to zero on disk everything between the old initialized size and
the start of the write before you begin writing, and certainly before
you update the initialized_size, otherwise a concurrent read would see
random old data from the disk.
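
The initialized size rule can be written down as two small calculations. This is only an illustrative userspace sketch of the rule as described above, not code from the NTFS driver; the names zero_range, zero_before_write and bytes_from_disk are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of an on-disk range that must be zeroed. */
struct zero_range {
    uint64_t start;
    uint64_t len;   /* 0 means nothing needs zeroing */
};

/*
 * Range to zero on disk before a write starting at write_pos may
 * begin: everything from the old initialized size up to the write.
 */
static struct zero_range zero_before_write(uint64_t initialized_size,
                                           uint64_t write_pos)
{
    struct zero_range r = { initialized_size, 0 };

    if (write_pos > initialized_size)
        r.len = write_pos - initialized_size;
    return r;
}

/*
 * How many bytes of a read of len bytes at pos actually come from
 * disk. The remainder lies past the initialized size and is returned
 * as zeroes without touching the disk.
 */
static uint64_t bytes_from_disk(uint64_t initialized_size,
                                uint64_t pos, uint64_t len)
{
    uint64_t avail;

    /* The read range must not wrap around the 64-bit offset space. */
    assert(len <= UINT64_MAX - pos);

    if (pos >= initialized_size)
        return 0;
    avail = initialized_size - pos;
    return len < avail ? len : avail;
}
```

The ordering constraint is the important part: the zeroing described by zero_before_write() has to reach the disk before the initialized size is moved forward, or a concurrent reader could see stale data.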

For NTFS this effectively becomes:

sys_write(file, buffer, 1MB)
    allocate space for the entire 1MB write
    if write offset is past the initialized_size:
        zero out on disk starting at initialized_size up to the
        start offset for the write, and update the initialized
        size to be equal to the start offset of the write
    do {
        if (current position is in a hole and the NTFS logical
            block size is > PAGE_CACHE_SIZE) {
            work on (NTFS logical block size / PAGE_CACHE_SIZE) pages in one go;
            do_pages = vol->cluster_size / PAGE_CACHE_SIZE;
        } else {
            work on only one page;
            do_pages = 1;
        }
        fault in for read (do_pages * PAGE_CACHE_SIZE) bytes worth
        of source pages
        grab do_pages worth of pages
        prepare_write - attach buffers to grabbed pages
        copy data from source to grabbed & prepared pages
        commit_write the copied pages by dirtying their buffers
    } while (data left to write);
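
The do_pages decision in the loop above can be sketched as a plain C helper. This is a userspace illustration only: SKETCH_PAGE_SIZE is an assumed stand-in for PAGE_CACHE_SIZE (4096 here), and pages_per_step is a hypothetical name, not a function from the NTFS driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed stand-in for PAGE_CACHE_SIZE */

/*
 * Number of pages to grab and lock together in one loop iteration.
 * A whole cluster's worth of pages is only needed when the write
 * lands in a hole and a cluster spans more than one page.
 */
static size_t pages_per_step(bool in_hole, size_t cluster_size)
{
    assert(cluster_size > 0);

    if (in_hole && cluster_size > SKETCH_PAGE_SIZE)
        return cluster_size / SKETCH_PAGE_SIZE;
    return 1;
}
```

With a 64KiB cluster and 4KiB pages, a write into a hole would work on 16 pages per iteration, while the same write into already allocated space would proceed one page at a time.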

The allocation in advance is a huge win both in terms of avoiding
fragmentation (NTFS still uses a very simple/stupid allocator so you
get a lot of fragmentation if two processes write to different files
simultaneously and do so in small chunks) and in terms of performance.

I have wondered whether I should perhaps turn the "multi page" stuff
on for all writes rather than just for ones that go into a hole when
the logical block size is greater than PAGE_CACHE_SIZE, as that might
improve performance even further, but I haven't had the
time/inclination to experiment...

And I have also wondered whether to go direct to bio/whole pages at
once instead of bothering with dirtying each buffer, but the buffers
(which are always 512 bytes on NTFS) allow me to easily support
dirtying smaller parts of the page. That is desired at least on
volumes with a logical block size < PAGE_CACHE_SIZE, as different bits
of the page could then reside in completely different locations on
disk, so writing out unneeded bits of the page could result in a lot
of wasted disk head seek time.
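
As a rough illustration of why per-buffer dirtying helps, the sketch below works out which 512-byte buffers of a single page a write touches. It is a userspace example under stated assumptions: a 4096-byte page (SKETCH_PAGE_SIZE), and a hypothetical helper name dirty_buffer_mask that does not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096u  /* assumed stand-in for PAGE_CACHE_SIZE */
#define SKETCH_BUFFER_SIZE 512u   /* buffer size stated for NTFS above */

/*
 * Bitmask of the buffers within one page that a write of len bytes at
 * byte offset (relative to the start of the page) would dirty.
 * Bit i set means buffer i must be written out.
 */
static uint32_t dirty_buffer_mask(uint32_t offset, uint32_t len)
{
    uint32_t first, last, i, mask = 0;

    /* The write must be non-empty and stay inside the page. */
    assert(len > 0);
    assert(offset < SKETCH_PAGE_SIZE);
    assert(len <= SKETCH_PAGE_SIZE - offset);

    first = offset / SKETCH_BUFFER_SIZE;
    last = (offset + len - 1) / SKETCH_BUFFER_SIZE;
    for (i = first; i <= last; i++)
        mask |= 1u << i;
    return mask;
}
```

A 2-byte write straddling the first buffer boundary dirties only buffers 0 and 1, so only 1KiB of the page needs to go to disk rather than all eight buffers, which may live in different places on disk.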

Best regards,

	Anton

for each page:
    prepare_write()
        allocate contiguous chunks of disk
        attach buffers
    copy_from_user()
    commit_write()
        dirty buffers

pdflush:
    writepages()
        find pages with contiguous chunks of disk
        build and submit large bios

So, we replace prepare_write and commit_write with an extent based api,
but we keep the dirty each buffer part.  writepages has to turn that
back into extents (bio sized), and the result is completely full of dark
dark corner cases.
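
The "turn that back into extents" step amounts to merging runs of dirty buffers whose disk blocks are adjacent. The sketch below shows only that merging, in userspace C; struct extent and build_extents are hypothetical names, and real writepages code also has to deal with locking, bio size limits and error handling, which is where the corner cases come from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent: a run of physically contiguous disk blocks. */
struct extent {
    uint64_t start_block;
    uint32_t nr_blocks;
};

/*
 * Merge the disk block numbers of dirty buffers (given in file order)
 * into extents. A block extends the current extent only if it directly
 * follows it on disk. Returns the number of extents written to out,
 * stopping early if more than max_out would be needed.
 */
static size_t build_extents(const uint64_t *blocks, size_t nr,
                            struct extent *out, size_t max_out)
{
    size_t n = 0;
    size_t i;

    assert(blocks != NULL && out != NULL);

    for (i = 0; i < nr; i++) {
        if (n > 0 &&
            out[n - 1].start_block + out[n - 1].nr_blocks == blocks[i]) {
            out[n - 1].nr_blocks++;     /* grow the current extent */
            continue;
        }
        if (n == max_out)
            break;                      /* no room for a new extent */
        out[n].start_block = blocks[i];
        out[n].nr_blocks = 1;
        n++;
    }
    return n;
}
```

Each resulting extent corresponds to one candidate large bio. If the extent information had been kept from prepare_write onwards, this reconstruction would not be needed at all.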

I do think fsblocks is a nice cleanup on its own, but Dave has a good
point that it makes sense to look for ways to generalize things even
more.

-chris

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/



