On Thu, Aug 23, 2007 at 08:13:41AM -0400, Chris Mason wrote:
> On Thu, 23 Aug 2007 12:47:23 +1000
> David Chinner <[email protected]> wrote:
>
> > On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
> > > I think we should assume a full scan of s_dirty is impossible in the
> > > presence of concurrent writers. We want to be able to pick a start
> > > time (right now) and find all the inodes older than that start time.
> > > New things will come in while we're scanning. But perhaps that's
> > > what you're saying...
> > >
> > > At any rate, we've got two types of lists now. One keeps track of
> > > age and the other two keep track of what is currently being
> > > written. I would try two things:
> > >
> > > 1) s_dirty stays a list for FIFO. s_io becomes a radix tree that
> > > indexes by inode number (or some arbitrary field the FS can set in
> > > the inode). Radix tree tags are used to indicate which things in
> > > s_io are already in progress or are pending (hand waving because
> > > I'm not sure exactly).
> > >
> > > inodes are pulled off s_dirty and the corresponding slot in s_io is
> > > tagged to indicate IO has started. Any nearby inodes in s_io are
> > > also sent down.
> >
> > The problem with this approach is that it only looks at inode
> > locality. Data locality is ignored completely here and the data for
> > all the inodes that are close together could be splattered all over
> > the drive. In that case, clustering by inode location is exactly the
> > wrong thing to do.
>
> Usually it won't be more wrong than clustering by time.
>
> >
> > > For example, XFS changes allocation strategy at 1TB for 32bit inode
> > > filesystems which makes the data get placed way away from the inodes.
> > > i.e. inodes in AGs below 1TB, all data in AGs > 1TB. Clustering
> > > by inode number for data writeback is mostly useless in the >1TB
> > > case.
>
> I agree we'll want a way to let the FS provide the clustering key. But
> for the first cut on the patch, I would suggest keeping it simple.
>
> >
> > > The inode32 for <1TB and inode64 allocators both try to keep data
> > close to the inode (i.e. in the same AG) so clustering by inode number
> > might work better here.
> >
> > Also, it might be worthwhile allowing the filesystem to supply a
> > hint or mask for "closeness" for inode clustering. This would help
> > > the generic code only try to cluster inode writes to inodes that
> > fall into the same cluster as the first inode....
>
> Yes, also a good idea after things are working.
>
> >
> > > > Notes:
> > > > (1) I'm not sure inode number is correlated to disk location in
> > > > filesystems other than ext2/3/4. Or parent dir?
> > >
> > > In general, it is a better assumption than sorting by time. It may
> > > make sense to one day let the FS provide a clustering hint
> > > (corresponding to the first block in the file?), but for starters it
> > > makes sense to just go with the inode number.
> >
> > Perhaps multiple hints are needed - one for data locality and one
> > for inode cluster locality.
>
> So, my feature creep idea would have been more data clustering. I'm
> mainly trying to solve this graph:
>
> http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png
>
> Where background writing of the block device inode is making ext3 do
> seeky writes while creating directory trees. My simple idea was to kick
> off a 'I've just written block X' callback to the FS, where it may
> decide to send down chunks of the block device inode that also
> happen to be dirty.
>
> But, maintaining the kupdate max dirty time and congestion limits in
> the face of all this clustering gets tricky. So, I wasn't going to
> suggest it until the basic machinery was working.
>
> Fengguang, this isn't a small project ;) But, lots of people will be
> interested in the results.

Exactly, the current writeback logic is unsatisfactory in many ways.
As for writeback clustering, inode locality and data locality can differ.
But I'll follow your suggestion to start simple first and give the
idea a spin on ext3.
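
For illustration only, here is a rough userspace C sketch of the "start
simple" scheme discussed in this thread: walk the dirty FIFO oldest-first,
pick every inode dirtied before a chosen start time, and also send down any
still-dirty inodes whose number falls in the same cluster. Everything in it
is an assumption made for the example: struct dirty_inode,
pick_for_writeback() and the cluster_mask parameter are hypothetical names,
not kernel interfaces, and a plain array scan stands in for the s_dirty
list and the tagged radix tree lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a dirty inode; NOT the kernel's struct inode. */
struct dirty_inode {
    uint64_t ino;        /* inode number, used here as the clustering key */
    uint64_t dirtied_at; /* time at which the inode was first dirtied */
    int      queued;     /* nonzero once picked for writeback */
};

/* Two keys are "close" when they agree on every bit kept by cluster_mask. */
static int same_cluster(uint64_t a, uint64_t b, uint64_t cluster_mask)
{
    return (a & cluster_mask) == (b & cluster_mask);
}

/*
 * fifo[] is ordered oldest-first, like s_dirty. Pick every inode dirtied
 * before `start`; for each one, also pick any not-yet-queued inode in the
 * same cluster, regardless of its age. Indices of picked entries are
 * written to out[] in submission order. Returns the number picked.
 *
 * The inner linear scan is a placeholder for a keyed lookup (for example a
 * radix tree gang lookup); it is quadratic and only meant to show the idea.
 */
size_t pick_for_writeback(struct dirty_inode *fifo, size_t n,
                          uint64_t start, uint64_t cluster_mask,
                          size_t *out, size_t out_cap)
{
    size_t picked = 0;

    assert(fifo != NULL && out != NULL);

    for (size_t i = 0; i < n && picked < out_cap; i++) {
        if (fifo[i].queued || fifo[i].dirtied_at >= start)
            continue; /* already in flight, or dirtied after the scan began */

        fifo[i].queued = 1;
        out[picked++] = i;

        /* Send down nearby inodes that also happen to be dirty. */
        for (size_t j = 0; j < n && picked < out_cap; j++) {
            if (fifo[j].queued)
                continue;
            if (same_cluster(fifo[j].ino, fifo[i].ino, cluster_mask)) {
                fifo[j].queued = 1;
                out[picked++] = j;
            }
        }
    }
    return picked;
}
```

With cluster_mask = ~(uint64_t)0xff, inodes 100 and 101 land in the same
cluster, so a young inode 101 is written together with an old inode 100,
while a young inode with no old neighbour stays on the list. A filesystem
whose data is not placed near its inodes would want a different key than
the inode number, which is the point of the hint idea above.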
-fengguang