Re: Distributed storage. Move away from char device ioctls.

On Sep 15, 2007, at 13:24:46, Andreas Dilger wrote:

On Sep 15, 2007  16:29 +0400, Evgeniy Polyakov wrote:
Yes, block device itself is not able to scale well, but it is the place for redundancy, since filesystem will just fail if underlying device does not work correctly and FS actually does not know about where it should place redundancy bits - it might happen to be the same broken disk, so I created a low-level device which distribute requests itself.
I actually think there is a place for this - and improvements are definitely welcome. Even Lustre needs block-device level redundancy currently, though we will be working to make Lustre-level redundancy available in the future (the problem is WAY harder than it seems at first glance, if you allow writeback caches at the clients and servers).

I really think that to get proper non-block-device-level filesystem redundancy you need to base it on something similar to the GIT model. Data replication is done in specific-sized chunks indexed by SHA-1 sum and you actually have a sort of "merge algorithm" for when local and remote changes differ. The OS would only implement a very limited list of merge algorithms, i.e. one of:


(A) Don't merge, each client gets its own branch and merges are manual
(B) Most recent changed version is made the master every X-seconds/open/close/write/other-event.
(C) The tree at X (usually a particular client/server) is always used as the master when there are conflicts.
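
To make the chunk-plus-policy idea concrete, here is a rough userspace sketch in Python. Everything in it is an assumption for illustration only: the 4096-byte chunk size, the names `store_chunks`, `resolve` and `MergePolicy`, and the representation of a version as an `(mtime, chunk_keys)` pair. None of it is an existing kernel or GIT interface.

```python
import hashlib
from enum import Enum

CHUNK_SIZE = 4096  # assumed; the post only says "specific-sized chunks"


class MergePolicy(Enum):
    MANUAL = "A"        # each client keeps its own branch, merges are manual
    MOST_RECENT = "B"   # most recently changed version becomes the master
    FIXED_MASTER = "C"  # a designated tree always wins on conflict


def store_chunks(data, store):
    """Split data into fixed-size chunks and index each by its SHA-1 hex digest."""
    keys = []
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        key = hashlib.sha1(chunk).hexdigest()
        store[key] = chunk  # identical chunks collapse into one entry
        keys.append(key)
    return keys


def resolve(policy, local, remote, master="remote"):
    """Pick a winner between two versions, each an (mtime, chunk_keys) pair.

    Returns None when the policy leaves the conflict for a manual merge.
    """
    if local[1] == remote[1]:
        return local  # same content, nothing to merge
    if policy is MergePolicy.MANUAL:
        return None
    if policy is MergePolicy.MOST_RECENT:
        return local if local[0] >= remote[0] else remote
    return remote if master == "remote" else local
```

Because chunks are keyed by their hash, two versions differ exactly where their key lists differ, which is what lets a simple policy decide without looking at the data itself.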

This lets you implement whatever replication policy you want: You can require that some files are replicated (cached) on *EVERY* system, you can require that other files are cached on at least X systems. You can say "this needs to be replicated on at least X% of the online systems, or at most Y". Moreover, the replication could be done pretty easily from userspace via a couple syscalls. You also automatically keep track of history with some default purge policy.
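
As a sketch of how such a policy might be evaluated from userspace, the Python function below turns the "at least X systems", "at least X%" and "at most Y" rules into a target copy count. The function name and its parameters are hypothetical, and integer percent is used purely to avoid rounding surprises.

```python
def replica_target(online, min_copies=0, min_percent=0, max_copies=None,
                   everywhere=False):
    """Return how many of the `online` systems should hold a copy."""
    if everywhere:
        return online  # "replicated on *EVERY* system"
    # Round the percentage requirement up to a whole number of systems.
    pct_copies = (min_percent * online + 99) // 100
    want = max(min_copies, pct_copies)
    if max_copies is not None:
        want = min(want, max_copies)  # "or at most Y"
    return min(want, online)  # cannot place more copies than systems online
```

For example, `replica_target(10, min_percent=25)` asks for 3 copies, and adding `max_copies=2` caps that at 2.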

The main point is that for efficiency and speed things are *not* always replicated; this also allows for offline operation. You would of course have "userspace" merge drivers which notice that the tree on your laptop is not a subset/superset of the tree on your desktop and do various merges based on per-file metadata. My address-book, for example, would have a custom little merge program which knows about how to merge changes between two address book files, asking me useful questions along the way. Since a lot of this merging is mechanical, some of the code from GIT could easily be made into a "merge library" which knows how to do such things.
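
The address-book case could look roughly like the following three-way merge, sketched in Python. It assumes the address book can be read as a `{name: entry}` dict and that a common ancestor version (`base`) is available, as it would be with GIT-style history; `merge_address_books` is a made-up name, not part of any existing tool.

```python
def merge_address_books(base, mine, theirs):
    """Three-way merge of {name: entry} dicts.

    Returns (merged, conflicts). `conflicts` lists names changed differently
    on both sides, which a real merge driver would ask the user about.
    """
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(mine) | set(theirs)):
        b, m, t = base.get(name), mine.get(name), theirs.get(name)
        if m == t:
            winner = m  # both sides agree (including both deleted)
        elif m == b:
            winner = t  # only the other side changed it
        elif t == b:
            winner = m  # only this side changed it
        else:
            conflicts.append(name)
            winner = m  # keep the local value until the user decides
        if winner is not None:
            merged[name] = winner
    return merged, conflicts
```

Only the entries that end up in `conflicts` need a question to the user; everything else is the mechanical part that a shared "merge library" could handle.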

Moreover, this would allow me to have a "shared" root filesystem on my laptop and desktop. It would have 'sub-project'-type trees, so that "/" would be an independent branch on each system. "/etc" would be separate branches but manually merged git-style as I make changes. "/home/*" folders would be auto-created as separate subtrees so each user can version their own individually. Specific subfolders (like address-book, email, etc) would be adjusted by the GUI programs that manage them to be separate subtrees with manual-merging controlled by that GUI program.

Backups/dumps/archival of such a system would be easy. You would just need to clone the significant commits/trees/etc to a DVD and replace the old SHA-1-indexed objects with tiny "object-deleted" stubs; to rollback to an archived version you insert the DVD, "mount" it into the existing kernel SHA-1 index, and then mount the appropriate commit as a read-only volume somewhere to access. The same procedure would also work for wide-area-network backups and such.

The effective result would be the ability to do things like the following:
(A) Have my homedir synced between both systems mostly-automatically as I make changes to different files on both systems
(B) Easily have 2 copies of all my files, so if one system's disk goes kaput I can just re-clone from the other.
(C) Keep archived copies of the last 5 years worth of work, including change history, on a stack of DVDs.
(D) Synchronize work between locations over a relatively slow link without much work.

As long as files were indirectly indexed by sub-block SHA1 (with the index depth based on the size of the file), and each individually-SHA1-ed object could have references, you could trivially have a 4TB-sized file where you modify 4 bytes at a thousand random locations throughout the file and only have to update about 5MB worth of on-disk data. The actual overhead for that kind of operation under any existing filesystem would be 100% seek-dominated regardless, whereas with this mechanism you would not directly be overwriting data and so you could append all the updates as a single 5MB chunk. Data reads would be much more seek-y, but you could trivially have an on-line defragmenter tool which notices fragmented commonly-accessed inode objects and creates non-fragmented copies before deleting the old ones.
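
A back-of-the-envelope check of that figure, sketched in Python. The chunk size is not stated above, so it is a free parameter here; the sketch also assumes each index object is one chunk packed with 20-byte SHA-1 references and that every write touches a different leaf. `update_cost` is a hypothetical helper.

```python
SHA1_BYTES = 20  # size of one SHA-1 reference inside an index object


def update_cost(file_size, chunk_size, writes):
    """Return (index depth, upper bound on bytes rewritten).

    Each small write replaces one leaf chunk plus one index object per
    level on the path up to the root.
    """
    fanout = chunk_size // SHA1_BYTES      # references per index object
    leaves = -(-file_size // chunk_size)   # ceiling division
    depth, covered = 0, 1
    while covered < leaves:                # index levels needed above the leaves
        covered *= fanout
        depth += 1
    per_write = (depth + 1) * chunk_size
    return depth, per_write * writes


print(update_cost(4 * 2**40, 512, 1000))   # (8, 4608000)
print(update_cost(4 * 2**40, 4096, 1000))  # (4, 20480000)
```

Under these assumptions 512-byte chunks give roughly 4.6MB for a thousand writes, close to the 5MB quoted, while 4096-byte chunks give about 20MB. Both are upper bounds, since index objects near the root are shared between writes and only need rewriting once per batch.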

There's a lot of other technical details which would need resolution in an actual implementation, but this is enough of a summary to give you the gist of the concept. Most likely there will be some major flaw which makes it impossible to produce reliably, but the concept contains the things I would be interested in for a real "networked filesystem".


Cheers,
Kyle Moffett

