[SCM saga] An alternative to git

[I sent this from my gmail account last night, but it did not appear to make 
it to the list, so I'm re-sending.  Sorry if you get this twice.]

Linus,

Background:
-----------
The SCM problem is something I've been toying with for a while now on my own 
because I wanted a distributed SCM tool, but didn't like one thing or another 
about the existing projects.  I thought monotone came very close to what I 
wanted, but I didn't like the repository as a binary blob.  So I've been 
playing with my own prototype SCM project.  I'd like to know what you think 
of this approach, and maybe observations on this prototype implementation.

The Repository:
---------------
The on-disk format looks like this:
linux-repository

linux_kernel_repo/
|-- cache
|-- nodes ("revisions" would be a better name)
|-- patches
`-- tags

Files are created in the repository, and never modified once they are there.  
Files are never deleted from the repository (with the possible exception of 
stuff in the cache directory).

The patches directory contains <hashofself>.patch files, which are basically
[comments]
[diffstat]
[diff -aurN]

The nodes directory contains <hashofself>.node files, which have
[author] [date]
[<hash>.node <hash>.patch]
... (There can be 0 or more [node,patch] tuples.)

The tags directory contains files named like this:
[email protected]
(i.e: <epoch time>-<author>-<tag-name>.tag)
that contain the name of a node: <hash>.node

Then finally, there is the cache directory.  With my current implementation, 
this area is used to cache unpacked revisions of the source tree for diffing 
and the like.  The contents of these directories are hard-linked to conserve 
some space.

Repository Implications:
------------------------
This repository layout allows for a number of neat features.
Like git, you can use rsync to push and pull everything from one repository to 
another.  (You'll probably want to exclude (most of) the cache directory.)
But even better than that, you can pull a cached node from a repository into a 
new repository, (such as linux-2.6.11) then pull your current head revision 
and patches tracking back to that node, giving you a minimal repository with 
just the history you want.  Then when you're happy with what you have, push 
it out.
That would let you have every revision of Linux known to man in a public 
repository somewhere (kernel.org?), but you could keep a much reduced copy of 
the repository on your working machine.
And we can do all of that without writing a dedicated SCM server: ftp, sftp, 
rsync, etc. should all work.

Since the patches don't reference a node in themselves, they can be applied to 
nodes other than the one they were created from.  I believe this will allow 
merging to work better than git would allow because we can track that the 
same patch was applied to both of two lines of development that we're trying 
to merge.

Current performance isn't very good, but I haven't really optimized it yet, 
either.  The numbers below are from commits to my repository on a 2.6GHz 
Celeron w/768MB RAM.

Commit 2.6.0	95.31user 354.33system 9:07.91elapsed 82%CPU
Commit 2.6.1	61.53user 139.10system 6:01.20elapsed 55%CPU
Commit 2.6.2	21.48user 32.11system 1:02.36elapsed 85%CPU
Commit 2.6.3	22.40user 31.51system 1:00.45elapsed 89%CPU
Commit 2.6.4	26.41user 42.79system 1:17.18elapsed 89%CPU
Commit 2.6.5	24.16user 38.97system 1:13.21elapsed 86%CPU
Commit 2.6.6	25.37user 38.99system 1:15.28elapsed 85%CPU
Commit 2.6.7	35.83user 55.06system 1:46.76elapsed 85%CPU
Commit 2.6.8	43.94user 70.02system 2:27.96elapsed 77%CPU
Commit 2.6.9	43.90user 75.02system 2:42.44elapsed 73%CPU
Commit 2.6.10	52.16user 89.46system 3:03.61elapsed 77%CPU
Commit 2.6.11	54.47user 93.21system 3:25.57elapsed 71%CPU
Commit 2.6.11.1	14.06user 29.82system 1:04.60elapsed 67%CPU
Commit 2.6.11.2	5.42user 12.66system 0:32.30elapsed 56%CPU
Commit 2.6.11.3	1.91user 1.35system 0:06.49elapsed 50%CPU
Commit 2.6.11.4	1.82user 1.41system 0:06.84elapsed 47%CPU
Commit 2.6.11.5	1.91user 1.31system 0:04.26elapsed 75%CPU
Commit 2.6.11.6	1.89user 1.35system 0:06.94elapsed 46%CPU
Commit 2.6.11.7	2.07user 1.50system 0:07.44elapsed 48%CPU

Doing some profiling leads me to believe that there is a lot of room for 
improvement for the nothing -> 2.6.0 and the 2.6.0 -> 2.6.1 commits.  I 
believe the 2.6.x commits can also be improved.
The 2.6.11.x commits tend to be <10 seconds.  Which should be your common 
case.
(Yes, I know git is blazingly fast, but are these numbers reasonable?  How far 
would they need to go to be reasonable?  I think these are comparable to what 
monotone would do, but they've done some optimization recently, so I'm 
probably slower.)

As for disk usage (du -sck) of the above repository:
152     tags
168     nodes
354572  patches
Those three directories contain all the history.
1379704 cache
And that can be re-created from the nodes+patches.
There is no compression going on here, which would be an easy way to improve 
that.  I haven't implemented that yet though.

The Sandbox:
------------
At the top directory of the sandbox, I have a .scm directory containing state 
information such as current revision and repository location.  The rest is 
just source code.  Pretty straight-forward.

Efficiency/Optimizations:
-------------------------
I used some of the ideas from the git discussion in this implementation.
The sandbox keeps some cached information in the .scm; the file stat 
information, and a current hash of each file.  That way it can look at only 
the files it needs to for committing, diffing, etc.
(Sadly, this may not have helped as much as I expected.  I have a version of 
my tool that just does a straight recursive diff on the whole thing, and the 
times are comparable.  That may just point to a really dumb bug on my part 
though.)
I do a lot of os.system()'s and os.popen()'s that could probably be redone.  
Doing a quick grep, it looks like I use: diff, sed, cp, filterdiff, diffstat, 
grep, tail, patch, find, xargs, stat, sort, sha1sum, rsync and $EDITOR.  (In 
my defense, I was aiming for speed of implementation--after all, git is 
progressing quite quickly.)
It might make sense to glue this together with git -- use your "content 
addressable file system" for the cache directory to get the speed, but use 
the the nodes and patches to record the history.

Implementation:
---------------
Attached is scm.py (pronounced "skimpy"), which as the name suggests, is a 
python implementation of the SCM system I've described.  I'm using Python 
2.3.4 on Fedora Core 3, and I haven't tested this on other 
distributions/platforms/etc.  (This is all original code I've done on my own 
time and equipment, and I'm releasing it under the GPL-v2-or-later, with 
absolutely no warranty, and the warning that it may kill your dog, wreck your 
car, etc.)

There are a number of known issues:
The program is a little user-hostile--when it runs into something unexpected, 
it's liable to raise an exception, print a back-trace, and exit.  Hopefully 
it'll give a useful hint in the exception.  It may even leave something 
in /tmp, though if it does, that's a bug I'd like to hear about.
Sometimes you'll get prompted by patch when scm calls out to it when it really 
should be set up to run non-interactively.
I have not dealt with file renames yet, but I expect that to become part of 
the patch "comment" area, much like the diffstat is.  I have not decided how 
to best represent that.
I have not implemented merging yet, but the groundwork is there.  I expect to 
do a breadth-first search for a common ancestor, find the patches in the line 
of development you are merging into your own that you do not already have in 
your own, apply them to your line of development creating nodes along the 
way, then let you deal with the broken parts.  As a first pass anyway.
I haven't directly addressed file permissions or directories yet.  I'm 
expecting those would wind up being handled similarly to file renames in the 
patch files.
If we want to add arbitrary attributes for patches and/or nodes, we could 
create a "attr" directory containing a directory named after the node or 
patch, and in that, a file like <epoch>-<author>-<attribute-name> containing 
the value of that attribute.  For that matter, we could re-do tags that way.

Quick Start:
------------
./scm.py help
./scm.py init <path-to-new-repository>
./scm.py --repo=<repository> checkout mysandbox
cd mysandbox
<copy your source in>
./scm.py diff
./scm.py diffstat
./scm.py commit
And this one is fun:
./scm.py --repo=<repository> graph | dot -Tps repo-history.ps

Main Points:
------------
I believe that this representation of the kernel history is better than git's 
because I believe it will improve our ability to do merges and more closely 
represents what is occurring.
I believe that the repository design will give us many of the same benefits as 
git, and some that git does not give us.
I believe that the implementation can be significantly improved.
The performance scales primarily by the size of the change, rather than the 
size of the code base or the size of the repository.


Ok, so that's scm.py.  Now, let me lock my ego in this nicely padded safe, don 
the asbestos coveralls, and cower behind this nice sturdy boulder....

Alright, pull out the flame throwers and have at it! :)

Eli
-----------------------. "If it ain't broke now,
Eli Carter              \             it will be soon." -- crypto-gram
[email protected] `--------------------------------------------
Attachment: scm.py
Description: application/python
Prev by Date: Re: insmod segfault in pci_find_subsys()
Next by Date: Re: EXPORT_SYMBOL_GPL for __symbol_get replacing EXPORT_SYMBOL for deprecated inter_module_get
Previous by thread: [SCM saga] An alternative to git
Next by thread: [patch 098/198] x86_64: Final support for AMD dual core
Index(es):
- Date
- Thread
[Index of Archives] [Kernel Newbies] [Netfilter] [Bugtraq] [Photo] [Stuff] [Gimp] [Yosemite News] [MIPS Linux] [ARM Linux] [Linux Security] [Linux RAID] [Video 4 Linux] [Linux for the blind] [Linux Resources]