> "Jonathan Roberts" <jonathan.roberts.uk@xxxxxxxxxxxxxx> writes: > >> I have several folders each approx 10-20 Gb in size. Each has some >> unique material and some duplicate material, and it's even possible >> there's duplicate material in sub-folders too. How can I consolidate >> all of this into a single folder so that I can easily move the backup >> onto different mediums, and get back some disk space!? > > An rsync-y solution not yet mentioned is to copy each dir 'd' to > 'd.tidied' while giving a --compare-dest=... flag for each of the > _other_ dirs. 'd.tidied' will end up stuff unique to 'd'. You > can then combine them all with 'tar' or 'cp' or whatever. > > You could use the 'sha1sum' tool to test that files in the > *.tidied dirs really are unique. > > This technique will catch identical files with like names, e.g. > > d1/foo/bar/wibble > d2/foo/bar/wibble > > but not > > d1/foo/bar/wibble > d2/bar/wibble/foo/wobble > > (if that makes sense). rsync --compare-dest and --link-dest : fantastic. I wrote a program MANY years back that searches for duplicate files. (I had a huge number of files from back in the BBS days that had the same file but different names.) Here is how I did it. (This was done using Perl 4.0 originally.) Recurse through all the directories and build a hash of the file sizes. Go through the hash table and look for collisions. (This prevents you from doing an MD5SUM on very large files that occur once.) For each set of collisions, build a hash table of MD5SUMS (the program now uses SHA512). Take any hash collisions and add them to a stack. Prompt the user what to do with those entries. There is also another optimization to the above. The first hash should only take the first 32k or so. If there are collisions, then hash the whole file and check for collisions on those. This two pass check speeds things up by a great deal of you have many large files of the same size. (Multi-part archives, for example.) Using this method I have removed all the duplicate files on a terabyte drive in about 3 hours or so. (Without the above optimization.) BTW, for all you patent trolls... The idea was also independently done by Phil Karns back in the early 90s as well. (I think I did this around 1996-97.) I have not publically released the source for a couple of reasons. 1) the program is a giant race condition. 2) the code was written before I was that great of a Perl programmer and has some real fugly bits in it. I am rewriting it with lots of added features and optimizations beyond those mentioned above.