Re: Finding Duplicate Files

Alan wrote:

(if that makes sense). rsync --compare-dest and --link-dest : fantastic.

I wrote a program MANY years back that searches for duplicate files.  (I
had a huge number of files from back in the BBS days that were the same
file under different names.)

Here is how I did it. (This was done using Perl 4.0 originally.)

1. Recurse through all the directories and build a hash keyed by
   file size.
2. Go through that hash and look for collisions, i.e. sizes shared by
   more than one file.  (This prevents you from running an MD5SUM over
   very large files that only occur once.)
3. For each set of size collisions, build a hash table of MD5SUMs
   (the program now uses SHA512).
4. Take any hash collisions and add them to a stack, then prompt the
   user for what to do with those entries.
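
For anyone who wants to try the same approach, here is a minimal sketch
of those two passes in modern Perl, using File::Find and Digest::SHA.
This is not the original program; the output format and script layout
are just illustrative assumptions.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::SHA;

# Pass 1: group files by size.  Only sizes shared by more than one
# file can possibly contain duplicates.
my %by_size;
find({ no_chdir => 1, wanted => sub {
    return unless -f $_;
    push @{ $by_size{ -s _ } }, $_;
} }, @ARGV ? @ARGV : '.');

# Pass 2: for each size collision, group by a digest of the contents
# (SHA-512 here, as in the current version of the program).
for my $files (values %by_size) {
    next unless @$files > 1;

    my %by_digest;
    for my $file (@$files) {
        my $sha = Digest::SHA->new(512);
        $sha->addfile($file, 'b');              # binary-mode read
        push @{ $by_digest{ $sha->hexdigest } }, $file;
    }

    # Digest collisions are the duplicate sets; the real program
    # prompts the user here rather than just printing them.
    for my $dupes (values %by_digest) {
        print join("\n  ", 'Duplicates:', @$dupes), "\n" if @$dupes > 1;
    }
}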

There is also another optimization to the above.  The first content
hash should cover only the first 32k or so of each file.  If there are
still collisions, hash the whole file and check for collisions on
those.  This two-pass check speeds things up a great deal if you have
many large files of the same size (multi-part archives, for example).
Using this method I have removed all the duplicate files on a terabyte
drive in about 3 hours or so.  (That was without the above
optimization.)
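
In code, that pre-filter is just one more grouping pass over a
truncated read before the whole-file digest above.  A rough sketch,
assuming the same Digest::SHA module; the helper name head_digest is
my own invention, only the 32k cutoff comes from the description
above.

use strict;
use warnings;
use Digest::SHA;

# Digest only the first 32 KiB of a file as a cheap pre-filter; files
# whose partial digests still collide then get the (expensive)
# whole-file digest as in the sketch above.
sub head_digest {
    my ($file, $bytes) = @_;
    $bytes //= 32 * 1024;                       # "the first 32k or so"
    open my $fh, '<:raw', $file or return;
    read $fh, my $buf, $bytes;
    close $fh;
    return Digest::SHA->new(512)->add($buf)->hexdigest;
}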

I suppose it is a little late to mention this now, but backuppc
(http://backuppc.sourceforge.net/) does this automatically: as it
copies files in, it compresses them and eliminates the duplication.
If you had used it instead of an ad-hoc set of copies as backups in
the first place, you would have a web browser view of everything in
its original locations at each backup interval, while taking up less
space than one original copy (depending on the amount of change...).

--
  Les Mikesell
   lesmikesell@xxxxxxxxx

