Best way of identifying duplicate files in Cocoa

  • Hi,

    For my latest project, I need to be able to check whether the files
    (or file bundles) at two paths are duplicates (= have the same
    content) or not.

    While this would be ridiculously easy to do on a plain Unix system, it
    turns out to be a major undertaking on OS X.

    * file bundles aren't files, but directories, so in fact I need to be
    able to compare directories (invisible files included)
    * OS 9 files still have resource forks as well as data forks, so both
    need to be checked

    Another issue is of course performance. Comparing byte-by-byte is
    certainly the simplest and most reliable way of doing this, but it's
    SLOW.. on the other hand I don't really know what the performance
    characteristics of an MD5, CRC32, or SHA hash are and whether or not
    you need to read in the whole file contents to apply them..

    It would thus be great if somebody, somewhere had published a ready-to-
    use - (BOOL) file: (NSString*) path isIdenticalTo: (NSString*) path2;
    method :-)

    I've spent the last two hours searching the web, but I haven't found
    anything that comes close..

    Has anybody ever come across such a library?

    If not, would you be using a hash function rather than a byte-by-byte
    comparison?

    All this would be for Mac OS X 10.4 or later.

    I'd appreciate your insights, thanks for your time.

    Best regards,

    Frank
  • On 16 Nov 2007, at 14:25, Frank Reiff wrote:
    > Another issue is of course performance. Comparing byte-by-byte is
    > certainly the simplest and most reliable way of doing this, but
    > it's SLOW.. on the other hand I don't really know what the
    > performance characteristics of an MD5, CRC32, or SHA hash are and
    > whether or not you need to read in the whole file contents to apply
    > them..
    >
    > It would thus be great if somebody, somewhere had published a ready-
    > to-use - (BOOL) file: (NSString*) path isIdenticalTo: (NSString*)
    > path2; method :-)
    >
    > I've spent the last two hours searching the web, but I haven't
    > found anything that comes close..

    You don't have to check byte-by-byte if the two files have
    different sizes.
    And comparing byte-by-byte is not so slow, as you can abort the
    comparison as soon as two bytes differ.

    Using a hash has no benefit when comparing two files on disk.
    It's only useful if you want to compare a remote file (with a
    precomputed hash) against a local file.
  • On 16 Nov 2007, at 14:25, Frank Reiff wrote:

    > It would thus be great if somebody, somewhere had published a ready-
    > to-use - (BOOL) file: (NSString*) path isIdenticalTo: (NSString*)
    > path2; method :-)
    >

    Maybe you could call down to rsync with the -n option. I haven't tested
    it myself, but it should show you what it would need to do to get the two
    paths in sync. If it needs to do anything, then I guess that would
    imply they are different.
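
    Something like the following might do it (an untested sketch; the exact
    rsync flags are a guess, and a dry run will also report metadata-only
    differences such as permissions or modification dates, so treat it as a
    rough "probably different" test rather than a strict content compare):

    #import <Foundation/Foundation.h>

    static BOOL PathsLookIdentical(NSString *path1, NSString *path2)
    {
        NSTask *task = [[[NSTask alloc] init] autorelease];
        [task setLaunchPath:@"/usr/bin/rsync"];
        /* -a recurse and preserve, -n dry run, -c compare by checksum,
           -i itemize what would change, --delete notice extra files */
        [task setArguments:[NSArray arrayWithObjects:@"-a", @"-n", @"-c", @"-i",
            @"--delete",
            [path1 stringByAppendingString:@"/"],
            [path2 stringByAppendingString:@"/"], nil]];

        NSPipe *pipe = [NSPipe pipe];
        [task setStandardOutput:pipe];
        [task launch];
        NSData *output = [[pipe fileHandleForReading] readDataToEndOfFile];
        [task waitUntilExit];

        /* nothing itemized and a clean exit means rsync had nothing to do */
        return ([task terminationStatus] == 0 && [output length] == 0);
    }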

    Matt Gough
  • Hi Jean-Daniel,

    Thanks for your response.

    On 16 Nov 2007, at 14:46, Jean-Daniel Dupas wrote:

    >
    > On 16 Nov 2007, at 14:25, Frank Reiff wrote:
    >> Another issue is of course performance. Comparing byte-by-byte is
    >> certainly the simplest and most reliable way of doing this, but
    >> it's SLOW.. on the other hand I don't really know what the
    >> performance characteristics of an MD5, CRC32, or SHA hash are and
    >> whether or not you need to read in the whole file contents to apply
    >> them..
    >>
    >> It would thus be great if somebody, somewhere had published a ready-
    >> to-use - (BOOL) file: (NSString*) path isIdenticalTo: (NSString*)
    >> path2; method :-)
    >>
    >> I've spent the last two hours searching the web, but I haven't
    >> found anything that comes close..
    >
    > You don't have to check byte-by-byte if the two files have
    > different sizes.
    > And comparing byte-by-byte is not so slow, as you can abort the
    > comparison as soon as two bytes differ.
    >
    > Using a hash has no benefit when comparing two files on disk.
    > It's only useful if you want to compare a remote file (with a
    > precomputed hash) against a local file.

    I'll probably be going with:

    * check length
    * check the last few bytes (many files begin with the same bytes but
    don't end with them)
    * check byte-by-byte
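
    Something along these lines, perhaps (completely untested, just to make
    the idea concrete; flat files only, no bundles or resource forks, and the
    64 KB chunk size and 64-byte tail check are arbitrary):

    - (BOOL)file:(NSString *)path isIdenticalTo:(NSString *)path2
    {
        NSFileManager *fm = [NSFileManager defaultManager];

        // 1) check length
        NSDictionary *attrs1 = [fm fileAttributesAtPath:path traverseLink:YES];
        NSDictionary *attrs2 = [fm fileAttributesAtPath:path2 traverseLink:YES];
        if (attrs1 == nil || attrs2 == nil)
            return NO;
        unsigned long long size = [attrs1 fileSize];
        if (size != [attrs2 fileSize])
            return NO;
        if (size == 0)
            return YES;

        NSFileHandle *h1 = [NSFileHandle fileHandleForReadingAtPath:path];
        NSFileHandle *h2 = [NSFileHandle fileHandleForReadingAtPath:path2];
        if (h1 == nil || h2 == nil)
            return NO;

        // 2) check the last few bytes first
        unsigned tail = (unsigned)MIN(size, 64ULL);
        [h1 seekToFileOffset:size - tail];
        [h2 seekToFileOffset:size - tail];
        if (![[h1 readDataOfLength:tail] isEqualToData:[h2 readDataOfLength:tail]])
            return NO;

        // 3) compare the rest chunk by chunk, bailing out at the first difference
        [h1 seekToFileOffset:0];
        [h2 seekToFileOffset:0];
        unsigned long long remaining = size;
        while (remaining > 0) {
            NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
            unsigned chunk = (unsigned)MIN(remaining, 64ULL * 1024ULL);
            BOOL same = [[h1 readDataOfLength:chunk]
                            isEqualToData:[h2 readDataOfLength:chunk]];
            [pool release];
            if (!same)
                return NO;
            remaining -= chunk;
        }
        return YES;
    }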

    Computing a hash could be interesting in situations where there are
    lots and lots of files with the same length. Instead of having to
    compare each file with all the other files of the same length, one could
    simply compute each file's hash by traversing it once and then compare
    the hashes instead. Of course, in order to be 100% certain one would need
    to then do another byte-by-byte check. Alternatively, one could
    cache the relationships between all files, e.g. A != B and B == C
    implies A != C.

    I can see this could be fun :-)

    Best regards,

    Frank
  • I implemented MD5 hashing and comparison in a file diff utility I
    wrote for internal use, and I gotta say . . . it was *fast* with tens
    of thousands of files of varying size. (Say, anywhere from 4KB to
    dozens of megs.)

    --
    m-s

    On 20 Nov, 2007, at 16:42, Frank Reiff wrote:

    > Hi Jean-Daniel,
    >
    > Thanks for your response.
    >
    > On 16 Nov 2007, at 14:46, Jean-Daniel Dupas wrote:
    >
    >>
    >> On 16 Nov 2007, at 14:25, Frank Reiff wrote:
    >>> Another issue is of course performance. Comparing byte-by-byte is
    >>> certainly the simplest and most reliable way of doing this, but
    >>> it's SLOW.. on the other hand I don't really know what the
    >>> performance characteristics of an MD5, CRC32, or SHA hash are and
    >>> whether or not you need to read in the whole file contents to
    >>> apply them..
    >>>
    >>> It would thus be great if somebody, somewhere had published a
    >>> ready-to-use - (BOOL) file: (NSString*) path isIdenticalTo:
    >>> (NSString*) path2; method :-)
    >>>
    >>> I've spent the last two hours searching the web, but I haven't
    >>> found anything that comes close..
    >>
    >> You don't have to check byte-by-byte if the two files have
    >> different sizes.
    >> And comparing byte-by-byte is not so slow, as you can abort the
    >> comparison as soon as two bytes differ.
    >>
    >> Using a hash has no benefit when comparing two files on disk.
    >> It's only useful if you want to compare a remote file (with a
    >> precomputed hash) against a local file.
    >
    > I'll probably be going with:
    >
    > * check length
    > * check the last few bytes (many files begin with the same bytes but
    > don't end with them)
    > * check byte-by-byte
    >
    > Computing a hash could be interesting in situations where there are
    > lots and lots of files with the same length. Instead of having to
    > compare each file with all other files of the same length, one could
    > simply compute the hash by traversing it once and then compare the
    > hashes instead. Of course in order to be 100% certain one would need
    > to then do another byte-by-byte check. Alternatively, one could
    > cache the relationships between all files, e.g. A != B and B == C
    > implies A != C.
    >
    > I can see this could be fun :-)
    >
    > Best regards,
    >
    > Frank
  • On Nov 20, 2007, at 2:48 PM, Michael Watson wrote:
    > I implemented MD5 hashing and comparison in a file diff utility I
    > wrote for internal use, and I gotta say . . . it was *fast* with
    > tens of thousands of files of varying size. (Say, anywhere from 4KB
    > to dozens of megs.)

    So did I!  Here is source:

    http://svn.red-bean.com/bbum/trunk/hacques/dupinator.py

    It checks the file sizes and then hashes the first 4k.  Finally, it'll
    hash the full file if the sizes and first 4k match.
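
    If you'd rather stay in Objective-C, the same general idea might look
    roughly like this (untested sketch, not a transliteration of the script
    above: it buckets every candidate by size plus an MD5 of its first 4k,
    and only files that end up sharing a bucket need a full comparison
    afterwards):

    #import <Foundation/Foundation.h>
    #include <CommonCrypto/CommonDigest.h>

    /* MD5 of the first 4k of a file, as a hex string (nil on error). */
    static NSString *FirstChunkDigest(NSString *path)
    {
        NSFileHandle *fh = [NSFileHandle fileHandleForReadingAtPath:path];
        if (fh == nil)
            return nil;
        NSData *chunk = [fh readDataOfLength:4096];

        unsigned char md[CC_MD5_DIGEST_LENGTH];
        CC_MD5([chunk bytes], (CC_LONG)[chunk length], md);

        NSMutableString *hex = [NSMutableString stringWithCapacity:2 * CC_MD5_DIGEST_LENGTH];
        unsigned i;
        for (i = 0; i < CC_MD5_DIGEST_LENGTH; i++)
            [hex appendFormat:@"%02x", md[i]];
        return hex;
    }

    /* Group candidate paths into buckets keyed by "size/first-4k-digest". */
    static NSDictionary *BucketBySizeAndFirstChunk(NSArray *paths)
    {
        NSFileManager *fm = [NSFileManager defaultManager];
        NSMutableDictionary *buckets = [NSMutableDictionary dictionary];
        NSEnumerator *e = [paths objectEnumerator];
        NSString *path;

        while ((path = [e nextObject])) {
            unsigned long long size =
                [[fm fileAttributesAtPath:path traverseLink:YES] fileSize];
            NSString *digest = FirstChunkDigest(path);
            if (digest == nil)
                continue;

            NSString *key = [NSString stringWithFormat:@"%llu/%@", size, digest];
            NSMutableArray *bucket = [buckets objectForKey:key];
            if (bucket == nil) {
                bucket = [NSMutableArray array];
                [buckets setObject:bucket forKey:key];
            }
            [bucket addObject:path];
        }
        return buckets;
    }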

    b.bum
  • > On Nov 20, 2007, at 2:48 PM, Michael Watson wrote:
    >> I implemented MD5 hashing and comparison in a file diff utility I
    >> wrote for internal use, and I gotta say . . . it was *fast* with
    >> tens of thousands of files of varying size. (Say, anywhere from
    >> 4KB to dozens of megs.)
    >
    > So did I!  Here is source:
    >
    > http://svn.red-bean.com/bbum/trunk/hacques/dupinator.py
    >
    > It checks the file sizes and then hashes the first 4k.  Finally,
    > it'll hash the full file if the sizes and first 4k match.
    >
    > b.bum

    To get an MD5 you have to read the file AND compute the digest. To
    compare two files, you just have to read the files. What is the benefit
    of the MD5 in this case?

    I use plain C to be able to compare forks, but you can easily replace
    CFDataRef with NSData:

    static __inline__
    OSStatus SOFileReadChunk(SInt16 aFork, CFIndex length, CFMutableDataRef buffer) {
      ByteCount size = 0;
      OSStatus err = noErr;
      ByteCount remaining = length;
      CFDataSetLength(buffer, length);

      void *buf = CFDataGetMutableBytePtr(buffer);
      do {
        err = FSReadFork(aFork, fsAtMark | kFSNoCacheMask, 0, remaining, buf, &size);
        if (noErr == err || eofErr == err) {
          remaining -= size;
          buf = (char *)buf + size;  /* advance in case of a short read */
        }
      } while (remaining > 0 && noErr == err);

      /* shrink the buffer to what was actually read (matters on the last chunk) */
      CFDataSetLength(buffer, length - remaining);
      return err;
    }

    #define BUFFER_SIZE_KB 32

    static
    OSStatus SOCompareFork(FSRef *f1, FSRef *f2, HFSUniStr255 *forkName, bool *equals) {
      OSStatus err = noErr;
      SInt16 fnum1 = 0, fnum2 = 0;

      err = FSOpenFork(f1, forkName->length, forkName->unicode, fsRdPerm, &fnum1);
      if (noErr == err)
        err = FSOpenFork(f2, forkName->length, forkName->unicode, fsRdPerm, &fnum2);

      if (noErr == err) {
        *equals = true;
        CFMutableDataRef d1 = CFDataCreateMutable(kCFAllocatorDefault, 1024 * BUFFER_SIZE_KB);
        CFMutableDataRef d2 = CFDataCreateMutable(kCFAllocatorDefault, 1024 * BUFFER_SIZE_KB);

        /* read both forks chunk by chunk and stop at the first difference */
        do {
          err = SOFileReadChunk(fnum1, 1024 * BUFFER_SIZE_KB, d1);
          if (noErr == err || eofErr == err)
            err = SOFileReadChunk(fnum2, 1024 * BUFFER_SIZE_KB, d2);
          if (noErr == err || eofErr == err)
            *equals = CFEqual(d1, d2);
        } while (noErr == err && *equals);

        CFRelease(d2);
        CFRelease(d1);
      }
      if (eofErr == err) err = noErr;
      if (fnum2) verify_noerr(FSCloseFork(fnum2));
      if (fnum1) verify_noerr(FSCloseFork(fnum1));

      return err;
    }
  • On Nov 21, 2007, at 1:33 AM, Jean-Daniel Dupas wrote:
    > To get a MD5 you have to read the file, AND compute the digest. To
    > compare to file, you just have to read the file. What is the benefit
    > of the MD5 in this case?

    I can MD5 the first N bytes and build up a shallow tree of all files
    that are (a) of the same size and (b) are identical for the first 1024
    bytes (as hashed by a checksum of said bytes).

    From there, yes, calculating an MD5 of the full file contents is a
    complete waste of time when comparing whole files.  Given the relative
    infrequency of files that are identical within the first 1K, the
    silliness of those particular lines of code was never identified as a
    performance bottleneck. ;)

    b.bum
  • > On 16 Nov 2007, at 14:25, Frank Reiff <reiff...> wrote:
    >
    > Hi,
    >
    > For my latest project, I need to be able to check whether the files
    > (or file bundles) at two paths are duplicates (= have the same
    > content) or not.
    >
    > While this would be ridiculously easy to do on a plain Unix system, it
    > turns out to be a major undertaking on OS X.
    >
    > * file bundles aren't files, but directories, so in fact I need to be
    > able to compare directories (invisible files included)
    > * OS 9 files still have resource forks as well as data forks, so both
    > need to be checked
    >
    > Another issue is of course performance. Comparing byte-by-byte is
    > certainly the simplest and most reliable way of doing this, but it's
    > SLOW.. on the other hand I don't really know what the performance
    > characteristics of an MD5, CRC32, or SHA hash are and whether or not
    > you need to read in the whole file contents to apply them..
    >
    > It would thus be great if somebody, somewhere had published a ready-to-
    > use - (BOOL) file: (NSString*) path isIdenticalTo: (NSString*) path2;
    > method :-)

    [SNIP]

    I had to do something similar to this, but I did it using Python.  The
    method I used is fairly fast (about 3 GB with 30,000 files, in about 27
    seconds of user time on a Xeon 3 GHz machine), and it worked well.  I
    basically built trees and dictionaries to do it.

    1) Build a tree of all the files you're looking into, including the
    directories.  The leaf nodes are the files, the interior nodes are the
    directories.  CoreData is a good idea for this.

    2) Depth first, and recursively do the following:
        a) If the node is a file, its value is its hash
        b) If the node is a directory, sort its direct descendants by their
          hashes, and then treat that array as if it were a single string, and
          hash it.  This is the directory's hash (a sketch of this step
          follows below).

    3) Create a dictionary where the keys are hashes and values are lists of
    paths.  Do this by walking the tree from step 2, but this time starting at
    the root of the tree, and do it breadth first.  Any time you have a list
    with 2 or more items in it, then everything below those points is probably
    the same.  This is true of directories, bundles, etc.  There is no point in
    walking below either side, so you can trim the subtrees off from the tree
    in step 2, or you can just skip them.  I skip them; it makes debugging
    easier if the information is still around.

    4) At this point, you'll have to do a file by file comparison to be sure
    that you really have duplicates.  BTW, if you DO find collisions in the
    SHA-1 hash, and the file contents are different, tell NIST (www.nist.gov)
    about it; they'd be interested to know (assuming you're willing to share the
    directory contents for them to poke at)

    Also note that my method will miss certain cases, e.g. if A and B are
    identical trees, and C is a subtree of either one, then if I run into A and
    B first, I'll never realize that C is a subtree (because of my
    skipping/trimming step).  For me, this isn't a big deal.  My tool is just
    for cleaning up large amounts of garbage, rapidly.  Once I delete either A
    or B, I can then rerun my tool and pick up C as well.  Maybe not the
    cleanest way, but it works for me.
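
    For what it's worth, the directory-hash part of step 2 might look
    something like this in Objective-C (untested sketch; it assumes the
    children's hashes are already available as hex strings and uses
    CommonCrypto's SHA-1):

    #import <Foundation/Foundation.h>
    #include <CommonCrypto/CommonDigest.h>

    /* Hash of a directory = SHA-1 over its children's hashes, sorted, so
       that two directories with the same contents get the same hash no
       matter what order they were traversed in. */
    static NSString *DirectoryHash(NSArray *childHexHashes)
    {
        NSArray *sorted = [childHexHashes sortedArrayUsingSelector:@selector(compare:)];
        NSData *bytes = [[sorted componentsJoinedByString:@""]
                            dataUsingEncoding:NSUTF8StringEncoding];

        unsigned char md[CC_SHA1_DIGEST_LENGTH];
        CC_SHA1([bytes bytes], (CC_LONG)[bytes length], md);

        NSMutableString *hex = [NSMutableString stringWithCapacity:2 * CC_SHA1_DIGEST_LENGTH];
        unsigned i;
        for (i = 0; i < CC_SHA1_DIGEST_LENGTH; i++)
            [hex appendFormat:@"%02x", md[i]];
        return hex;
    }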

    Thanks,
    Cem Karan
  • Dear Jean-Daniel,

    Thanks for the snippet.

    This is all for a later release of the software, so I won't get around
    to actually trying this for a few weeks yet.. for the moment I'm
    eliminating memory leaks with Core Data, optimizing performance and
    sorting out multi-threading bugs :-(

    I can see the point of using MD5 (or any other hash really), but
    it's also a question of reliability vs. performance. The more code
    you've got, the more opportunities you have to mess something up. I'm
    not sure yet whether performance is going to be a huge issue or not.
    I'll probably first implement something quite simple but (hopefully)
    bulletproof and then I might look into optimizing it over time.

    I'm not actually writing one of those "eliminate duplicate files to
    make space on your hard disk" kind of utilities and I only really need
    to check files which have identical names but are in different
    directories. So I can go with:

    1) not the same name
    2) not the same size
    3) not the same content

    and 1) should take care of 99% of cases and 2) will probably put a big
    dent into what's left.

    Long term the option of checking all files for duplicates irrespective
    of their names might be a cool new feature to add, so more complex
    schemes will come into their own there. Of course, there's always the
    guy out there who has five million files with the same name and the same
    5GB length but different content in the last three bytes.. but
    perhaps I'll wait for him to get in touch with me first ("It's really
    slow!") :-)

    Best regards,

    Frank

    On 21 Nov 2007, at 10:33, Jean-Daniel Dupas wrote:

    >
    >> On Nov 20, 2007, at 2:48 PM, Michael Watson wrote:
    >>> I implemented MD5 hashing and comparison in a file diff utility I
    >>> wrote for internal use, and I gotta say . . . it was *fast* with
    >>> tens of thousands of files of varying size. (Say, anywhere from
    >>> 4KB to dozens of megs.)
    >>
    >> So did I!  Here is source:
    >>
    >> http://svn.red-bean.com/bbum/trunk/hacques/dupinator.py
    >>
    >> It checks the file sizes and then hashes the first 4k.  Finally,
    >> it'll hash the full file if the sizes and first 4k match.
    >>
    >> b.bum
    >
    > To get an MD5 you have to read the file AND compute the digest. To
    > compare two files, you just have to read the files. What is the benefit
    > of the MD5 in this case?
    >
    > I use plain C to be able to compare forks, but you can easily
    > replace CFDataRef with NSData:
    >
    > static __inline__
    > OSStatus SOFileReadChunk(SInt16 aFork, CFIndex length, CFMutableDataRef buffer) {
    >   ByteCount size = 0;
    >   OSStatus err = noErr;
    >   ByteCount remaining = length;
    >   CFDataSetLength(buffer, length);
    >
    >   void *buf = CFDataGetMutableBytePtr(buffer);
    >   do {
    >     err = FSReadFork(aFork, fsAtMark | kFSNoCacheMask, 0, remaining, buf, &size);
    >     if (noErr == err || eofErr == err) {
    >       remaining -= size;
    >       buf = (char *)buf + size;  /* advance in case of a short read */
    >     }
    >   } while (remaining > 0 && noErr == err);
    >
    >   /* shrink the buffer to what was actually read (matters on the last chunk) */
    >   CFDataSetLength(buffer, length - remaining);
    >   return err;
    > }
    >
    > #define BUFFER_SIZE_KB 32
    >
    > static
    > OSStatus SOCompareFork(FSRef *f1, FSRef *f2, HFSUniStr255 *forkName, bool *equals) {
    >   OSStatus err = noErr;
    >   SInt16 fnum1 = 0, fnum2 = 0;
    >
    >   err = FSOpenFork(f1, forkName->length, forkName->unicode, fsRdPerm, &fnum1);
    >   if (noErr == err)
    >     err = FSOpenFork(f2, forkName->length, forkName->unicode, fsRdPerm, &fnum2);
    >
    >   if (noErr == err) {
    >     *equals = true;
    >     CFMutableDataRef d1 = CFDataCreateMutable(kCFAllocatorDefault, 1024 * BUFFER_SIZE_KB);
    >     CFMutableDataRef d2 = CFDataCreateMutable(kCFAllocatorDefault, 1024 * BUFFER_SIZE_KB);
    >
    >     /* read both forks chunk by chunk and stop at the first difference */
    >     do {
    >       err = SOFileReadChunk(fnum1, 1024 * BUFFER_SIZE_KB, d1);
    >       if (noErr == err || eofErr == err)
    >         err = SOFileReadChunk(fnum2, 1024 * BUFFER_SIZE_KB, d2);
    >       if (noErr == err || eofErr == err)
    >         *equals = CFEqual(d1, d2);
    >     } while (noErr == err && *equals);
    >
    >     CFRelease(d2);
    >     CFRelease(d1);
    >   }
    >   if (eofErr == err) err = noErr;
    >   if (fnum2) verify_noerr(FSCloseFork(fnum2));
    >   if (fnum1) verify_noerr(FSCloseFork(fnum1));
    >
    >   return err;
    > }
  • Hi Bill,

    Thanks for the code example. It always seems that this type of thing is
    better done in a scripting language. It's something about the economy
    of the language. I've been using Ruby for around a year now and it
    really rocks for this type of stuff.

    MD5 clearly is the right solution for comparing large numbers of
    files against each other. I'm not sure I'll be needing such a heavy-
    duty solution, but you never know.. once you add a feature, people find
    ways of testing it to the limit..

    What's more, I don't want to keep much in memory, so the tree would
    need to be built inside a Core Data persistent store, which might
    negate any performance gains.

    I haven't started implementing the feature yet, but I'll be sure to
    have a look at your code before embarking on it, and I'll see how much
    of a performance bottleneck it turns out to be..

    Best regards,

    Frank

    On 21 Nov 2007, at 10:55, Bill Bumgarner wrote:

    > On Nov 21, 2007, at 1:33 AM, Jean-Daniel Dupas wrote:
    >> To get an MD5 you have to read the file AND compute the digest. To
    >> compare two files, you just have to read the files. What is the
    >> benefit of the MD5 in this case?
    >
    > I can MD5 the first N bytes and build up a shallow tree of all files
    > that are (a) of the same size and (b) are identical for the first
    > 1024 bytes (as hashed by a checksum of said bytes).
    >
    > From there, yes, calculating an md5 of full file contents is a
    > complete waste of time when comparing whole files.  Given the
    > relative infrequency of files that are identical within the first
    > 1K, the silliness of those particular lines of code was never
    > identified as a performance bottleneck. ;)
    >
    > b.bum
  • Hi Cem,

    > I had to do something similar to this, but I did it using Python.  The
    > method I used is fairly fast (about 3 GB with 30,000 files, in about 27
    > seconds of user time on a Xeon 3 GHz machine), and it worked well.  I
    > basically built trees and dictionaries to do it.
    >
    > 1) Build a tree of all the files you're looking into, including the
    > directories.  The leaf nodes are the files, the interior nodes are the
    > directories.  CoreData is a good idea for this.

    I'm in Core Data already, so that wouldn't be a problem.

    > 2) Depth first, and recursively do the following:
    >     a) If the node is a file, its value is its hash
    >     b) If the node is a directory, sort its direct descendants by their
    >        hashes, and then treat that array as if it were a single string,
    >        and hash it.  This is the directory's hash.
    >
    > 3) Create a dictionary where the keys are hashes and values are lists
    > of paths.  Do this by walking the tree from step 2, but this time
    > starting at the root of the tree, and do it breadth first.  Any time
    > you have a list with 2 or more items in it, then everything below those
    > points is probably the same.  This is true of directories, bundles,
    > etc.  There is no point in walking below either side, so you can trim
    > the subtrees off from the tree in step 2, or you can just skip them.  I
    > skip them; it makes debugging easier if the information is still
    > around.
    >
    > 4) At this point, you'll have to do a file by file comparison to be
    > sure that you really have duplicates.  BTW, if you DO find collisions
    > in the SHA-1 hash, and the file contents are different, tell NIST
    > (www.nist.gov) about it; they'd be interested to know (assuming you're
    > willing to share the directory contents for them to poke at)
    >
    > Also note that my method will miss certain cases, e.g. if A and B are
    > identical trees, and C is a subtree of either one, then if I run into A
    > and B first, I'll never realize that C is a subtree (because of my
    > skipping/trimming step).  For me, this isn't a big deal.  My tool is
    > just for cleaning up large amounts of garbage, rapidly.  Once I delete
    > either A or B, I can then rerun my tool and pick up C as well.  Maybe
    > not the cleanest way, but it works for me.

    That's a fairly impressive algorithm..

    Luckily, I don't need to go quite that far because of the nature of
    the program I'm writing. I'm basically taking a bunch of files,
    computing new destination paths for them and then checking whether there
    are any file name clashes (say, two files both called "1.jpg").
    If there are any clashes, I need to know whether they are duplicates
    (in which case I throw away the duplicate) or not (in which case I
    change one of the file names to make it unique, e.g. "1.jpg" and "1
    (1).jpg").

    There's certainly plenty to think about now..

    Thanks again for your insights.

    Best regards,

    Frank