Re: Best way of identifying duplicate files in Cocoa
FROM : Michael Watson
DATE : Tue Nov 20 23:48:46 2007

I implemented MD5 hashing and comparison in a file diff utility I 
wrote for internal use, and I gotta say . . . it was *fast* with tens 
of thousands of files of varying size. (Say, anywhere from 4KB to 
dozens of megs.)
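
For readers who want something concrete, here is a minimal sketch of hashing a file in one streaming pass, in plain C. It is not the code from the utility mentioned above. The name `hash_file` is hypothetical, and 64-bit FNV-1a stands in for MD5 only to keep the example self-contained (on Mac OS X the CC_MD5_Init / CC_MD5_Update / CC_MD5_Final calls from CommonCrypto could take its place). The file is read through a fixed-size buffer, so memory use does not grow with file size.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: hashes a file's contents in one streaming pass.
   64-bit FNV-1a is an assumption made for this sketch, used in place of MD5.
   Returns 0 on success and stores the hash in *out, or -1 on an I/O error. */
int hash_file(const char *path, uint64_t *out)
{
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    uint64_t h = 14695981039346656037ULL;   /* FNV-1a 64-bit offset basis */
    unsigned char buf[16384];
    size_t n;

    /* Read the file chunk by chunk and fold every byte into the hash. */
    while ((n = fread(buf, 1, sizeof buf, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            h ^= buf[i];
            h *= 1099511628211ULL;          /* FNV-1a 64-bit prime */
        }
    }

    int failed = ferror(f);
    fclose(f);
    if (failed)
        return -1;

    *out = h;
    return 0;
}
```

Two files with different hashes are certainly different. Two files with equal hashes are only very probably identical, so a final byte-by-byte check is still needed if certainty matters.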


--
m-s

On 20 Nov, 2007, at 16:42, Frank Reiff wrote:

> Hi Jean-Daniel,
>
> Thanks for your response.
>
> On 16 Nov 2007, at 14:46, Jean-Daniel Dupas wrote:
>
>>
>> On 16 Nov 2007, at 14:25, Frank Reiff wrote:
>>
>>> Another issue is of course performance. Comparing byte-by-byte is 
>>> certainly the simplest and most reliable way of doing this, but 
>>> it's SLOW. On the other hand, I don't really know what the 
>>> performance characteristics of an MD5, CRC32, or SHA hash are, or 
>>> whether you need to read in the whole file contents to 
>>> apply them.
>>>
>>> It would thus be great if somebody, somewhere had published a 
>>> ready-to-use - (BOOL) file: (NSString*) path isIdenticalTo: 
>>> (NSString*) path2; method :-)
>>>
>>> I've spent the last two hours searching the web, but I haven't 
>>> found anything that comes close.

>>
>> You don't have to check byte-by-byte if the two files have a 
>> different size.
>> Then, comparing byte-by-byte is not so slow, as you can abort the 
>> comparison as soon as two bytes differ.
>>
>> Using a hash has no benefit when comparing two files on 
>> disk. It's only useful if you want to compare a remote file (with a 
>> precomputed hash) and a local file.

>
> I'll probably be going with:
>
> * check length
> * check last few bytes (many files begin with the same bytes but do 
> not end with the same ones)
> * check byte-by-byte
>
> Computing a hash could be interesting in situations where there are 
> lots and lots of files with the same length. Instead of having to 
> compare each file with all other files of the same length, one could 
> simply compute the hash by traversing it once and then compare the 
> hashes instead. Of course in order to be 100% certain one would need 
> to then do another byte-by-byte check. Alternatively, one could 
> cache the relationships between all files, e.g. A != B and B == C 
> means A != C and C != A
>
> I can see this could be fun :-)
>
> Best regards,
>
> Frank
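
The three steps in Frank's plan quoted above (length, last few bytes, byte-by-byte) could be sketched in plain C roughly as follows. This is an illustration under stated assumptions, not code from anyone in the thread: `files_identical`, `file_size`, the 64-byte tail and the 16 KB chunk size are all hypothetical choices, and sizes are obtained with fseek/ftell, which limits the sketch to files that fit in a `long`.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: size of an open file in bytes, or -1 on error. */
static long file_size(FILE *f)
{
    if (fseek(f, 0, SEEK_END) != 0)
        return -1;
    return ftell(f);
}

/* Hypothetical helper: returns 1 if both files have identical contents,
   0 if they differ, and -1 on an I/O error. */
int files_identical(const char *path1, const char *path2)
{
    enum { TAIL = 64, CHUNK = 16384 };   /* assumed sizes, tune as needed */
    unsigned char b1[CHUNK], b2[CHUNK];
    long s1, s2, tail;
    size_t n1, n2;
    int result = -1;

    FILE *f1 = fopen(path1, "rb");
    FILE *f2 = fopen(path2, "rb");
    if (f1 == NULL || f2 == NULL)
        goto done;

    /* Step 1: files of different length cannot be identical. */
    s1 = file_size(f1);
    s2 = file_size(f2);
    if (s1 < 0 || s2 < 0)
        goto done;
    if (s1 != s2) { result = 0; goto done; }

    /* Step 2: compare the last few bytes before reading everything. */
    tail = s1 < TAIL ? s1 : TAIL;
    if (fseek(f1, -tail, SEEK_END) != 0 || fseek(f2, -tail, SEEK_END) != 0)
        goto done;
    if (fread(b1, 1, (size_t)tail, f1) != (size_t)tail ||
        fread(b2, 1, (size_t)tail, f2) != (size_t)tail)
        goto done;
    if (memcmp(b1, b2, (size_t)tail) != 0) { result = 0; goto done; }

    /* Step 3: full comparison from the start, stopping at the first difference. */
    rewind(f1);
    rewind(f2);
    for (;;) {
        n1 = fread(b1, 1, sizeof b1, f1);
        n2 = fread(b2, 1, sizeof b2, f2);
        if (ferror(f1) || ferror(f2))
            goto done;
        if (n1 != n2 || memcmp(b1, b2, n1) != 0) { result = 0; goto done; }
        if (n1 == 0) { result = 1; goto done; }   /* both files exhausted */
    }

done:
    if (f1 != NULL) fclose(f1);
    if (f2 != NULL) fclose(f2);
    return result;
}
```

Comparing chunk by chunk with memcmp gives the same early exit as a literal byte-by-byte loop while keeping the number of read calls small.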

Related mails (subject, author, date):
Best way of identifying duplicate files in Cocoa, Frank Reiff, Nov 16, 14:25
Re: Best way of identifying duplicate files in Cocoa, Jean-Daniel Dupas, Nov 16, 14:46
Re: Best way of identifying duplicate files in Cocoa, matt.gough, Nov 16, 14:57
Re: Best way of identifying duplicate files in Cocoa, Frank Reiff, Nov 20, 22:42
Re: Best way of identifying duplicate files in Cocoa, Michael Watson, Nov 20, 23:48
Re: Best way of identifying duplicate files in Cocoa, Bill Bumgarner, Nov 21, 00:33
Re: Best way of identifying duplicate files in Cocoa, Jean-Daniel Dupas, Nov 21, 10:33
Re: Best way of identifying duplicate files in Cocoa, Bill Bumgarner, Nov 21, 10:55
Re: Best way of identifying duplicate files in Cocoa, Army Research Lab, Nov 21, 13:21
Re: Best way of identifying duplicate files in Cocoa, Frank Reiff, Nov 21, 15:23
Re: Best way of identifying duplicate files in Cocoa, Frank Reiff, Nov 21, 15:32
Re: Best way of identifying duplicate files in Cocoa, Frank Reiff, Nov 21, 15:40