FROM : Michael Watson
DATE : Tue Nov 20 23:48:46 2007
I implemented MD5 hashing and comparison in a file diff utility I
wrote for internal use, and I gotta say . . . it was *fast* with tens
of thousands of files of varying size. (Say, anywhere from 4KB to
dozens of megs.)
--
m-s
On 20 Nov, 2007, at 16:42, Frank Reiff wrote:
> Hi Jean-Daniel,
>
> Thanks for your response.
>
> On 16 Nov 2007, at 14:46, Jean-Daniel Dupas wrote:
>
>>
>> On 16 Nov 07, at 14:25, Frank Reiff wrote:
>>> Another issue is of course performance. Comparing byte-by-byte is
>>> certainly the simplest and most reliable way of doing this, but
>>> it's SLOW. On the other hand, I don't really know what the
>>> performance characteristics of an MD5, CRC32, or SHA hash are, or
>>> whether you need to read in the whole file contents to
>>> apply them.
>>>
>>> It would thus be great if somebody, somewhere had published a
>>> ready-to-use - (BOOL) file: (NSString*) path isIdenticalTo:
>>> (NSString*) path2; method :-)
>>>
>>> I've spent the last two hours searching the web, but I haven't
>>> found anything that comes close..
>>
>> You don't have to check byte-by-byte if the two files have
>> different sizes.
>> And comparing byte-by-byte is not so slow, as you can abort the
>> comparison as soon as two bytes differ.
>>
>> Using a hash has no benefit when comparing two files on the
>> disk. It's only useful if you want to compare a remote file (with
>> a precomputed hash) and a local file.
>
> I'll probably be going with:
>
> * check length
> * check last few bytes (many files begin with the same bytes but do
> not end with them)
> * check byte-by-byte
>
> Computing a hash could be interesting in situations where there are
> lots and lots of files with the same length. Instead of having to
> compare each file with all other files of the same length, one could
> simply compute each file's hash by traversing it once and then compare
> the hashes instead. Of course, to be 100% certain one would then need
> to do a byte-by-byte check as well. Alternatively, one could
> cache the relationships between all files, e.g. A != B and B == C
> means A != C and C != A.
>
> I can see this could be fun :-)
>
> Best regards,
>
> Frank
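
The three checks Frank lists above (length, last few bytes, then byte-by-byte) can be sketched in plain C, which is callable from Cocoa code. This is only a rough sketch under stated assumptions, not anything posted in the thread: the function name files_identical, the 64-byte tail size, and the 16 KB read buffer are all made up for illustration, and it assumes a POSIX stat() is available.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

enum { TAIL_BYTES = 64, CHUNK_BYTES = 16384 };  /* arbitrary illustrative sizes */

/* Hypothetical helper: returns 1 if the files have identical contents,
 * 0 if they differ, and -1 if either file cannot be examined. */
int files_identical(const char *path1, const char *path2)
{
    struct stat st1, st2;
    unsigned char buf1[CHUNK_BYTES], buf2[CHUNK_BYTES];
    FILE *f1 = NULL, *f2 = NULL;
    size_t tail, n1, n2;
    int result = -1;

    assert(path1 != NULL && path2 != NULL);
    if (stat(path1, &st1) != 0 || stat(path2, &st2) != 0)
        return -1;

    /* Step 1: files of different length cannot be identical. */
    if (st1.st_size != st2.st_size)
        return 0;
    if (st1.st_size == 0)
        return 1;  /* two empty files */

    f1 = fopen(path1, "rb");
    f2 = fopen(path2, "rb");
    if (f1 == NULL || f2 == NULL)
        goto done;

    /* Step 2: compare the last few bytes, which often differ even when
     * the files share a common header. */
    tail = st1.st_size < TAIL_BYTES ? (size_t)st1.st_size : (size_t)TAIL_BYTES;
    if (fseek(f1, -(long)tail, SEEK_END) != 0 ||
        fseek(f2, -(long)tail, SEEK_END) != 0)
        goto done;
    if (fread(buf1, 1, tail, f1) != tail || fread(buf2, 1, tail, f2) != tail)
        goto done;
    if (memcmp(buf1, buf2, tail) != 0) {
        result = 0;
        goto done;
    }
    rewind(f1);
    rewind(f2);

    /* Step 3: full comparison in chunks, stopping at the first mismatch. */
    for (;;) {
        n1 = fread(buf1, 1, sizeof buf1, f1);
        n2 = fread(buf2, 1, sizeof buf2, f2);
        if (n1 != n2 || memcmp(buf1, buf2, n1) != 0) {
            result = 0;
            break;
        }
        if (n1 < sizeof buf1) {  /* short read: end of file or read error */
            result = (ferror(f1) || ferror(f2)) ? -1 : 1;
            break;
        }
    }

done:
    if (f1 != NULL) fclose(f1);
    if (f2 != NULL) fclose(f2);
    return result;
}
```

A - (BOOL)file:isIdenticalTo: method like the one Frank asked for could wrap such a function by passing each NSString's fileSystemRepresentation. Reading in chunks rather than one byte at a time keeps the early-abort behaviour Jean-Daniel describes while avoiding per-byte I/O overhead.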
Related mails

| Author | Date |
|---|---|
| Frank Reiff | Nov 16, 14:25 |
| Jean-Daniel Dupas | Nov 16, 14:46 |
| matt.gough | Nov 16, 14:57 |
| Frank Reiff | Nov 20, 22:42 |
| Michael Watson | Nov 20, 23:48 |
| Bill Bumgarner | Nov 21, 00:33 |
| Jean-Daniel Dupas | Nov 21, 10:33 |
| Bill Bumgarner | Nov 21, 10:55 |
| Army Research Lab | Nov 21, 13:21 |
| Frank Reiff | Nov 21, 15:23 |
| Frank Reiff | Nov 21, 15:32 |
| Frank Reiff | Nov 21, 15:40 |