NSMapTable with C strings as keys

  • I'd like to have a dictionary using C strings as keys (because I
    already have const char* strings and would like to avoid creating
    NSString wrappers) and C structures as values.

    I thought using NSMapTable would be a good idea for this.

    Here is my code:

    // function for comparing string keys
    static BOOL MapTableKeyComparator(NSMapTable* table, const void* key1,
                                      const void* key2)
    {
        int result = strcmp((const char*)key1, (const char*)key2);
        return result == 0;
    }

    NSMapTable* _table;

    NSMapTableKeyCallBacks keyCallbacks = NSNonOwnedPointerMapKeyCallBacks;
    keyCallbacks.isEqual = MapTableKeyComparator;

    NSMapTableValueCallBacks valueCallbacks = NSNonOwnedPointerMapValueCallBacks;
    _table = NSCreateMapTable(keyCallbacks, valueCallbacks, 0);

    NSMapInsert(_table, name, value);

    value = NSMapGet(_table, name);

    Now, the problem is that sometimes when I try to get a value from the
    table, the MapTableKeyComparator function is not called at all, and
    NSMapGet returns NULL, though an immediate dump of the table shows that
    all previous records are perfectly present in the table. The bug
    happens sporadically, at different moments. Sometimes it happens at a
    line of code where it worked perfectly in the last debug session.

    What could be wrong?
  • On May 27, 2013, at 10:46 PM, Oleg Krupnov <oleg.krupnov...> wrote:

    > Now, the problem is that sometimes when I try to get a value from the
    > table, the MapTableKeyComparator function is not called at all, and
    > NSMapGet returns NULL, though an immediate dump of the table shows that
    > all previous records are perfectly present in the table.

    Probably because you haven’t implemented a hash function, only an equals function. I’m guessing NSMapTable’s default hash function merely hashes the key pointer itself, which means that if you pass it a different pointer to an equal C string, it won’t find anything.
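
    Something along these lines should fix it (a sketch, untested; djb2 here is just one well-known cheap string hash, anything similar will do):

    // Hash the *contents* of the C string, not its address (djb2).
    static NSUInteger MapTableKeyHash(NSMapTable* table, const void* key)
    {
        const unsigned char* s = key;
        NSUInteger hash = 5381;
        while (*s)
            hash = hash * 33 + *s++;
        return hash;
    }

    // ...and next to the isEqual assignment:
    keyCallbacks.hash = MapTableKeyHash;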

    —Jens
  • Hi Jens,

    I guess you may be right. But… two questions in this regard:

    1. I thought that "isEqual" method is alternative to "hash" method,
    because searching by key and searching by hash are two mutually
    exclusive methods of looking up values, aren't they?

    2. What hash function you'd suggest in my case, that would calculate
    unsigned int on output, for C strings? Because calculating hash
    functions (such as md5) may be computationally expensive, which could
    undermine my entire idea of sparing extra few calls on creating
    NSStrings :)

    Thanks!

  • Cryptographic hashes such as MD5 are expensive to calculate because
    they are meant to make it difficult for an attacker to come up with
    different inputs that produce identical hashes.

    For the purpose of looking up strings in a hash table, a 32-bit CRC
    (Cyclic Redundancy Check) would be quick to calculate, with few
    collisions. It also doesn't require multiprecision arithmetic.
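
    For example, a table-less bitwise CRC-32 fits in a few lines (a sketch;
    a table-driven version is considerably faster):

    #include <stdint.h>

    // Bitwise CRC-32 (IEEE polynomial, reflected form 0xEDB88320).
    static uint32_t CRC32OfCString(const char* s)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (; *s; s++) {
            crc ^= (uint8_t)*s;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        }
        return ~crc;
    }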

    Mike Crawford
    <mdcrawford...>

    --
    Michael David Crawford
    mdcrawford at gmail dot com

      Custom Software Development for the iPhone and Mac OS X
      http://www.dulcineatech.com/custom-software-development/
    On 28 May 2013, at 08:25, Oleg Krupnov <oleg.krupnov...> wrote:

    > Hi Jens,
    >
    > I guess you may be right. But… two questions in this regard:
    >
    > 1. I thought that the "isEqual" method is an alternative to the "hash"
    > method, because searching by key and searching by hash are two mutually
    > exclusive ways of looking up values, aren't they?
    >

    No, they aren't. The hash is used to speed up the lookup.
    The hash table first uses the hash to find which bucket the element is in, and since the hash is not guaranteed to be unique, it then uses the isEqual method to determine which element in that bucket is the one you are looking for.
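
    Schematically, the lookup goes like this (a sketch with illustrative types only, not Foundation's actual internals):

    #include <stddef.h>

    typedef struct Entry {
        const void* key;
        void* value;
        struct Entry* next;
    } Entry;

    typedef struct {
        Entry** buckets;
        size_t bucketCount;
        size_t (*hash)(const void* key);
        int (*isEqual)(const void* a, const void* b);
    } Table;

    // The hash narrows the search to one bucket; isEqual then picks
    // the matching entry within that bucket's chain.
    void* TableLookup(const Table* t, const void* key)
    {
        size_t b = t->hash(key) % t->bucketCount;
        for (const Entry* e = t->buckets[b]; e != NULL; e = e->next)
            if (t->isEqual(e->key, key))
                return e->value;
        return NULL;
    }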

    > 2. What hash function would you suggest in my case, one that produces
    > an unsigned int for C strings? Calculating hash functions (such as MD5)
    > may be computationally expensive, which could undermine my entire idea
    > of sparing the extra few calls for creating NSStrings :)

    The main issue with using C strings is memory management of your keys. NSString handles that with reference counting, but you will have to take care of everything yourself if you use C strings.
    Avoiding NSString without being sure this will impact performance is just "premature optimization".

    That said, there are two well-known hash functions usually used for this kind of hashing: CityHash and MurmurHash (http://code.google.com/p/smhasher/).

    -- Jean-Daniel
  • I just made the following experiment:

    I specified a hash method for my NSMapTable, but it always returns 0.
    This seems to force NSMapGet to always use key comparison when
    searching for elements, since the hash is never unique.

    Is this a good idea?

    Let me think. The keys are quite short strings, like 5-10 chars.

    If we accept the expense of calculating the hash once for each search,
    then comparing hash values is very quick, plus a few string comparisons
    at the end if the hash is not unique.

    If we skip calculating the hash, we will compare strings at each
    branch of the binary tree. Besides, it seems that the isEqual method
    returns BOOL, which means the keys will not be sorted, which makes
    binary-tree search impossible? In that case a hash seems the only sane
    solution for NSMapTable. Am I right?

  • On May 28, 2013, at 12:38 AM, Oleg Krupnov <oleg.krupnov...> wrote:

    > I just made the following experiment:
    >
    > I specified a hash method for my NSMapTable, but it always returns 0.
    > This seems to force NSMapGet to always use key comparison when
    > searching for elements, since the hash is never unique.
    >
    > Is this a good idea?

    If you're going to do that, why bother with an NSMapTable at all? Just store your pointers in a C array.

    >
    > Let me think. The keys are quite short strings, like 5-10 chars.

    How about making your hash function just return the first character cast to int?
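
    i.e. something like this (a sketch; with 5-10 character keys that gives at most a few dozen distinct buckets, but it costs almost nothing):

    // Hash on the first byte only: very cheap, very coarse.
    static NSUInteger FirstCharHash(NSMapTable* table, const void* key)
    {
        return (NSUInteger)*(const unsigned char*)key;
    }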

    >
    > If we accept the expense of calculating the hash once for each search,
    > then comparing hash values is very quick, plus a few string comparisons
    > at the end if the hash is not unique.

    It's not useful to pursue this line of thinking without benchmarks.

    --Kyle Sluder
  • > If you're going to do that, why bother with an NSMapTable at all? Just store your pointers in a C array.

    The string pointers can be different, but they can contain identical
    string keys, resulting in identical values. I wanted to find values in
    a more efficient way than dumb array iteration, say, like the binary
    tree search which is probably used in NSDictionary or NSMapTable.

    > It's not useful to pursue this line of thinking without benchmarks.

    Now I guess you're right. I just didn't know it was going to be that
    complicated. I thought it was pure benefit and cheap. I am reverting
    to using NSDictionary.
  • On May 27, 2013, at 11:42 PM, Michael Crawford <mdcrawford...> wrote:

    > For purposes of looking up strings in a hash table, a 32 bit CRC
    > (Cyclic Redundancy Check) would be quick to calculate with few
    > collisions.  It also doesn't require multiprecision arithmetic.

    CRC is primarily a checksum, not a hash function. It's good for verifying data integrity, e.g. in a network protocol or file format, but more expensive than you’d like for a hash table. There are much faster hash functions: Wikipedia has a good list[1].
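
    For example, FNV-1a from that list is a couple of lines and costs only an XOR and a multiply per byte (a sketch):

    #include <stdint.h>

    // FNV-1a: h = (h ^ byte) * prime, starting from the offset basis.
    static uint32_t FNV1aCString(const char* s)
    {
        uint32_t h = 2166136261u;     // FNV offset basis
        for (; *s; s++) {
            h ^= (uint8_t)*s;
            h *= 16777619u;           // FNV prime
        }
        return h;
    }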

    —Jens

    [1] http://en.wikipedia.org/wiki/List_of_hash_functions#Non-cryptographic_hash_functions
  • On May 28, 2013, at 4:10 PM, Jens Alfke wrote:

    > CRC is primarily a checksum, not a hash function. It's good for verifying data integrity, e.g. in a network protocol or file format, but more expensive than you’d like for a hash table. There are much faster hash functions: Wikipedia has a good list[1].

    http://en.wikipedia.org/wiki/Hash_function

    Hashing with cryptographic hash functions:
    Some cryptographic hash functions, such as SHA-1, have even stronger uniformity guarantees than checksums or fingerprints, and thus can provide very good general-purpose hashing functions.
    In ordinary applications, this advantage may be too small to offset their much higher cost. However, this method can provide uniformly distributed hashes even when the keys are chosen by a malicious agent. This feature may help to protect services against denial-of-service attacks.

    I thought I saw SHA-1 being used as a general-purpose hash function somewhere sort of surprising recently, but I'm not remembering exactly where. Maybe the collision resistance or hash-value uniformity outweigh the performance concerns?

    Michael Hall

    trz nio.2 for OS X http://www195.pair.com/mik3hall/index.html#trz

    HalfPipe Java 6/7 shell app http://www195.pair.com/mik3hall/index.html#halfpipe

    AppConverter convert Apple jvm to openjdk apps http://www195.pair.com/mik3hall/index.html#appconverter
  • On May 28, 2013, at 5:27 PM, Michael Hall wrote:

    > I thought I saw SHA-1 being used as a general purpose hash function somewhere sort of surprising recently but I'm not remembering exactly where.

    Ah, sorry to reply to my own but maybe this was it…

    https://news.ycombinator.com/item?id=4036878
    SHA-1 is still used in applications such as git as a general purpose hash function.

    Not this particular article where I saw it, but I recently signed up on git and think I may have seen its use then.

    Michael Hall

  • On 28/05/2013, at 3:46 PM, Oleg Krupnov <oleg.krupnov...> wrote:

    > I'd like to have a dictionary using C strings as keys (because I
    > already have const char* strings and would like to avoid creating
    > NSString wrappers)

    For the sake of avoiding something you *assume* to be slow, or inefficient, you've taken the discussion in a direction that is vastly more complicated.

    Why not just create NSString wrappers? By using the -initWithBytesNoCopy:length:encoding:freeWhenDone: method you can avoid it copying the actual C string characters, it literally just becomes a thin wrapper.
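
    For example (a sketch; WrapCString is just an illustrative name, it assumes ASCII bytes, and the C buffer must outlive the string since freeWhenDone:NO leaves ownership with the caller):

    #import <Foundation/Foundation.h>
    #include <string.h>

    // Wrap an existing C string in an NSString without copying the bytes.
    NSString* WrapCString(const char* cstr)
    {
        return [[NSString alloc] initWithBytesNoCopy:(void*)cstr
                                              length:strlen(cstr)
                                            encoding:NSASCIIStringEncoding
                                        freeWhenDone:NO];
    }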

    K.I.S.S.! If you can prove this approach is a problem by actual profiling, then OK, then you can talk about a more complex solution.

    --Graham
  • On May 28, 2013, at 3:39 PM, Michael Hall <mik3hall...> wrote:

    > https://news.ycombinator.com/item?id=4036878
    > SHA-1 is still used in applications such as git as a general purpose hash function.
    > Not this particular article where I saw it, but I recently signed up on git and think I may have seen its use then.

    Not exactly. It’s used in many version control systems as a highly reliable checksum to produce unique IDs for revisions, particularly IDs that can’t be forged. That’s not the same as a hash function, because in an application like this a hash collision would cause serious problems like data loss. (That’s why you use something with 160 bits of output, not 32!)

    I’m sure these Git revision IDs do get used as hash keys at some point in some algorithms, but that’s not really their purpose (if you already have a SHA digest it’s cheaper to use it directly as a hash code than running its 160 bits through an in-memory hash algorithm, but it’s not necessary.)

    —Jens
  • On May 28, 2013, at 3:39 PM, Michael Hall <mik3hall...> wrote:
    > On May 28, 2013, at 5:27 PM, Michael Hall wrote:
    >> I thought I saw SHA-1 being used as a general purpose hash function somewhere sort of surprising recently but I'm not remembering exactly where.
    >
    > Ah, sorry to reply to my own but maybe this was it…
    >
    > https://news.ycombinator.com/item?id=4036878
    > SHA-1 is still used in applications such as git as a general purpose hash function.
    >
    > Not this particular article where I saw it, but I recently signed up on git and think I may have seen its use then.

    For this sort of use I expect SHA-1 is chosen in part because it computes a bigger value than a typical hash-table hash. (160 bits for SHA-1 and 256+ for SHA-2, versus 32 or 64 for a typical hash table.)

    git in particular wants an ID that is as globally unique as possible so 64 bits is not enough, is computing a small number of hashes so the extra per-hash setup time for a cryptographic hash is less important, and is probably I/O bound anyway so the extra CPU time of a cryptographic hash is less important.

    --
    Greg Parker    <gparker...>    Runtime Wrangler
  • > Why not just create NSString wrappers? By using the -initWithBytesNoCopy:length:encoding:freeWhenDone: method you can avoid it copying the actual C string characters, it literally just becomes a thin wrapper.

    In my case it's more about extra calls than extra memory but thanks! Didn't know about this.

    > For the sake of avoiding something you *assume* to be slow, or inefficient, you've taken the discussion in a direction that is vastly more complicated.

    The code in question is frequently used in many places, so I think it's worth optimizing.

    This KISS school of thought brought Microsoft to building the slow and inefficient .NET technology, and it had a hard time persuading everybody and his dog that hardware evolves faster than they write inefficient code, so it was "ok"; ultimately this approach failed.

    While I generally agree that premature optimization is evil, I do not understand why I cannot or shouldn't always keep in mind the cost of the things I am using, and consider more efficient approaches, especially when they are simple. (This time it turned out to be neither simple nor efficient, so I changed my mind.)

    The profiler is not a panacea; when you have hundreds of small, not-so-efficient pieces of code like this, all you see in the profiler is a long list of small consumers, totaling heavy use of Obj-C runtime calls.

  • On May 28, 2013, at 5:55 PM, Greg Parker wrote:

    > For this sort of use I expect SHA-1 is chosen in part because it computes a bigger value than a typical hash-table hash. (160 bits for SHA-1 and 256+ for SHA-2, versus 32 or 64 for a typical hash table.)
    >
    > git in particular wants an ID that is as globally unique as possible so 64 bits is not enough, is computing a small number of hashes so the extra per-hash setup time for a cryptographic hash is less important, and is probably I/O bound anyway so the extra CPU time of a cryptographic hash is less important.

    This is sort of off-topic, but I think this is a little tougher to be sure about, isn't it?
    2^64 is a pretty big number, but DES fell out of favor because attacks emerged that could brute-force it in about 2^56 tries, and the birthday paradox may apply here too, making collisions much more likely at something well below 2^64. I don't know. SHA-1 for purposes like Jens suggested, avoiding forgeability, is also, I thought, on the way out due to attackers' progress in causing collisions. But it has been around a long time, and I suppose converting something like git to SHA-2 or another function would be a pretty large effort. If attackers aren't the concern, I think it might still make a slow and large but tried-and-true general-purpose hash, with usually good collision resistance and uniform distribution.

    Michael Hall

    On 29 May 2013, at 00:46, Graham Cox <graham.cox...> wrote:

    > For the sake of avoiding something you *assume* to be slow, or inefficient, you've taken the discussion in a direction that is vastly more complicated.
    >
    > Why not just create NSString wrappers? By using the -initWithBytesNoCopy:length:encoding:freeWhenDone: method you can avoid it copying the actual C string characters, it literally just becomes a thin wrapper.
    >
    > K.I.S.S.! If you can prove this approach is a problem by actual profiling, then OK, then you can talk about a more complex solution.

    It may be a thin wrapper, but unlike NSData, which actually wraps the passed buffer, NSString does not make any guarantee about what it does.
    My experience is that most of the time, NSString will actually copy the bytes and create its own internal representation.

    That said, even if NSString creates an internal representation, I agree that trying a more complex approach without profiling data is pointless.

    -- Jean-Daniel
    On 29 May 2013, at 06:14, Oleg Krupnov <oleg.krupnov...> wrote:

    >> Why not just create NSString wrappers? By using the -initWithBytesNoCopy:length:encoding:freeWhenDone: method you can avoid it copying the actual C string characters, it literally just becomes a thin wrapper.
    >
    > In my case it's more about extra calls than extra memory but thanks! Didn't know about this.
    >

    If method dispatch time is your main problem, just use the CFString and CFDictionary APIs.

    -- Jean-Daniel
  • On May 28, 2013, at 9:14 PM, Oleg Krupnov <oleg.krupnov...> wrote:

    > The code in question is frequently used in many places, so I think it's worth optimizing.

    Did you actually profile the app and find that a lot of time is spent in this code?

    > While I generally agree that premature optimization is evil, I do not understand why I cannot or shouldn't always keep in mind the cost of the things I am using, and consider more efficient approaches, especially when they are simple. (This time it turned out to be neither simple nor efficient, so I changed my mind.)

    Because the most important measure of time is the time you have available to develop the app. Any time you spend optimizing something is time you don’t have available to optimize a more significant piece of code, or add features, or fix bugs. Also, optimized code is often more complex than KISS code, making it less reliable, and harder to debug and improve.

    Sometimes it’s really obvious that something is going to be a major bottleneck. (Or it’s not obvious, but you’re enough of an expert in the domain to know from experience that it will be.) And sometimes it’s almost as easy to do something efficiently as to do it slowly. But most often it does pay off to do it the simple way first and then go back and optimize only the things that are important.

    —Jens
  • On May 28, 2013, at 9:14 PM, Oleg Krupnov <oleg.krupnov...> wrote:

    > The profiler is not a panacea; when you have hundreds of small, not-so-efficient pieces of code like this, all you see in the profiler is a long list of small consumers, totaling heavy use of Obj-C runtime calls.

    Oh, also: What you’ve said above is a good reason for factoring out common code instead of repeating it in multiple places (aka “Don’t Repeat Yourself” or DRY). If you write this string lookup code once as a function/method you can use in many places, then it will show up as a hot-spot in the profiler if it’s slow, and then optimizing it in one place will speed up your app. If you copy/paste the code instead, the time will be split up among multiple copies of it so it’s less likely to show up, and even if it does show up you’d have to implement the optimization in many places.

    (Unfortunately if you use inline functions you can have cases where you did only write the code once but the compiler copied it into lots of places, making it not show up as a hot spot. That’s a reason to be really careful about inlining; it can backfire on you. Mostly that's only a problem in C++ though, where inlining and templates are used a lot and it’s easy to write code that looks short and simple but explodes into a lot of instructions. I’ve had the “fun” of trying to optimize code like that before.)

    —Jens
  • On 29 May 2013, at 14:14, Oleg Krupnov <oleg.krupnov...> wrote:

    > While I generally agree that premature optimization is evil,

    That seems to come out of a belief that well-structured code is code that runs poorly (this belief came out of an IBM system of the 50s/60s that had really poorly running subroutine calls - so the recommendation was not to structure things).

    In one sense I think we should always be optimizing code, in the sense that we write well-structured and correct code. Optimization usually follows naturally. There seems to be a school of thought that poorly structured code must be written first and optimized later, but I have usually seen that most badly running code is the result of poor structure, and that refactoring makes it both structured and optimized. Once you have reached such a structured state, you might then see where further optimizations can be made to profiled areas of code, which might result in code that is more complex but still structured and optimized. Where something simple is replaced by a more complex algorithm, it should be well documented.

    That is the subject of Dijkstra's book "A Discipline of Programming". Once a well-structured algorithm is developed, better optimizations can then be seen. This is the 'engineering' part of software development: making software run practically (not the drawing-diagrams part). For particular computations (that is, where the algorithm works on particular data sets), an algorithm might not be optimal, so it needs to cater to those cases.

    In summary - we should be producing well-structured code early on. This will usually result in well-optimized code. However, this is not a guarantee and particular cases may need to be addressed later (that is the premature optimization case). But we should always aim to produce well-structured code.
  • On May 29, 2013, at 6:30 PM, Ian Joyner <ianjoyner...> wrote:

    > That seems to come out of a belief that well-structured code is code that runs poorly

    No, it’s a paraphrase of a famous quote by Don Knuth ("We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”[1]) He later attributed this to C.A.R. Hoare.

    The main points behind this are, in my opinion, that: (a) you don’t ever have time to optimize the entire program, and (b) it’s often very unintuitive which parts of the code are bottlenecks. Additionally I find that (c) lots of the code I write ends up being scaffolding that’s going to get replaced anyway later on during development; and (d) heavily optimized code is often harder to maintain.

    A more extreme version of this statement is Ward Cunningham's mantra “Do The Simplest Thing That Could Possibly Work”[2].

    I think these are two of the best lessons I’ve learned as I progressed in my craft. They've made me a lot more productive. There’s no point in optimizing something that never gets finished; and getting rat-holed into tweaking tiny details was really preventing me from getting things to the point where they were useable at all.

    —Jens

    [1] http://en.wikipedia.org/wiki/Program_optimization#When_to_optimize
    [2] http://en.wikiquote.org/wiki/Ward_Cunningham#The_Simplest_Thing_that_Could_Possibly_Work
  • I'm not disagreeing with anything about knowing/optimizing your real
    bottlenecks.

    But I did do hash table benchmarks a few months back:
    http://playcontrol.net/opensource/LuaHashMap/benchmarks.html

    CFDictionary I did not formally include in the benchmark, but I did
    run it on the side out of curiosity. I found that the C-string to
    CFString conversion ended up putting it at the bottom of the list in
    terms of performance.

    -Eric
    --
    Beginning iPhone Games Development
    http://playcontrol.net/iphonegamebook/
  • On May 29, 2013, at 10:29 PM, Eric Wing <ewmailing...> wrote:

    > But I did do hash table benchmarks a few months back:
    > http://playcontrol.net/opensource/LuaHashMap/benchmarks.html

    Perhaps off topic, but I wonder if it would be possible to alter your line charts so that those circles that appear periodically along the lines are actually a different shape for each line instead of always being circles. This wouldn't alter the visual appearance of the charts too much, and would make them much easier to read for those who suffer from colorblindness and thus have trouble differentiating using color alone (I particularly find it difficult to distinguish the Tcl line from the Python line in the chart at the top of the linked page without the use of Digital ColorMeter).

    Charles
  • On May 29, 2013, at 8:29 PM, Eric Wing <ewmailing...> wrote:

    > CFDictionary I did not formally do in the benchmark, but I did run on
    > the side for curiosity. I found that the C-string to CFString
    > conversion ended up putting it at the bottom of the list in terms of
    > performance.

    It seems unfair to include the C-string-to-object conversion time in the benchmark, if what you’re trying to measure is hash-table performance. Did you include that in the benchmarks for the other languages too? (And using real Unicode-savvy string objects — I know Python has both types)?

    —jens
  • On 5/29/13, Charles Srstka <cocoadev...> wrote:
    > On May 29, 2013, at 10:29 PM, Eric Wing <ewmailing...> wrote:
    >
    >> But I did do hash table benchmarks a few months back:
    >> http://playcontrol.net/opensource/LuaHashMap/benchmarks.html
    >
    > Perhaps off topic, but I wonder if it would be possible to alter your line
    > charts so that those circles that appear periodically along the lines are
    > actually a different shape for each line instead of always being circles.
    > This wouldn't alter the visual appearance of the charts too much, and would
    > make them much easier to read for those who suffer from colorblindness and
    > thus have trouble differentiating using color alone (I particularly find it
    > difficult to distinguish the Tcl line from the Python line in the chart at
    > the top of the linked page without the use of Digital ColorMeter).
    >

    I think it should be possible, but I didn't really understand the
    chart-generation code. I was reusing tools from the original benchmark
    and tried not to spend too much time on it. All the chart generation
    is done in HTML, and the raw data points are actually embedded in the
    pages (they are not bitmap images), so all the original data is
    accessible/scrape-able from the page.

    -Eric
  • On 5/29/13, Jens Alfke <jens...> wrote:
    >
    > On May 29, 2013, at 8:29 PM, Eric Wing <ewmailing...> wrote:
    >
    >> CFDictionary I did not formally do in the benchmark, but I did run on
    >> the side for curiosity. I found that the C-string to CFString
    >> conversion ended up putting it at the bottom of the list in terms of
    >> performance.
    >
    > It seems unfair to include the C-string-to-object conversion time in the
    > benchmark, if what you’re trying to measure is hash-table performance. Did
    > you include that in the benchmarks for the other languages too? (And using
    > real Unicode-savvy string objects — I know Python has both types)?
    >
    > —jens

    In my introduction, I was fairly adamant that time lost to impedance
    mismatch should be measured because my interest was real performance
    in using these libraries for projects based in C. That conversion time
    is non-trivial.

    And yes, all languages I benchmarked had to go through the same rules
    (i.e. deal with const char*). This is why C++ std::string actually did
    not fare so well compared to some of the other languages. (And the
    magical optimizer did not come to the rescue.) For Python, I was not
    aware of Unicode vs. other APIs. Since the focus was C, I wasn't
    really thinking about Unicode. I used PyDict_SetItemString to insert a
    string (the source code is linked on that page somewhere).

    Since I also did integer hash benchmarks, you can get an idea of pure
    hashing performance from those benchmarks since there is little
    conversion overhead in most of those cases. (The exception is Perl
    which seems to convert everything to strings which I did not expect.)

    -Eric
    On 30 May 2013, at 08:35, Eric Wing <ewmailing...> wrote:

    > In my introduction, I was fairly adamant that time lost to impedance
    > mismatch should be measured because my interest was real performance
    > in using these libraries for projects based in C. That conversion time
    > is non-trivial.
    >
    > And yes, all languages I benchmarked had to go through the same rules
    > (i.e. deal with const char*). This is why C++ std::string actually did
    > not fare so well compared to some of the other languages. (And the
    > magical optimizer did not come to the rescue.) For Python, I was not
    > aware of Unicode vs. other APIs. Since the focus was C, I wasn't
    > really thinking about Unicode. I used PyDict_SetItemString to insert a
    > string (the source code is linked on that page somewhere).
    >

    I follow Jens on this one. Nothing prevents you from using CFDictionary with C strings directly, or with any other type.
    You don't have to create CFStrings.
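
    For example (an untested sketch; CreateCStringTable is an illustrative name, and the callbacks keep keys as raw C strings and never retain them, so the key storage must stay alive as long as the dictionary does):

    #include <CoreFoundation/CoreFoundation.h>
    #include <string.h>

    static Boolean CStrEqual(const void* a, const void* b)
    {
        return strcmp(a, b) == 0;
    }

    static CFHashCode CStrHash(const void* key)
    {
        const unsigned char* s = key;
        CFHashCode h = 5381;              // djb2; any cheap string hash works
        while (*s) h = h * 33 + *s++;
        return h;
    }

    // Fields: version, retain, release, copyDescription, equal, hash.
    static const CFDictionaryKeyCallBacks kCStrKeyCallBacks = {
        0, NULL, NULL, NULL, CStrEqual, CStrHash
    };

    // NULL value callbacks: values are stored as raw, unretained pointers.
    CFMutableDictionaryRef CreateCStringTable(void)
    {
        return CFDictionaryCreateMutable(NULL, 0, &kCStrKeyCallBacks, NULL);
    }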

    -- Jean-Daniel
  • On 30 May 2013, at 12:33, Jens Alfke <jens...> wrote:
    >
    > On May 29, 2013, at 6:30 PM, Ian Joyner <ianjoyner...> wrote:
    >
    >> That seems to come out of a belief that well-structured code is code that runs poorly
    >
    > No, it’s a paraphrase of a famous quote by Don Knuth ("We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”[1]) He later attributed this to C.A.R. Hoare.

    OK, thanks, it was badly put on my part. I have Knuth's paper and will give it another read (after rereading Structured Programming (Dahl, Dijkstra, Hoare) which Knuth says will change your life).

    What I am trying to point out, though, is that there is a misapprehension that avoiding premature optimization means not writing structured code early on because you can always go and clean it up later. Rather, I think we should write well-structured code as we go. Often that will result in efficient code (invariant statements not being inside loops, etc.), but that is not what is meant by premature optimization.

    This kind of optimization is more about changing the algorithm than the code. We should beware not to lump writing well-structured code in with premature optimization just because well-structured code usually ends up being optimized code. Well-structured code also ends up being more correct and more easily changed and refactored.
  • On May 30, 2013, at 3:52 AM, Ian Joyner <ianjoyner...> wrote:

    > What I am trying to point out, though, is that there is a misapprehension that avoiding premature optimization means not writing structured code early on because you can always go and clean it up later. Rather, I think we should write well-structured code as we go.

    I agree 100% with that. Structured code is easier to benchmark and optimize later on, anyway — as I said, if you repeat a piece of code ten times, none of the ten instances may show up individually as hot spots, whereas if you called a common function/method in ten places, it may show up and then be easy to fix.

    I would add, though, that the perfect structure (class hierarchy, API, factoring…) may not be apparent right away, especially since you’ll often end up refactoring as you go along. So trying too hard to structure code as you’re initially writing it can end up being a rathole.

    —Jens
  • On May 29, 2013, at 11:35 PM, Eric Wing <ewmailing...> wrote:

    > In my introduction, I was fairly adamant that time lost to impedance
    > mismatch should be measured because my interest was real performance
    > in using these libraries for projects based in C. That conversion time
    > is non-trivial.

    If you’re coding in C, I don’t think it makes sense to consider other languages (except maybe C++) just for their hashtable implementations! There’s inevitably going to be impedance mismatch: not just with data conversion but also memory management, since most of the languages you tested are garbage collected. That means you now have two heaps (a GC heap embedded inside your malloc heap) and extra memory usage and cost for running the GC. Not to mention the bloat on your app’s code size.

    Surely there are enough reusable C-based hashtable libraries that you don’t need to embed a whole Lua or Python interpreter just for that purpose. (And if not, it’s really easy to write one.)

    —Jens
  • On 5/30/13, Jens Alfke <jens...> wrote:
    >
    > On May 29, 2013, at 11:35 PM, Eric Wing <ewmailing...> wrote:
    >
    >> In my introduction, I was fairly adamant that time lost to impedance
    >> mismatch should be measured because my interest was real performance
    >> in using these libraries for projects based in C. That conversion time
    >> is non-trivial.
    >
    > If you’re coding in C, I don’t think it makes sense to consider other
    > languages (except maybe C++) just for their hashtable implementations!
    > There’s inevitably going to be impedance mismatch: not just with data
    > conversion but also memory management, since most of the languages you
    > tested are garbage collected. That means you now have two heaps (a GC heap
    > embedded inside your malloc heap) and extra memory usage and cost for
    > running the GC. Not to mention the bloat on your app’s code size.
    >
    > Surely there are enough reusable C-based hashtable libraries that you don’t
    > need to embed a whole Lua or Python interpreter just for that purpose. (And
    > if not, it’s really easy to write one.)
    >
    > —Jens

    So that's all in the introduction too. But that's all conjecture.
    Performance tuning requires actual measurement/benchmarking. The
    question was to put it to the test. Most people found the results
    surprising because a lot of conventional wisdom did not hold up.

    -Eric
  • On 31 May 2013, at 01:43, Jens Alfke <jens...> wrote:
    >
    > On May 30, 2013, at 3:52 AM, Ian Joyner <ianjoyner...> wrote:
    >
    >> What I am trying to point out, though, is that there is a misapprehension that avoiding premature optimization means not writing structured code early on because you can always go and clean it up later. Rather, I think we should write well-structured code as we go.
    >
    > I agree 100% with that. Structured code is easier to benchmark and optimize later on, anyway — as I said, if you repeat a piece of code ten times, none of the ten instances may show up individually as hot spots, whereas if you called a common function/method in ten places, it may show up and then be easy to fix.
    >
    > I would add, though, that the perfect structure (class hierarchy, API, factoring…) may not be apparent right away, especially since you’ll often end up refactoring as you go along. So trying too hard to structure code as you’re initially writing it can end up being a rathole.

    Absolutely. It's often the case that we try to get everything ultra-structured before writing any code. This can be a writer's block. We often have to use the process of coding as an exploratory process. That is the very essence of software. It also means that software can be refined as we go. But often we need to know when to stop.

    Structured (and OO) languages directly support the paradigms. This seems not to be well understood, since we attempt to structure code ("code" is a term I don't like) with diagrams, because we treat the act of programming as translating them into a machine-oriented artefact. But in fact a text-based programming language is often more malleable than such diagrams (in terms of both the tools and the mental attitude that a diagram is fixed, even if a programmer comes up with better designs).