inconsistent behavior of NSString's localizedCaseInsensitiveCompare

  • Hello everyone,

    I'm using NSString's localizedCaseInsensitiveCompare to maintain a data structure sorted by some keys, which are strings that will be displayed to the end-user. For this question it's enough to think of an array of sorted strings, on which binary searches are run. The problem is that the localized string comparison method produces return values that are inconsistent.

    For example, consider the following sorted list of strings:

    "aaa" < "laso" < "lasso" < "zzz"

    I would expect that any new string I want to insert into this list would compare so that it has just one point of insertion. But this is not true if I test "laßt", where the German Eszett "ß" character sounds/behaves like "ss":

    (gdb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"aaa"]
    $16 = 1
    (gdb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"laso"]
    $17 = -1
    (gdb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"lasso"]
    $18 = 1
    (gdb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"zzz"]
    $19 = -1

    So, when using a binary search, I get different answers depending on the other strings in the list!

    I've tried CFStringCompareWithOptionsAndLocale with some flags (kCFCompareCaseInsensitive | kCFCompareLocalized | kCFCompareForcedOrdering), but it has the same quirk.

    This seems nutty to me. Surely a single string should have a single proper sort ordering. Are my intuitions misplaced, or is this a bug?

    Thanks for any thoughts on this matter,
    ~Martin
  • On 05.05.12 23:07, Martin Wierschin wrote:
    > So, when using a binary search, I get different answers depending on the other strings in the list!

    Seems to work for me:

    On 10.7.3, Xcode 4.3.1, for both locale en_US and locale de_AT I get:

    (lldb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"aaa"]
    (NSComparisonResult) $4 = 1
    (lldb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"laso"]
    (NSComparisonResult) $5 = 1
    (lldb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"lasso"]
    (NSComparisonResult) $6 = 1
    (lldb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"zzz"]
    (NSComparisonResult) $7 = -1
    (lldb)

    Are you sure your debugger isn't lying to you? Maybe you have category on
    NSString that does that?

    Regards
    Markus
    --
    __________________________________________
    Markus Spoettl
  • On May 5, 2012, at 4:46 PM, Markus Spoettl wrote:

    > On 05.05.12 23:07, Martin Wierschin wrote:
    >> So, when using a binary search, I get different answers depending on the other strings in the list!
    >
    > Seems to work for me:
    >
    > On 10.7.3, Xcode 4.3.1, for both locale en_US and locale de_AT I get:
    >
    > [...]
    >
    > Are you sure your debugger isn't lying to you? Maybe you have category on NSString that does that?

    I was able to reproduce the issue that Martin reported.  I had to set my system Region to Germany.  It didn't happen with it set to United States.  Also, I had to start a new instance of Python, which is where I did my testing.

    Tested on Mac OS X 10.6.8.

    Markus, how did you set the locale?

    Regards,
    Ken
  • On Sat, May 5, 2012 at 4:46 PM, Markus Spoettl <ms_lists...> wrote:
    > On 05.05.12 23:07, Martin Wierschin wrote:
    >>
    >> So, when using a binary search, I get different answers depending on the
    >> other strings in the list!
    >
    >
    > Seems to work for me:

    I was using F-Script to investigate this, but I'm seeing the same
    thing (locale: en_US):

    # a := {'aaa', 'laso', 'lasso', 'zzz'}

    # 'laßt' localizedCaseInsensitiveCompare:@a
    {1, -1, 1, -1}

    Interestingly enough...

    # 'laßt' localizedCompare:@a
    {1, 1, 1, -1}

    So if there's a bug, it appears to be in localizedCaseInsensitiveCompare:
  • On May 5, 2012, at 14:07 , Martin Wierschin wrote:

    > For example, consider the following sorted list of strings:
    >
    > "aaa" < "laso" < "lasso" < "zzz"
    >
    > I would expect that any new string I want to insert into this list would compare so that it has just one point of insertion. But this is not true if I test "laßt", where the German Eszett "ß" character sounds/behaves like "ss":
    >
    > (gdb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"aaa"]
    > $16 = 1
    > (gdb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"laso"]
    > $17 = -1
    > (gdb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"lasso"]
    > $18 = 1
    > (gdb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"zzz"]
    > $19 = -1
    >
    > So, when using a binary search, I get different answers depending on the other strings in the list!

    I think the answer lies in the current language in the locale.

    "ß" within a string probably compares equal to "ss" at the corresponding position, independently of the language. (This makes sense, I think.) Therefore "laßt" > "lasso" always.

    However, when the second word doesn't have "ss" in corresponding position, then the order is determined by pure character collating sequence for the language. In your case (which I'm guessing is English), 'ß' < 's'. In Markus's case (which I'm guessing is German), 'ß' > 's'.

    (Or, it might be the other way around. The character collating sequence might depend on the locale rather than the language. In that case, I'd guess your locale was US even if your language was German, while Markus has German in the German locale.)

    If I'm right, then sort order and search order are different, and you can't expect to use a binary search here. (Not unless you originally used pair-wise 'localizedCaseInsensitiveCompare' to manually sort the list of strings.)
  • > Are you sure your debugger isn't lying to you?

    The code behaves the same in the built application.

    > Maybe you have category on NSString that does that?

    No, but I also tried the CFStringCompareWithOptionsAndLocale with the same results. I suppose that could call through to an NSString override, but unlikely, and in this case I don't have such an override.

    > Tested on Mac OS X 10.6.8.

    I tested on both 10.6.8 and 10.7.3, with the same results.

    My current/primary locale identifier is "en_US", though I do have German as a secondary preferred language in my system preferences.

    ~Martin
  • > If I'm right, then sort order and search order are different, and you can't expect to use a binary search here. (Not unless you originally used pair-wise 'localizedCaseInsensitiveCompare' to manually sort the list of strings.)

    All comparisons, including those used for the original sorting, were done using the same comparison method/locale. And the sort order and search order need to be same, otherwise any algorithms that rely on the transitive nature of comparisons (eg: quicksort, binary search, etc) will fail.

    ~Martin
  • On May 5, 2012, at 3:43 PM, Martin Wierschin <martin...> wrote:

    >> If I'm right, then sort order and search order are different, and you can't expect to use a binary search here. (Not unless you originally used pair-wise 'localizedCaseInsensitiveCompare' to manually sort the list of strings.)
    >
    > All comparisons, including those used for the original sorting, were done using the same comparison method/locale. And the sort order and search order need to be same, otherwise any algorithms that rely on the transitive nature of comparisons (eg: quicksort, binary search, etc) will fail.

    If I understand Quincey correctly, that's exactly what he's saying: the semantics of localizedCaseInsensitiveCompare: might be such that it is not appropriate for such algorithms.

    i18n is tricky. It might be worth consulting the Unicode docs on collation.

    --Kyle Sluder
  • On May 5, 2012, at 3:18 PM, Quincey Morris wrote:

    > "ß" within a string probably compares equal to "ss" at the corresponding position, independently of the language. (This makes sense, I think.) Therefore "laßt" > "lasso" always.

    I don’t know about this specific case, but these rules definitely vary by locale — there are cases where two languages use the same letter but disagree about how it sorts. (For example, the rules for sorting “LL” in Spanish are not the same as in English.)

    > However, when the second word doesn't have "ss" in corresponding position, then the order is determined by pure character collating sequence for the language. In your case (which I'm guessing is English), 'ß' < 's'. In Markus's case (which I'm guessing is German), 'ß' > 's'.

    I think this must be a bug in the collation implementation for the German locale. To be useful for sorting and searching, a comparison function *has* to obey transitivity, and this example is breaking it.

    —Jens
  • On May 5, 2012, at 15:43 , Martin Wierschin wrote:

    > All comparisons, including those used for the original sorting, were done using the same comparison method/locale. And the sort order and search order need to be same, otherwise any algorithms that rely on the transitive nature of comparisons (eg: quicksort, binary search, etc) will fail.

    Well, I guess you demonstrated that your particular comparison wasn't transitive.

    FWIW, I just tried it in the debugger on my US-locale 10.7.3, and I get:

    (lldb) p (NSComparisonResult)[@"laßt" localizedCaseInsensitiveCompare:@"laso"]
    (NSComparisonResult) $11 = 1

    It's starting to look like a bug in 10.6.8, except that I have no idea why you & I got different answers in 10.7.3.
  • On Sat, May 5, 2012 at 6:53 PM, Jens Alfke <jens...> wrote:
    >
    > On May 5, 2012, at 3:18 PM, Quincey Morris wrote:
    >
    >> "ß" within a string probably compares equal to "ss" at the corresponding position, independently of the language. (This makes sense, I think.) Therefore "laßt" > "lasso" always.
    >
    > I don’t know about this specific case, but these rules definitely vary by locale — there are cases where two languages use the same letter but disagree about how it sorts. (For example, the rules for sorting “LL” in Spanish are not the same as in English.)

    I don't know a whole lot about Cocoa support for locales. Is
    stringByFoldingWithOptions:locale: similar to strxfrm_l? Because if so
    I would feel reasonable comfortable doing a binary search on strings
    ordered after being transformed by this function.

    But they wouldn't be suitable for display. So you'd have to store the
    original string somewhere.
  • On May 5, 2012, at 4:51 PM, Kyle Sluder wrote:

    > If I understand Quincey correctly, that's exactly what he's saying: the semantics of localizedCaseInsensitiveCompare: might be such that it is not appropriate for such algorithms.

    But that doesn’t make sense, because the main purpose of -compare: methods is to use for sorting, for methods like -sortUsingSelector: or for NSSortDescriptor. And using -localizedCaseInsensitiveCompare: is pretty common for lists displayed in a UI. So I maintain this is a bug.

    —Jens
  • On May 5, 2012, at 19:45 , Jens Alfke wrote:

    > On May 5, 2012, at 4:51 PM, Kyle Sluder wrote:
    >
    >> If I understand Quincey correctly, that's exactly what he's saying: the semantics of localizedCaseInsensitiveCompare: might be such that it is not appropriate for such algorithms.
    >
    > But that doesn’t make sense, because the main purpose of -compare: methods is to use for sorting, for methods like -sortUsingSelector: or for NSSortDescriptor. And using -localizedCaseInsensitiveCompare: is pretty common for lists displayed in a UI. So I maintain this is a bug.

    It could easily be a bug.

    However, even in pure ASCII strings, 'localizedCaseInsensitiveCompare:' doesn't support sorting via (say) a binary search. Strings such as "a" and "A" are *equal* according to this method. Binary search doesn't work if sorted elements are equal according to the comparison function.

    The kind of comparison function needed for what we think of as case-insensitive-sorted lists of strings needs a function where case isn't ignored, but rather where the collating sequence puts case-insensitive-equal strings next to each other with a definite order.

    The NSString option that apparently causes this is 'NSForcedOrderingSearch', but the documentation doesn't really say whether this option is assumed by 'localizedCaseInsensitiveCompare:' or not. (The documentation for 'caseInsensitiveCompare:' implies not.)

    When we throw a 'ß' into the mix, things just get harder. My understanding is that, in German, the uppercase of "laßt" is "LASST". So, "laßt" and "lasst" should be equal according to 'localizedCaseInsensitiveCompare:' in a German locale, shouldn't they? But they aren't equal on my US-locale system 10.7.3 -- I tried it.

    So even if we get the right compare options, 'localizedCaseInsensitiveCompare:' may not be viable for strings *not* in the language of the current locale.

    Note that all of this is prior even to the question of transitivity. If 'localizedCaseInsensitiveCompare:' isn't suitable for sorting lists of mixed-language strings "case-insensitively"**, then delving into its transitivity problems seems beside the point.

    ** Incidentally, it's no help arguing that "laßt" in this case isn't German but just some string with an ess-zett. Case-(in)sensitivity is, in general, a language-dependent concept.
  • On 5/5/12 11:56 PM, Ken Thomases wrote:
    > On May 5, 2012, at 4:46 PM, Markus Spoettl wrote:
    > Markus, how did you set the locale?

    I used the system preferences, setting it to de_AT, restarting Xcode after that
    and redoing the test.

    Regards
    Markus
    --
    __________________________________________
    Markus Spoettl
  • On 5/6/12 12:18 AM, Quincey Morris wrote:
    > However, when the second word doesn't have "ss" in corresponding position,
    > then the order is determined by pure character collating sequence for the
    > language. In your case (which I'm guessing is English), 'ß'<  's'. In
    > Markus's case (which I'm guessing is German), 'ß'>  's'.

    I tried both en_US and de_AT. Both yield the same result for me under 10.7.3.

    I think the whole thing must have to do with the fact that ß doesn't have an
    upper-case version and the compare is case insensitive. I vaguely remember being
    hit by the same thing once, though I can't remember when it was and I'm pretty
    sure a system update fixed it. It must have been in the early 10.5 or 10.6 releases.

    Regards
    Markus
    --
    __________________________________________
    Markus Spoettl
  • On May 5, 2012, at 10:06 PM, Quincey Morris wrote:

    > However, even in pure ASCII strings, 'localizedCaseInsensitiveCompare:' doesn't support sorting via (say) a binary search. Strings such as "a" and "A" are *equal* according to this method. Binary search doesn't work if sorted elements are equal according to the comparison function.

    I disagree. Binary search of a sorted array works fine if there are different keys that sort as equal; it’s just that it’ll find *one* of the equal keys, and you’d then have to look at its neighbors to find all the others. If you mean tree-based searching, that works fine too; for example, HFS+ stores the disk catalog as a b-tree and of course uses a case-insensitive comparison by default. (OK, HFS+ doesn’t allow multiple items with equivalent keys [paths], but there are b-tree storage libraries like Berkeley DB that do.)

    > The NSString option that apparently causes this is 'NSForcedOrderingSearch', but the documentation doesn't really say whether this option is assumed by 'localizedCaseInsensitiveCompare:' or not. (The documentation for 'caseInsensitiveCompare:' implies not.)

    The docs for NSForcedOrderingSearch read: "Comparisons are forced to return either NSOrderedAscending or NSOrderedDescending if the strings are equivalent but not strictly equal.” In other words, it only returns 0 if the strings are identical This is definitely not true of -caseInsensitiveCompare, localized or not.

    In other words, if you want a stable case-insensitive sort (where equivalent strings always appear in a certain fixed order) you can’t just use -caseInsensitiveCompare:; you have to use one of the -compare: methods that takes options, and enable NSForcedOrderingSearch.

    —Jens
  • >>> If I'm right, then sort order and search order are different, and you can't expect to use a binary search here. (Not unless you originally used pair-wise 'localizedCaseInsensitiveCompare' to manually sort the list of strings.)
    >>
    >> All comparisons, including those used for the original sorting, were done using the same comparison method/locale. And the sort order and search order need to be same, otherwise any algorithms that rely on the transitive nature of comparisons (eg: quicksort, binary search, etc) will fail.
    >
    > If I understand Quincey correctly, that's exactly what he's saying: the semantics of localizedCaseInsensitiveCompare: might be such that it is not appropriate for such algorithms.

    As Jens mentioned, that doesn't make any sense. What good is a localized comparison method if not for sorting a list for display? I suppose you could be implying that Cocoa's sorting methods aren't optimized to assume comparisons are transitive (or special cases problematic strings/comparisons), but that seems quite unlikely.

    Actually, it's possible to test indirectly. I bet we could construct a set of strings that sorts differently depending on the input order. Ah yes:

    - (void) testSorting
    {
    NSArray* words = [NSArray arrayWithObjects:@"laso", @"lassos", @"las", @"moo", @"lasso", @"aaa", @"lassoMore", @"le", @"mum", nil];
    NSString* problemWord = @"laßt";

    SEL sortSelector = @selector(localizedCaseInsensitiveCompare:);
    NSArray* sorted = [[words arrayByAddingObject:problemWord] sortedArrayUsingSelector:sortSelector];

    // words should sort consistently, no matter where the problem word is inserted
    NSUInteger count = [words count];
    for( NSUInteger insertIndex = 0; insertIndex < count; insertIndex++ ) {
      NSMutableArray* testSort = [[words mutableCopy] autorelease];
      [testSort insertObject:problemWord atIndex:insertIndex];
      [testSort sortUsingSelector:sortSelector];

      STAssertTrue( [testSort isEqual:sorted], @"sorted array should always be the same" );
    }
    }

    That fails 5 of out the 9 loop iterations.

    Well, I'm certainly very happy that Apple implements all the Unicode collation rules for us for various locales! But I think this must be classified as a bug.

    ~Martin
  • > So even if we get the right compare options, 'localizedCaseInsensitiveCompare:' may not be viable for strings *not* in the language of the current locale.

    That seems extremely unlikely and would be short-sighted of Apple, considering multilingual users and international information exchange.

    ~Martin
  • On May 7, 2012, at 13:35 , Martin Wierschin wrote:

    > As Jens mentioned, that doesn't make any sense. What good is a localized comparison method if not for sorting a list for display? I suppose you could be implying that Cocoa's sorting methods aren't optimized to assume comparisons are transitive (or special cases problematic strings/comparisons), but that seems quite unlikely.

    You're both sliding right over the specifics of what I said:

    1. You cannot reasonably use 'localizedCaseInsensitiveCompare:' *alone* to sort and later insert strings into a list of strings that is "case insensitive" in the sense you originally seemed to mean. Specifically, you cannot determine an order for *different* strings that compare *equal* using this method alone (e.g. "a" and "A").

    I never said nor intended to say that localized comparison methods *generally* are useless. Just this one, for this particular purpose. The correct method for this particular purpose seems to be 'compare: … options: NSCaseInsensitiveSearch | NSForcedOrderingSearch locale: nil'. Note that this is *not* a case-insensitive comparison -- "a" and "A" are *unequal* in this case.

    Another possible alternative is 'localizedStandardCompare:', although there's no API contract here about what strings are equal.

    This point has nothing to do with the language or character set used by the strings.

    2. Case is a *language-dependent* concept. There's *no* answer to the question "Are these two strings equal ignoring case?" unless you know the language of *each* of the strings.

    NSString has no comparison function that lets you specify the locale -- and hence language -- of each string separately. At best, the localized comparison methods let you specify the locale assumed for *both* strings.

    This doesn't necessarily mean you can't case-insensitively sort a list of mixed-language strings. It does necessarily mean you can't case-insensitively sort a list of mixed-language strings *using NSString comparison methods only*.

    The best you can do with NSString *only* is to assume that words like "laßt" are just mis-spelled English words (if your locale is us_EN). You can't then expect them to sort like they would in their actual language, though of course you can expect transitivity in the comparison methods.

    3. You *did* demonstrate (and 2 other people confirmed) a case-insensitive transitivity failure for us_EN in 10.6.8. Markus found no transitivity failure for de_AT; I found no transitivity failure for us_EN in 10.7.3, though you reported you did; no one has reported a transitivity failure other than a case-insensitive comparison involving "ß".

    There does seem to be a bug here, but the exact set of conditions isn't clear.

    Maybe give 'localizedStandardCompare:' a try?
  • >> As Jens mentioned, that doesn't make any sense. What good is a localized comparison method if not for sorting a list for display? I suppose you could be implying that Cocoa's sorting methods aren't optimized to assume comparisons are transitive (or special cases problematic strings/comparisons), but that seems quite unlikely.
    >
    > You're both sliding right over the specifics of what I said:

    >
    > 1. You cannot reasonably use 'localizedCaseInsensitiveCompare:' *alone* to sort and later insert strings into a list of strings that is "case insensitive" in the sense you originally seemed to mean.

    I'm sliding over the details of what you said because they're irrelevant for my purposes. All I need is a consistent/transitive comparison, which I think is a reasonable expectation. Again, jumping straight to the point:

    > The best you can do with NSString *only* is to assume that words like "laßt" are just mis-spelled English words (if your locale is us_EN). You can't then expect them to sort like they would in their actual language, though of course you can expect transitivity in the comparison methods.

    That's all I expect. I don't care where "laßt" is placed in the sort order for "en_US" (or any other locale), just that it has a single consistent location (so long as the same locale is always used for all comparisons).

    > 3. You *did* demonstrate (and 2 other people confirmed) a case-insensitive transitivity failure for us_EN in 10.6.8. Markus found no transitivity failure for de_AT; I found no transitivity failure for us_EN in 10.7.3, though you reported you did;

    That's correct- I reproduced that the comparison is not transitive for some strings with an eszett character on 10.6.8 and 10.7.3 with the "en_US" locale. I have since noticed that the project's base and deployment SDK are set to 10.6, so perhaps the bug only manifests because the frameworks are providing compatibility.

    > Maybe give 'localizedStandardCompare:' a try?

    Thank you for the idea, but that won't be suitable for me because I need to be able to specify the locale.

    Well, I'll try bumping the base SDK to 10.7. If I can still reproduce the issue after that, I'll file a radar.

    ~Martin
  • On May 7, 2012, at 16:07 , Martin Wierschin wrote:

    > I'm sliding over the details of what you said because they're irrelevant for my purposes.

    OK, but I'd *really* appreciate it if you'd avoid calling my replies senseless:

    >> On May 7, 2012, at 13:35 , Martin Wierschin wrote:
    >>
    >>> As Jens mentioned, that doesn't make any sense.

    On May 7, 2012, at 16:07 , Martin Wierschin wrote:

    >> Maybe give 'localizedStandardCompare:' a try?
    >
    > Thank you for the idea, but that won't be suitable for me because I need to be able to specify the locale.

    And you were going to specify the locale for 'localizedCaseInsensitiveCompare:' how?
  • >> I'm sliding over the details of what you said because they're irrelevant for my purposes.
    >
    > OK, but I'd *really* appreciate it if you'd avoid calling my replies senseless:

    I never said your replies are senseless. The sequence was:

    >> On 2012.05.05, at 4:51 PM, Kyle Sluder wrote:
    >> If I understand Quincey correctly, that's exactly what he's saying: the semantics of localizedCaseInsensitiveCompare: might be such that it is not appropriate for such algorithms.
    >
    > On 2012.05.07, at 1:35 PM, Martin Wierschin wrote:
    > As Jens mentioned, that doesn't make any sense.

    My quick dismissal that it "doesn't any sense" was about Kyle's suggestion that perhaps localizedCaseInsensitiveCompare isn't suitable for algorithms which require a transitive comparison operation.

    Anyways, I certainly didn't mean to offend you- if I did, I apologize.

    >>> Maybe give 'localizedStandardCompare:' a try?
    >>
    >> Thank you for the idea, but that won't be suitable for me because I need to be able to specify the locale.
    >
    > And you were going to specify the locale for 'localizedCaseInsensitiveCompare:' how?

    I'm not actually using that method, it's just a nice compact name for the mailing list, and has the same errant behavior as -[NSString compare:options:range:locale:] and CFStringCompareWithOptionsAndLocale.

    ~Martin
previous month may 2012 next month
MTWTFSS
  1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31      
Go to today