FROM : Deborah Goldsmith
DATE : Sat Apr 07 03:30:46 2007
On Apr 4, 2007, at 9:42 AM, Douglas Davidson wrote:
> On Apr 4, 2007, at 8:05 AM, Ewan Delanoy wrote:
>
>> - When an NSString or NSAttributedString (let's call it s) appears
>> on-screen as, say, "(a with tilde)(other characters ...)", is it
>> guaranteed that [s characterAtIndex: 0] will be "a with tilde", and
>> not "a" (with "tilde" as a second character)?
>>
>> - If this is not the case, I need a more accurate version of
>> "characterAtIndex:". Is this already built-in?
>
> Yes. The characterAtIndex: method should be avoided wherever
> possible; with Unicode strings, examining a single character
> usually is not sufficient. Instead, use methods like
> compare:options:range:, rangeOfString:options:range:, and
> rangeOfCharacterFromSet:options:range:, which will give you the
> Unicode-conformant operations you are looking for, with a wide
> variety of options.
>
> If you need to extract substrings, be sure to use
> rangeOfComposedCharacterSequenceAtIndex: to make sure that you are
> not dividing a composed character sequence. If you wish to replace
> substrings in a mutable string, try
> replaceOccurrencesOfString:withString:options:range:.
>
> NSString does have methods to precompose or decompose an entire
> string, but these methods are really useful only in special
> circumstances--for example, when you are dealing with existing code
> that for some reason requires one form or the other. Bear in mind
> that most combinations of base characters and combining marks do
> not have precomposed forms. In general, you are better off using
> the methods mentioned above for Unicode-conformant comparisons.
In addition to what Doug says, bear in mind that even precomposed
Unicode cannot safely be accessed one "unichar" at a time. First, there
may still be surrogate pairs (two consecutive UTF-16 code units used to
represent a character outside the Basic Multilingual Plane, that is,
above U+FFFF). Second, some characters cannot be represented by a
single Unicode code point even in the canonical precomposed form of
Unicode (NFC, Normalization Form C), because Unicode does not contain a
precomposed version of the character in question. Finally, even if no
individual character requires multiple unichars, some languages have
linguistic units consisting of multiple characters that shouldn't be
broken apart.
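
To make the first two points concrete, here is a rough sketch. It is not
Cocoa code: it uses Python's standard unicodedata module purely to count
code points and UTF-16 code units, and the helper name utf16_units and the
three sample strings are my own choices for illustration. It assumes that
a UTF-16 code unit corresponds to what NSString calls a unichar.

```python
import unicodedata


def utf16_units(s):
    """Count the UTF-16 code units needed to encode s (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


# "a" + COMBINING TILDE has a precomposed form, so NFC yields one code point.
a_tilde = unicodedata.normalize("NFC", "a\u0303")

# "q" + COMBINING TILDE has no precomposed form, so NFC leaves two code points.
q_tilde = unicodedata.normalize("NFC", "q\u0303")

# MUSICAL SYMBOL G CLEF lies above U+FFFF: one code point, two UTF-16 units.
g_clef = "\U0001D11E"

for label, text in (("a+tilde", a_tilde), ("q+tilde", q_tilde), ("G clef", g_clef)):
    print(label, "code points:", len(text), "UTF-16 units:", utf16_units(text))
```

Only the first string fits in a single code unit. The other two each need
two, for different reasons (a missing precomposed form in one case, a
surrogate pair in the other), and in both cases reading one unit at a time
would split something a user sees as a single character.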
Deborah Goldsmith
Internationalization, Unicode liaison
Apple Inc.
<email_removed>