Re: characterAtIndex: method and composite characters
FROM : Deborah Goldsmith
DATE : Sat Apr 07 03:30:46 2007

On Apr 4, 2007, at 9:42 AM, Douglas Davidson wrote:
> On Apr 4, 2007, at 8:05 AM, Ewan Delanoy wrote:
>

>> - When an NSString or NSAttributedString (let's call it s) appears
>> on-screen as, say, "(a with tilde)(other characters ...)", is it
>> guaranteed that [s characterAtIndex: 0] will be "a with tilde", and
>> not "a" (with "tilde" as a second character)?
>>
>> - If this is not the case, I need a more accurate version of
>> "characterAtIndex:". Is this already built in?

>
> Yes.  The characterAtIndex: method should be avoided wherever 
> possible; with Unicode strings, examining a single character 
> usually is not sufficient.  Instead, use methods like 
> compare:options:range:, rangeOfString:options:range:, and 
> rangeOfCharacterFromSet:options:range:, which will give you the 
> Unicode-conformant operations you are looking for, with a wide 
> variety of options.
>
> If you need to extract substrings, be sure to use 
> rangeOfComposedCharacterSequenceAtIndex: to make sure that you are 
> not dividing a composed character sequence.  If you wish to replace 
> substrings in a mutable string, try 
> replaceOccurrencesOfString:withString:options:range:.
>
> NSString does have methods to precompose or decompose an entire 
> string, but these methods are really useful only in special 
> circumstances--for example, when you are dealing with existing code 
> that for some reason requires one form or the other.  Bear in mind 
> that most combinations of base characters and combining marks do 
> not have precomposed forms.  In general, you are better off using 
> the methods mentioned above for Unicode-conformant comparisons.
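
To make the idea behind rangeOfComposedCharacterSequenceAtIndex: concrete, here is a rough sketch in Python rather than Objective-C. It is not the Foundation implementation. The helper name composed_range_at is hypothetical, and it only groups a base character with the combining marks that follow it, which is a simplification of full composed character sequence rules. Note also that Python strings are indexed by code point, whereas NSString is indexed by UTF-16 code unit.

```python
import unicodedata


def is_mark(ch):
    # Unicode general categories Mn, Mc and Me are combining marks.
    return unicodedata.category(ch).startswith("M")


def composed_range_at(s, index):
    """Return (start, end) covering the base character at or before
    `index` plus every combining mark attached to it.

    Hypothetical helper: a simplified stand-in for the idea of a
    composed character sequence, not a full implementation.
    """
    start = index
    # If index points at a combining mark, walk back to its base.
    while start > 0 and is_mark(s[start]):
        start -= 1
    end = index + 1
    # Extend forward over any combining marks that follow.
    while end < len(s) and is_mark(s[end]):
        end += 1
    return start, end


# "a" followed by COMBINING TILDE (U+0303), then "bc".
text = "a\u0303bc"
first = composed_range_at(text, 0)
second = composed_range_at(text, 1)
third = composed_range_at(text, 2)
```

Slicing text with the range returned for index 0 or index 1 yields the whole "a with tilde" sequence, so a substring boundary never falls between the base letter and its mark.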


In addition to what Doug says, bear in mind that even precomposed 
Unicode cannot be accessed one "unichar" at a time. First, there may 
still be surrogate pairs (two consecutive UTF-16 code units used to 
represent code points above U+FFFF, that is, outside the Basic 
Multilingual Plane). Second, some characters cannot be represented by 
a single Unicode code point, even in the canonical precomposed form 
of Unicode (NFC, Normalization Form C), because Unicode does not 
contain a precomposed version of the character in question.
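
Both cases can be checked with a small sketch, given here in Python rather than Objective-C and using only the standard unicodedata module. The helper utf16_units is a hypothetical name, and the sample characters (a musical G clef, and "a" and "q" each followed by a combining tilde) are chosen purely for illustration.

```python
import unicodedata


def utf16_units(s):
    # Number of UTF-16 code units (what NSString calls unichars)
    # needed to encode s. Hypothetical helper for illustration.
    return len(s.encode("utf-16-le")) // 2


# Case 1: a code point above U+FFFF needs a surrogate pair.
g_clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF
g_clef_units = utf16_units(g_clef)

# Case 2: NFC composes "a" + COMBINING TILDE into one code point,
# but "q" + COMBINING TILDE has no precomposed form and stays as two.
a_tilde = unicodedata.normalize("NFC", "a\u0303")
q_tilde = unicodedata.normalize("NFC", "q\u0303")

print(g_clef_units, len(a_tilde), len(q_tilde))
```

The G clef takes two code units, the normalized "a with tilde" is a single code point (U+00E3), and the normalized "q with tilde" is still two code points, so indexing one unit at a time would split either the clef or the "q with tilde".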

Finally, even if there are no individual characters that require 
multiple unichars, some languages have linguistic units consisting 
of multiple characters that shouldn't be broken apart.

Deborah Goldsmith
Internationalization, Unicode liaison
Apple Inc.
<email_removed>