Unicode in constant @"" NSStrings

  • This message is mostly for edification and mailing-list / searching
    posterity.  Hopefully others will find it useful.

    There has been support for Unicode in constant @"" NSStrings since
    Xcode 3.0, but it wasn't a widely known or well-documented feature
    (at least, that's my impression).  Prior to Xcode 3.0 you were limited
    to 7-bit ASCII characters only, and creating Unicode strings required
    a bit of effort: the usual way was to create a UTF-8 encoded C string,
    and then create an NSString from that, for example [NSString
    stringWithUTF8String:"\342\202\254 \303\237"].  I filed a bug
    (#5799172) to have the documentation updated to reflect the new
    functionality.

    Just got a note that the bug was being closed because it has been
    documented in the latest round of documentation updates.  This feature
    is now 'officially documented', which is great news for anyone who has
    to deal with Unicode strings in their source files.

    http://developer.apple.com/documentation/Cocoa/Conceptual/ObjectiveC/Articles/chapter_950_section_5.html#//apple_ref/doc/uid/TP30001163-CH3-TPXREF104


    Specifically, the section regarding @"string", which I'll copy and paste here:

    ---
    Defines a constant NSString object in the current module and
    initializes the object with the specified string.

    On Mac OS X v10.4 and earlier, the string must be 7-bit ASCII-encoded.
    On Mac OS X v10.5 and later (with Xcode 3.0 and later), you can also
    use UTF-16 encoded strings. (The runtime from Mac OS X v10.2 and later
    supports UTF-16 encoded strings, so if you use Mac OS X v10.5 to
    compile an application for Mac OS X v10.2 and later, you can use
    UTF-16 encoded strings.)
    ---

    First, a warning.  The rest of this message should not be taken as an
    'authoritative reference' on the topic as many of the statements below
    have been gleaned from a combination of reading the C99 standard, the
    GCC sources, educated guesses, and occasionally outright speculation.
    You have been warned.

    The first sentence from
    http://developer.apple.com/documentation/DeveloperTools/gcc-4.2.1/cpp/Character-sets.html#Character-sets

    sums it up nicely:

    "Source code character set processing in C and related languages is
    rather complicated."

    This is an understatement.

    GCC uses UTF-8 as its default 'source character set'.  As a general
    rule of thumb, things will probably work out the way you expect them
    to if the source code that GCC is given to compile is encoded as
    UTF-8 (UTF-8 is a superset of 7-bit ASCII, i.e. all 7-bit ASCII is
    valid UTF-8).

    C99 also defines something called the 'execution character set'.  This
    is the character set that the executing program will use and the
    character set that string literals will be converted to for their
    binary representations.  For example, a source file in EBCDIC and
    using UTF-8 as the execution character set will perform the following
    conversion:

    EBCDIC bytes for "HIJK": 0xC8 0xC9 0xD1 0xD2
    UTF-8 bytes for "HIJK":  0x48 0x49 0x4A 0x4B  // Bytes that end up in the object file.

    It gets more complicated from here.  Since Macs are already Unicode
    savvy and UTF-8 (or ASCII) is the default source character set, I'm
    just going to skip over these details.  I mention this because, for me
    at least, I have a mental model that expects a string literal in
    source to always convert to the same sequence of bytes no matter what.
    That's actually not the case under C99 and it might catch you off
    guard.  Thankfully, the GCC default of UTF-8 is likely to produce the
    results you're expecting.  One gotcha is that Mac applications, such
    as Xcode, will use MacRoman as the default character set.  If you're
    using Xcode, you may have to change the source character set encoding
    to UTF-8 to get things to work seamlessly.  You can change this in
    Xcode by doing a 'Get Info' on a source file, then choosing the
    'General' tab.  Roughly in the middle there should be a field for the
    file encoding.  If you change the file's encoding, I'm pretty sure
    Xcode will ask if you want to convert the contents of the file to the
    new encoding, but it's been a while since I've had to do that.  You can
    set the default file encoding in the Text Editing section of the
    Xcode preferences.

    Now, back to Cocoa and Unicode in @"" strings.  The documentation, at
    least as I read it, is slightly misleading.  It is possible to have
    constant Unicode @"" NSStrings, but they do not necessarily have to be
    encoded as UTF-16 in your source code for you to take advantage of
    them.

    This part is speculation, but it's fairly reasonable.  What I suspect
    happens under the hood is that the GCC code for @"" strings goes
    something like this:

    o Escape sequences, such as \n, \u, and \U, are converted to their
      respective byte sequences in the source character set.

    o The GCC ObjC @"" 'function' examines the string literal:
      - If the string literal contains only 7-bit ASCII (or possibly
        MacRoman) characters, then the old / normal constant string
        object creation process is executed using the simple 8-bit
        representation of the string.  The bytes for the string are
        stored in the __cstring object section.
      - If the string literal contains characters outside 7-bit ASCII
        (i.e. Unicode characters), then the string is converted into
        UTF-16 using the target architecture's endianness.  The UTF-16
        bytes are stored in the __ustring object section.

    Using UTF-8 as the source file encoding (again, this is the default
    for GCC, but might not be for your source files in Xcode, which I
    believe used to default to MacRoman), the following code 'just works':

      // Stored in the source code as the UTF-8 sequence
      // 30 3a 20 e2 82 ac 20 c3 9f.  NSString at execution time: '0: € ß'.
      NSString *unicodeString0 = @"0: € ß";

      // Using C99 \u style escapes.  Requires -std=gnu99 (or equivalent) or
      // the compiler will issue a warning.  NSString at execution time: '1: € ß'.
      NSString *unicodeString1 = @"1: \u20ac \u00df";

      // Octal escaped UTF-8 sequence.  NSString at execution time: '2: € ß'.
      NSString *unicodeString2 = @"2: \342\202\254 \303\237";

      // 7-bit ASCII only, not converted to UTF-16, remains 8 bit.
      // NSString at execution time: '3: $ ss'.
      NSString *unicodeString3 = @"3: $ ss";

    For the curious, the above also works when the source code is
    converted to UTF-16 and gcc is called with '-finput-charset=UTF-16
    -std=gnu99'.  Even the octal escaped UTF-8 sequence is correctly
    'interpreted' and produces the 'correct' results.  A small hitch I
    encountered when I gave this a try was that gcc decided that all
    source files were encoded as UTF-16, not just the source file in
    question.  This is obviously a problem with #include'd / #import'ed
    header files, which are almost certainly in ASCII / UTF-8.  Maybe
    there's a way to correct that, but it was just a quick test to see
    what would happen.

    -----

    The bottom line is that if you're using UTF-8 as your source code
    character set, you can now just copy and paste Unicode text straight
    into your constant @"" strings.  The compiler will automagically pick
    the best encoding for the characters in the string.  This of course
    assumes that you're using Xcode 3.0+ / gcc 4+ on 10.5+ to compile said
    source code.  However, the resulting object files / executables with
    the new Unicode string functionality will work all the way back to
    10.2; it's not a 10.5+ only feature.  And there are no UTF-16 endian
    issues to worry about since the compiler builds the UTF-16 strings
    separately for each architecture targeted.

    If you have Safari 3.1+ and Javascript enabled, you can take a look at
    some related documentation I recently wrote:
    http://regexkit.sourceforge.net/RegexKitLite/index.html#RegexKitLiteCookbook

    Again, if you're running a supported version of Safari (3.1+) and
    have Javascript enabled, then after the introductory paragraphs there
    should be a section titled "Enhanced Copy To Clipboard Functionality".
    Otherwise, if the browser you're using isn't supported, the "Enhanced
    Copy To Clipboard Functionality" is disabled and that section remains
    hidden.  It covers some of these same points, and it also includes
    functionality to create NSStrings containing Unicode and copy them
    to the clipboard, ready to paste into your source code.  It also
    deals with escaping '\' backslashes and other problematic C string
    literal characters, since its primary purpose is to simplify the
    process of correctly escaping a regular expression (which makes heavy
    use of the '\' character) for use in an NSString / RegexKitLite.