characters in cocoa

  • This is a question about the treatment of characters in Cocoa.
    Because my program will do a lot with individual characters, both as
    a basic type and encapsulated in objects, I am anxious to get it
    right from the start. I feel I should refrain in Cocoa as much as
    possible from working with plain C chars, especially in light of
    the (unsigned char) versus (signed char) dilemma. Is that correct?

    The NSString class works with characters in Unicode, as described in
    the "String Programming Guide for Cocoa". Its method characterAtIndex:
    returns its contents as type unichar, which is typedef'ed in
    NSString.h as unsigned short. The NSNumber class has methods
    like numberWithUnsignedShort: but no more descriptive ones for
    unichar characters, such as numberWithUnichar:, although
    the use of Unicode is otherwise so clearly present in NSString. Does
    someone know why this is missing?

    In fact I would prefer handling character objects with something like
    NSCharacter (for example as a subclass of NSNumber), but there does
    not seem to be one among the Foundation classes. I do not feel up to
    the task of subclassing in a class cluster. Maybe someone has already
    made one as a subclass of NSNumber?

    I would be obliged to hear from the experts what is considered the
    most appropriate way to handle characters in Cocoa programming.
    Thanks in advance.

    Hans van der Meer
  • On Sep 7, 2007, at 5:17 PM, Hans van der Meer wrote:
    > This is a question about the treatment of characters in Cocoa.
    > Because my program will do a lot with individual characters, both as
    > basic type as well as encapsulated in objects, I am anxious to get
    > it right from the start. I feel I should refrain in Cocoa as much as
    > possible from working with plain C's char's, especially in the light
    > of the (unsigned char) - (signed char) dilemma. Is that correct?

    My apps must work with CJK characters all the time. My experience is
    that it's pretty OK to work with UTF-8 as char/char* and UTF-16 as
    UniChar or just a plain unsigned short (any unsigned 16-bit word, that
    is). I don't really see a need to abstract much beyond that, and for
    that same reason I am fine with not having
    [NSNumber numberWithUniChar:<a unichar>]. +numberWithUnsignedShort:
    suffices.

    The thing is that NSString (as far as I know; I might be ignorant on
    this) doesn't work at the code point level, so I always have to compute
    the code point length (the real string length, not the one returned by
    -length) and do the surrogate exercise on my own (I now have some stock
    libraries for that, though). An NSCharacter would only be useful if it
    operated at the code point level.

    So in short I really treat NSString like a UTF-16 vector, use
    UniChar/unsigned short all the time, and do the character-level
    operations on my own. But yeah, if Cocoa does have code-point-level
    string operations, so much the better...
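
    The "surrogate exercise" described above can be sketched in plain C
    (which also compiles as Objective-C). This is only an illustration
    under stated assumptions: utf16_code_point_count is a hypothetical
    name, not a Foundation API, and uint16_t stands in for unichar. An
    unpaired surrogate is counted as one code point here, which is a
    choice of this sketch rather than anything the thread prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Cocoa API): counts Unicode code points in a
   buffer of UTF-16 code units. A valid high/low surrogate pair counts once;
   any other unit, including an unpaired surrogate, counts as one. */
size_t utf16_code_point_count(const uint16_t *units, size_t length)
{
    size_t count = 0;
    size_t i = 0;

    while (i < length) {
        int high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        int low_follows = i + 1 < length &&
                          units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF;

        i += (high && low_follows) ? 2 : 1;  /* skip both halves of a pair */
        count++;
    }
    return count;
}
```

    For a buffer holding 'A', the pair D801 DC00, and 'B', the unit count
    is 4 while this function reports 3.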

    d.
  • --- Hans van der Meer <hansm...> wrote:

    > This is a question about the treatment of characters
    > in Cocoa.
    > Because my program will do a lot with individual
    > characters, both as
    > basic type as well as encapsulated in objects, I am
    > anxious to get it
    > right from the start. I feel I should refrain in
    > Cocoa as much as
    > possible from working with plain C's char's,
    > especially in the light
    > of the (unsigned char) - (signed char) dilemma. Is
    > that correct?

    I don't feel the signedness of characters is a
    particularly big roadblock in general. A bigger issue
    is ensuring that you deal correctly with Unicode --
    one C char in UTF-8 encoding doesn't necessarily
    correspond to one glyph.
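
    To illustrate the point that one C char in UTF-8 is not one
    character, here is a minimal sketch in plain C. It assumes valid,
    NUL-terminated UTF-8, and utf8_code_point_count is a hypothetical
    name. It counts code points only; a visible glyph may still consist
    of several code points.

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated UTF-8 string
   by skipping continuation bytes (bit pattern 10xxxxxx).
   Assumes the input is valid UTF-8. */
size_t utf8_code_point_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Cast to unsigned char so the result does not depend on whether
           plain char is signed on this platform. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

    The cast is the only place where the signedness of char matters in
    this sketch.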

    > The NSNumber class
    > then has methods
    > like numberWithUnsignedShort but not more
    > descriptive ones for
    > unichar characters like numberWithUnichar; although
    > otherwise the
    > use of Unicode is so clearly present in NSString.
    > Does someone know
    > why this is missing?

    Because there's not generally a need for it? If you
    really just need to hold a unichar in some non-string
    object, NSValue would do the trick. Normally, though,
    NSString is the class for storing characters.

    I think the question is: What is it you're trying to
    accomplish here? You're telling us *how* you want to
    accomplish it, but it seems to be a little different
    from how things are normally done in Cocoa. If you
    tell us *what* you want to do, we can suggest
    Cocoa-friendly ways of doing it.

    Cheers,
    Chuck


  • On Sep 7, 2007, at 2:17 AM, Hans van der Meer wrote:

    > This is a question about the treatment of characters in Cocoa.
    > Because my program will do a lot with individual characters, both
    > as basic type as well as encapsulated in objects, I am anxious to
    > get it right from the start. I feel I should refrain in Cocoa as
    > much as possible from working with plain C's char's, especially in
    > the light of the (unsigned char) - (signed char) dilemma. Is that
    > correct?
    >
    > The NSString class works with characters in Unicode, as is told in
    > the "String Programming Guide for Cocoa". Its method
    > characterAtIndex retrieves its contents as type unichar and this is
    > typedef'ed in NSString.h as unsigned short. The NSNumber class then
    > has methods like numberWithUnsignedShort but not more descriptive
    > ones for unichar characters like numberWithUnichar; although
    > otherwise the use of Unicode is so clearly present in NSString.
    > Does someone know why this is missing?
    >
    > In fact I would prefer handling character objects with something
    > like NSCharacter (for example as a subclass of NSNumber), but there
    > seems to be not one among the Foundation classes. I feel myself not
    > up to the task of subclassing in a class cluster. Maybe someone has
    > already made one as a subclass of NSNumber?
    >
    > I would be obliged to hear from the experts what is considered the
    > most appropriate way to handle characters in Cocoa programming.
    > Thanks in advance.

    We try to discourage developers from working at the level of
    individual characters wherever possible, primarily because in Unicode
    the individual character is usually not the appropriate level at
    which to operate.  This is something that's difficult for those of us
    who were raised on char *'s to get used to, but it's important to get
    right.  In Unicode the appropriate object on which to operate for
    most semantic purposes is (at least) a character cluster, such as a
    base character and its combining marks, or a block of Hangul jamo.

    In Cocoa terms this is a range of characters in an NSString; suitable
    ranges can be obtained using such methods as
    rangeOfComposedCharacterSequenceAtIndex:.  This will also cover the
    case of surrogate pairs that arises from NSString's use of UTF-16.
    NSString/CFString supply a great variety of methods/functions that
    operate on character ranges in a Unicode-conformant fashion:  the
    rangeOfCharacterFromSet:... methods, the rangeOfString: methods, the
    compare:... methods, and so forth.  They also provide a long list of
    Unicode operations, such as casing, normalization, and other
    transformations.

    Even in apparently simple operations such as casing, the need for
    operating on more than a single character is apparent.  For example,
    in German we have ß->SS on uppercasing, going from one character to
    two; when we get to Greek, the complications increase significantly,
    and there are many other examples from less prominent languages.

    The basic recommendation for dealing with characters is to work with
    strings, and ranges in strings, and substrings, and as much as
    possible to use the NSString methods that deal with these; that lets
    the kit handle all of the difficult Unicode issues.  For those who
    need to do their own low-level processing, and who are willing to
    handle Unicode complications themselves, we provide access to UTF-16
    directly via characterAtIndex: et al., and to other representations
    with getBytes:... and related methods.

    In practice, I have found that many operations for which I had
    expected to have to use individual character operations (probably due
    to habits of thought acquired in the days of char *'s) actually could
    be done fairly simply with a little thought and a suitable
    combination of rangeOfCharacterFromSet:..., rangeOfString:...,
    compare:..., and related methods.

    Douglas Davidson
  • On 7 Sep 2007, at 21:02, <cocoa-dev-request...> wrote:

    >>
    >> I would be obliged to hear from the experts what is considered the
    >> most appropriate way to handle characters in Cocoa programming.
    >> Thanks in advance.
    >
    > We try to discourage developers from working at the level of
    > individual characters wherever possible, primarily because in Unicode
    > the individual character is usually not the appropriate level at
    > which to operate.  This is something that's difficult for those of us
    > who were raised on char *'s to get used to, but it's important to get
    > right.  In Unicode the appropriate object on which to operate for
    > most semantic purposes is (at least) a character cluster, such as a
    > base character and its combining marks, or a block of Hangul jamo.
    >
    > In Cocoa terms this is a range of characters in an NSString; suitable
    > ranges can be obtained using such methods as
    > rangeOfComposedCharacterSequenceAtIndex:.  This will also cover the
    > case of surrogate pairs that arises from NSString's use of UTF-16.
    > NSString/CFString supply a great variety of methods/functions that
    > operate on character ranges in a Unicode-conformant fashion:  the
    > rangeOfCharacterFromSet:... methods, the rangeOfString: methods, the
    > compare:... methods, and so forth.  They also provide a long list of
    > Unicode operations, such as casing, normalization, and other
    > transformations.
    >
    > Even in apparently simple operations such as casing, the need for
    > operating on more than a single character is apparent.  For example,
    > in German we have ß->SS on uppercasing, going from one character to
    > two; when we get to Greek, the complications increase significantly,
    > and there are many other examples from less prominent languages.
    >
    > The basic recommendation for dealing with characters is to work with
    > strings, and ranges in strings, and substrings, and as much as
    > possible to use the NSString methods that deal with these; that lets
    > the kit handle all of the difficult Unicode issues.  For those who
    > need to do their own low-level processing, and who are willing to
    > handle Unicode complications themselves, we provide access to UTF-16
    > directly via characterAtIndex: et al., and to other representations
    > with getBytes:... and related methods.

    This is an excellent summary.

    One might add that -[NSString length], which the documentation says
    "Returns the number of Unicode characters in the receiver.", does
    nothing of the kind: it returns the number of shorts used with
    NSUnicodeStringEncoding (aka UTF-16).
    For example: [[NSString stringWithUTF8String: "𐐀"] length] = 2 (for
    anyone whose software cannot handle Unicode, like the mail digest
    software at Apple: this is U+10400 DESERET CAPITAL LETTER LONG I),
    although the string clearly contains one character.
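
    Why the length comes out as 2 can be seen by encoding the code point
    by hand. The sketch below is plain C with a hypothetical function
    name (utf16_encode), not a Foundation call; it follows the standard
    UTF-16 surrogate arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-16 encoding of one code point into
   out[] and returns the number of code units used (1 or 2).
   Returns 0 if cp is a surrogate value or lies beyond U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    if (cp < 0x10000) {                  /* BMP: a single code unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                       /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high (leading) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trailing) surrogate */
    return 2;
}
```

    For DESERET CAPITAL LETTER LONG I (U+10400) this yields the two
    units D801 DC00, which is what -length counts.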

    And one should also note that "characterAtIndex:" does not do what
    the name indicates, but returns the short at the index in UTF-16.

    getCharacters: "Returns by reference the characters from the
    receiver." - the documentation really should mention in which
    encoding these characters will be copied.

    Maybe the documentation could be slightly improved: it is confusing
    if it says "character" when it means "unsigned short int in a
    specific (but unspecified) encoding".

    Kind regards,

    Gerriet.
  • On 10.09.2007, at 13:38, Gerriet M. Denkmann wrote:
    > One might add that -[NSString length], which the documentation says
    > "Returns the number of Unicode characters in the receiver." does
    > nothing like this, but returns the number of shorts used with
    > NSUnicodeStringEncoding (aka UTF-16).
    > For example: [[NSString stringWithUTF8String: "𐐀" ] length] = 2
    > (if someone cannot handle Unicode (like the mail digest software at
    > Apple) : this is a DESERET CAPITAL LETTER LONG I) - although the
    > string clearly contains one character.
    >
    > And one should also note that "characterAtIndex:" does not do what
    > the name indicates, but returns the short at the index in UTF-16.
    >
    > getCharacters: "Returns by reference the characters from the
    > receiver." - the documentation really should mention in which
    > encoding these characters will be copied.
    >
    > Maybe the documentation could be slightly improved: it is confusing
    > if it says "character" when it means "unsigned short int in a
    > specific (but unspecified) encoding".

      Well, it *is* a character: It is the UTF16 character that would be
    at the specified index in the normalized form, I guess...?

    Cheers,
    -- M. Uli Kusterer
    http://www.zathras.de
  • On 9/10/07, Uli Kusterer <witness.of.teachtext...> wrote:
    > Well, it *is* a character: It is the UTF16 character that would be
    > at the specified index in the normalized form, I guess...?

    Composites aside, it may also be pointing to a trailing (low)
    surrogate, which is not that meaningful if you don't have its leading
    (high) counterpart, and so characterAtIndex: can be confusing in those
    cases...
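
    One way to guard against landing on a trailing surrogate is to step
    back to the leading half before reading. A minimal sketch in plain
    C, with a hypothetical name (utf16_code_point_start) and uint16_t
    standing in for unichar:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: if index points at a low (trailing) surrogate that is
   preceded by a high (leading) surrogate, returns the index of the leading
   half; otherwise returns index unchanged. */
size_t utf16_code_point_start(const uint16_t *units, size_t length, size_t index)
{
    if (index > 0 && index < length &&
        units[index] >= 0xDC00 && units[index] <= 0xDFFF &&
        units[index - 1] >= 0xD800 && units[index - 1] <= 0xDBFF)
        return index - 1;

    return index;
}
```

    This handles surrogate pairs only; combining sequences need the
    wider range logic that NSString provides.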

    d.
  • On 10 Sep 2007, at 15:36, Uli Kusterer wrote:

    > On 10.09.2007, at 13:38, Gerriet M. Denkmann wrote:
    >> One might add that -[NSString length], which the documentation
    >> says "Returns the number of Unicode characters in the receiver."
    >> does nothing like this, but returns the number of shorts used with
    >> NSUnicodeStringEncoding (aka UTF-16).
    >> For example: [[NSString stringWithUTF8String: "𐐀" ] length] = 2
    >> (if someone cannot handle Unicode (like the mail digest software
    >> at Apple) : this is a DESERET CAPITAL LETTER LONG I) - although
    >> the string clearly contains one character.
    >>
    >> And one should also note that "characterAtIndex:" does not do what
    >> the name indicates, but returns the short at the index in UTF-16.
    >>
    >> getCharacters: "Returns by reference the characters from the
    >> receiver." - the documentation really should mention in which
    >> encoding these characters will be copied.
    >>
    >> Maybe the documentation could be slightly improved: it is
    >> confusing if it says "character" when it means "unsigned short int
    >> in a specific (but unspecified) encoding".
    >
    > Well, it *is* a character: It is the UTF16 character that would be
    > at the specified index in the normalized form, I guess...?

    Well, when I hear "character" I think of something like what the
    LayoutManager calls a "glyph": something that can be seen, as opposed
    to bits representing such a "character" according to some encoding.
    So I would say that the strings "𐐀" and "∂" both have one
    character, even if one is encoded in UTF-16 as 2 shorts and the other
    as 1 short.

    Maybe this is the reason for my confusion.

    Kind regards,

    Gerriet.
  • On 9/10/07, Uli Kusterer <witness.of.teachtext...> wrote:
    > On 10.09.2007, at 13:38, Gerriet M. Denkmann wrote:
    >> One might add that -[NSString length], which the documentation says
    >> "Returns the number of Unicode characters in the receiver." does
    >> nothing like this, but returns the number of shorts used with
    >> NSUnicodeStringEncoding (aka UTF-16).
    >> For example: [[NSString stringWithUTF8String: "𐐀" ] length] = 2
    >> (if someone cannot handle Unicode (like the mail digest software at
    >> Apple) : this is a DESERET CAPITAL LETTER LONG I) - although the
    >> string clearly contains one character.
    >>
    >> And one should also note that "characterAtIndex:" does not do what
    >> the name indicates, but returns the short at the index in UTF-16.
    >>
    >> getCharacters: "Returns by reference the characters from the
    >> receiver." - the documentation really should mention in which
    >> encoding these characters will be copied.
    >>
    >> Maybe the documentation could be slightly improved: it is confusing
    >> if it says "character" when it means "unsigned short int in a
    >> specific (but unspecified) encoding".
    >
    > Well, it *is* a character: It is the UTF16 character that would be
    > at the specified index in the normalized form, I guess...?

    Ah, but UTF-16 code units are not characters; the term "UTF-16
    character" is meaningless. For the BMP, there *is* a one-to-one
    correspondence between UTF-16 code units and Unicode code points, but
    this is not true in the general case. Outside of the BMP, it takes two
    UTF-16 code units to represent a single Unicode code point.
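
    The relationship between code units and code points described above
    can be sketched as a small decoder in plain C. The name
    utf16_decode_at is hypothetical, and returning U+FFFD for an
    unpaired surrogate is a choice made for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Cocoa API): decodes the code point that starts
   at units[index] and stores in *consumed how many code units it occupied.
   Assumes index < length. An unpaired surrogate yields U+FFFD. */
uint32_t utf16_decode_at(const uint16_t *units, size_t length,
                         size_t index, size_t *consumed)
{
    uint16_t u = units[index];
    *consumed = 1;

    if (u < 0xD800 || u > 0xDFFF)
        return u;                        /* BMP: one unit is one code point */

    if (u <= 0xDBFF && index + 1 < length) {
        uint16_t next = units[index + 1];
        if (next >= 0xDC00 && next <= 0xDFFF) {
            *consumed = 2;               /* outside the BMP: two units */
            return 0x10000 + (((uint32_t)(u - 0xD800) << 10) |
                              (uint32_t)(next - 0xDC00));
        }
    }
    return 0xFFFD;                       /* unpaired surrogate */
}
```

    Advancing index by *consumed after each call walks a buffer one
    code point at a time.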

    --
    Clark S. Cox III
    <clarkcox3...>
  • On Sep 10, 2007, at 8:21 AM, Clark Cox wrote:

    > Ah, but UTF-16 code units are not characters; the term "UTF-16
    > character" is meaningless. For the BMP, there *is* a one-to-one
    > correspondence between UTF-16 code units and Unicode code points, but
    > this is not true in the general case. Outside of the BMP, it takes two
    > UTF-16 code units to represent a single Unicode code point.

    We have this terminology problem for historical reasons;
    characterAtIndex: antedates the introduction of surrogate pairs.
    Whatever the terminology, NSStrings are conceptually UTF-16, and the
    -length et al. methods reflect that.  (This is a common practice in
    other frameworks as well.)

    Fortunately, as I mentioned, most developers should not have to worry
    about this.  If you work with ranges and substrings rather than with
    individual characters, and use the NSString methods that deal with
    ranges, they should automatically handle not only most issues with
    surrogate pairs, but also the more common cases of combining
    characters, Hangul, etc.

    Chapter 2 of the Unicode 5 book has a very good discussion of "text
    elements", which explains in great detail why it is that the elements
    that are important for most text processes are in general sequences
    of characters rather than single characters.  Single characters are
    important for the fundamental definitional purposes of the standard,
    but in practice what one wishes to deal with for text processing is a
    sequence of characters constituting a cluster or larger unit.

    Douglas Davidson