Characters in Cocoa
-
This is a question about the treatment of characters in Cocoa.
Because my program will do a lot with individual characters, both as a
basic type and encapsulated in objects, I am anxious to get it
right from the start. I feel I should refrain in Cocoa as much as
possible from working with plain C chars, especially in light
of the (unsigned char) versus (signed char) dilemma. Is that correct?
The NSString class works with characters in Unicode, as described in
the "String Programming Guide for Cocoa". Its method characterAtIndex:
returns its contents as type unichar, which is typedef'ed in
NSString.h as unsigned short. The NSNumber class has methods
like numberWithUnsignedShort: but no more descriptive ones for
unichar characters, such as numberWithUnichar:, although otherwise the
use of Unicode is so clearly present in NSString. Does anyone know
why this is missing?
In fact I would prefer handling character objects with something like
NSCharacter (for example as a subclass of NSNumber), but there seems
to be no such class among the Foundation classes. I do not feel up to
the task of subclassing in a class cluster. Maybe someone has already
made one as a subclass of NSNumber?
I would be obliged to hear from the experts what is considered the
most appropriate way to handle characters in Cocoa programming.
Thanks in advance.
Hans van der Meer -
On Sep 7, 2007, at 5:17 PM, Hans van der Meer wrote:
> This is a question about the treatment of characters in Cocoa.
> Because my program will do a lot with individual characters, both as
> basic type as well as encapsulated in objects, I am anxious to get
> it right from the start. I feel I should refrain in Cocoa as much as
> possible from working with plain C's char's, especially in the light
> of the (unsigned char) - (signed char) dilemma. Is that correct?
My apps must work with CJK characters all the time. My experience is that
it's perfectly fine to work with UTF-8 as char/char* and UTF-16 as
UniChar or just a plain unsigned short (any unsigned 16-bit word, that
is). I don't really see a need to abstract much beyond that, and for
the same reason I'm fine without [NSNumber numberWithUniChar:<a
unichar>]. +numberWithUnsignedShort: suffices.
The thing is that NSString (as far as I know; I might be ignorant on this)
doesn't work at the code point level, so I always have to compute the
code point length (the real string length, not the one returned by
-length) and handle the surrogates on my own (I now have some stock
libraries for that, though). An NSCharacter would only be useful if it
operated at the code point level.
So in short I treat NSString as a UTF-16 vector, use UniChar/unsigned
short all the time, and do the character-level operations on
my own. But yes, if Cocoa does have code-point-level string
operations, so much the better...
d. -
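For readers wondering what the "surrogate exercise" mentioned above amounts to, here is a minimal sketch in plain C (which also compiles as Objective-C). It is not taken from any poster's library; the function name utf16_code_point_count is made up for illustration. It assumes the text is already available as an array of UTF-16 code units, and it counts a high surrogate followed by a low surrogate as one code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Count Unicode code points in a UTF-16 buffer of `len` code units.
 * A high surrogate (0xD800..0xDBFF) followed by a low surrogate
 * (0xDC00..0xDFFF) counts as one code point; an unpaired surrogate
 * is counted as one code point on its own. */
size_t utf16_code_point_count(const uint16_t *units, size_t len)
{
    size_t count = 0;
    size_t i = 0;
    while (i < len) {
        uint16_t u = units[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < len &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            i += 2;   /* well-formed surrogate pair */
        } else {
            i += 1;   /* BMP code unit or lone surrogate */
        }
        count++;
    }
    return count;
}
```

Given the three code units 0x0041 0xD801 0xDC00, this returns 2, whereas -length on the equivalent NSString reports 3.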
--- Hans van der Meer <hansm...> wrote:
> This is a question about the treatment of characters
> in Cocoa.
> Because my program will do a lot with individual
> characters, both as
> basic type as well as encapsulated in objects, I am
> anxious to get it
> right from the start. I feel I should refrain in
> Cocoa as much as
> possible from working with plain C's char's,
> especially in the light
> of the (unsigned char) - (signed char) dilemma. Is
> that correct?
I don't feel the signedness of characters is a
particularly big roadblock in general. A bigger issue
is ensuring that you deal correctly with Unicode --
one C char in UTF-8 encoding doesn't necessarily
correspond to one glyph.
> The NSNumber class
> then has methods
> like numberWithUnsignedShort but not more
> descriptive ones for
> unichar characters like numberWithUnichar; although
> otherwise the
> use of Unicode is so clearly present in NSString.
> Does someone know
> why this is missing?
Because there's not generally a need for it? If you
really just need to hold a unichar in some non-string
object, NSValue would do the trick. Normally, though,
NSString is the class for storing characters.
I think the question is: What is it you're trying to
accomplish here? You're telling us *how* you want to
accomplish it, but it seems to be a little different
from how things are normally done in Cocoa. If you
tell us *what* you want to do, we can suggest
Cocoa-friendly ways of doing it.
Cheers,
Chuck
-
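To make the point about UTF-8 concrete, the following is a small C sketch, assuming well-formed, NUL-terminated UTF-8 input. The name utf8_code_point_count is hypothetical. It shows that the number of chars and the number of code points differ, and that converting each byte to unsigned char before inspecting it sidesteps the signed/unsigned question. Note that it counts code points, which is still not the same thing as glyphs or user-perceived characters.

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
 * byte that is not a continuation byte (continuation bytes have the
 * bit pattern 10xxxxxx). Each byte is converted to unsigned char first,
 * so the result does not depend on whether plain char is signed.
 * Assumes the input is well-formed UTF-8. */
size_t utf8_code_point_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        unsigned char byte = (unsigned char)*s;
        if ((byte & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the two-byte string "\xC3\x9F" (the UTF-8 encoding of ß), strlen() gives 2 while this function returns 1.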
On Sep 7, 2007, at 2:17 AM, Hans van der Meer wrote:
> This is a question about the treatment of characters in Cocoa.
> Because my program will do a lot with individual characters, both
> as basic type as well as encapsulated in objects, I am anxious to
> get it right from the start. I feel I should refrain in Cocoa as
> much as possible from working with plain C's char's, especially in
> the light of the (unsigned char) - (signed char) dilemma. Is that
> correct?
>
> The NSString class works with characters in Unicode, as is told in
> the "String Programming Guide for Cocoa". Its method
> characterAtIndex retrieves its contents as type unichar and this is
> typedef'ed in NSString.h as unsigned short. The NSNumber class then
> has methods like numberWithUnsignedShort but not more descriptive
> ones for unichar characters like numberWithUnichar; although
> otherwise the use of Unicode is so clearly present in NSString.
> Does someone know why this is missing?
>
> In fact I would prefer handling character objects with something
> like NSCharacter (for example as a subclass of NSNumber), but there
> seems to be not one among the Foundation classes. I feel myself not
> up to the task of subclassing in a class cluster. Maybe someone has
> already made one as a subclass of NSNumber?
>
> I would be obliged to hear from the experts what is considered the
> most appropriate way to handle characters in Cocoa programming.
> Thanks in advance.
We try to discourage developers from working at the level of
individual characters wherever possible, primarily because in Unicode
the individual character is usually not the appropriate level at
which to operate. This is something that's difficult for those of us
who were raised on char *'s to get used to, but it's important to get
right. In Unicode the appropriate object on which to operate for
most semantic purposes is (at least) a character cluster, such as a
base character and its combining marks, or a block of Hangul jamo.
In Cocoa terms this is a range of characters in an NSString; suitable
ranges can be obtained using such methods as
rangeOfComposedCharacterSequenceAtIndex:. This will also cover the
case of surrogate pairs that arises from NSString's use of UTF-16.
NSString/CFString supply a great variety of methods/functions that
operate on character ranges in a Unicode-conformant fashion: the
rangeOfCharacterFromSet:... methods, the rangeOfString: methods, the
compare:... methods, and so forth. They also provide a long list of
Unicode operations, such as casing, normalization, and other
transformations.
Even in apparently simple operations such as casing, the need for
operating on more than a single character is apparent. For example,
in German we have ß->SS on uppercasing, going from one character to
two; when we get to Greek, the complications increase significantly,
and there are many other examples from less prominent languages.
The basic recommendation for dealing with characters is to work with
strings, and ranges in strings, and substrings, and as much as
possible to use the NSString methods that deal with these; that lets
the kit handle all of the difficult Unicode issues. For those who
need to do their own low-level processing, and who are willing to
handle Unicode complications themselves, we provide access to UTF-16
directly via characterAtIndex: et al., and to other representations
with getBytes:... and related methods.
In practice, I have found that many operations for which I had
expected to have to use individual character operations (probably due
to habits of thought acquired in the days of char *'s) actually could
be done fairly simply with a little thought and a suitable
combination of rangeOfCharacterFromSet:..., rangeOfString:...,
compare:..., and related methods.
Douglas Davidson -
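As a rough illustration of why the unit of interest is a range rather than a single unichar, here is a deliberately simplified C sketch. It is not how rangeOfComposedCharacterSequenceAtIndex: is implemented, and both function names are invented for this example. It only knows about surrogate pairs and the Combining Diacritical Marks block (U+0300..U+036F); Hangul jamo and the many other cases are ignored, which is a good reason to let NSString do this work in real code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true only for the Combining Diacritical Marks
 * block (U+0300..U+036F). Unicode has many more combining marks. */
static int is_basic_combining_mark(uint16_t u)
{
    return u >= 0x0300 && u <= 0x036F;
}

/* Return the length, in UTF-16 code units, of the simplified cluster
 * that starts at index `i` (requires i < len). */
size_t simple_cluster_length(const uint16_t *units, size_t len, size_t i)
{
    size_t start = i;
    /* Keep a well-formed surrogate pair together. */
    if (units[i] >= 0xD800 && units[i] <= 0xDBFF && i + 1 < len &&
        units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
        i += 2;
    } else {
        i += 1;
    }
    /* Attach any following basic combining marks to the base. */
    while (i < len && is_basic_combining_mark(units[i])) {
        i++;
    }
    return i - start;
}
```

For the code units 0x0065 0x0301 (e followed by a combining acute accent), the cluster starting at index 0 has length 2, so treating index 0 alone as "the character" would split the accent from its base.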
On 7 Sep 2007, at 21:02, <cocoa-dev-request...> wrote:
>>
>> I would be obliged to hear from the experts what is considered the
>> most appropriate way to handle characters in Cocoa programming.
>> Thanks in advance.
>
> We try to discourage developers from working at the level of
> individual characters wherever possible, primarily because in Unicode
> the individual character is usually not the appropriate level at
> which to operate. This is something that's difficult for those of us
> who were raised on char *'s to get used to, but it's important to get
> right. In Unicode the appropriate object on which to operate for
> most semantic purposes is (at least) a character cluster, such as a
> base character and its combining marks, or a block of Hangul jamo.
>
> In Cocoa terms this is a range of characters in an NSString; suitable
> ranges can be obtained using such methods as
> rangeOfComposedCharacterSequenceAtIndex:. This will also cover the
> case of surrogate pairs that arises from NSString's use of UTF-16.
> NSString/CFString supply a great variety of methods/functions that
> operate on character ranges in a Unicode-conformant fashion: the
> rangeOfCharacterFromSet:... methods, the rangeOfString: methods, the
> compare:... methods, and so forth. They also provide a long list of
> Unicode operations, such as casing, normalization, and other
> transformations.
>
> Even in apparently simple operations such as casing, the need for
> operating on more than a single character is apparent. For example,
> in German we have ß->SS on uppercasing, going from one character to
> two; when we get to Greek, the complications increase significantly,
> and there are many other examples from less prominent languages.
>
> The basic recommendation for dealing with characters is to work with
> strings, and ranges in strings, and substrings, and as much as
> possible to use the NSString methods that deal with these; that lets
> the kit handle all of the difficult Unicode issues. For those who
> need to do their own low-level processing, and who are willing to
> handle Unicode complications themselves, we provide access to UTF-16
> directly via characterAtIndex: et al., and to other representations
> with getBytes:... and related methods.
This is an excellent summary.
One might add that -[NSString length], which the documentation says
"Returns the number of Unicode characters in the receiver.", does
nothing of the kind; it returns the number of shorts used with
NSUnicodeStringEncoding (aka UTF-16).
For example: [[NSString stringWithUTF8String: "𐐀"] length] = 2 (in case
someone cannot handle Unicode (like the mail digest software at
Apple): this is U+10400 DESERET CAPITAL LETTER LONG I), although the
string clearly contains one character.
And one should also note that "characterAtIndex:" does not do what
the name indicates, but returns the short at the index in UTF-16.
getCharacters: "Returns by reference the characters from the
receiver." The documentation really should mention in which
encoding these characters will be copied.
Maybe the documentation could be slightly improved: it is confusing
if it says "character" when it means "unsigned short int in a
specific (but unspecified) encoding".
Kind regards,
Gerriet. -
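The length of 2 in the example above follows directly from how UTF-16 encodes code points beyond U+FFFF. A short C sketch of the standard encoding rule is given here for reference; the function name utf16_encode is made up for this example and is not a Foundation API.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 into out[0..1].
 * Returns the number of code units written (1 or 2), or 0 if `cp`
 * is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;                       /* out of range or a surrogate */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* BMP: one code unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high (leading) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trailing) surrogate */
    return 2;
}
```

For U+10400 (DESERET CAPITAL LETTER LONG I) this writes the two code units 0xD801 0xDC00, which is why -length reports 2 for a string that contains a single code point.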
On 10.09.2007, at 13:38, Gerriet M. Denkmann wrote:
> One might add that -[NSString length], which the documentation says
> "Returns the number of Unicode characters in the receiver." does
> nothing like this, but returns the number of shorts used with
> NSUnicodeStringEncoding (aka Utf-16).
> For example: [[NSString stringWithUTF8String: "𐐀" ] length] = 2
> (if someone cannot handle Unicode (like the mail digest software at
> Apple) : this is a DESERET CAPITAL LETTER LONG I) - although the
> string clearly contains one character.
>
> And one should also note that "characterAtIndex:" does not do what
> the name indicates, but returns the short at the index in utf-16.
>
> getCharacters: "Returns by reference the characters from the
> receiver." - the documentation really should mention in which
> encoding these characters will be copied.
>
> Maybe the documentation could be slightly improved: it is confusing
> if it says "character" when it means "unsigned short int in a
> specific (but unspecified) encoding".
Well, it *is* a character: It is the UTF16 character that would be
at the specified index in the normalized form, I guess...?
Cheers,
-- M. Uli Kusterer
http://www.zathras.de -
On 9/10/07, Uli Kusterer <witness.of.teachtext...> wrote:
> Well, it *is* a character: It is the UTF16 character that would be
> at the specified index in the normalized form, I guess...?
Composites aside, it may also be pointing to a trailing (low)
surrogate, which is not that meaningful if you don't have its leading
(high) counterpart, and so characterAtIndex: can be confusing in those
cases...
d. -
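One way to cope with an index that lands on a trailing surrogate is to step back to the leading half before decoding. The following C sketch shows the idea, assuming the caller already holds the UTF-16 code units in an array; the function name utf16_code_point_covering is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point that covers index `i` (requires i < len) of a
 * UTF-16 buffer. If `i` lands on the low half of a well-formed pair,
 * step back to the high half first. On return, *start holds the index
 * where the code point begins. Unpaired surrogates are returned as is. */
uint32_t utf16_code_point_covering(const uint16_t *units, size_t len,
                                   size_t i, size_t *start)
{
    /* Landed on a trailing surrogate with a leading one before it? */
    if (i > 0 && units[i] >= 0xDC00 && units[i] <= 0xDFFF &&
        units[i - 1] >= 0xD800 && units[i - 1] <= 0xDBFF) {
        i--;
    }
    *start = i;
    if (units[i] >= 0xD800 && units[i] <= 0xDBFF && i + 1 < len &&
        units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
        uint32_t high = units[i] - 0xD800u;
        uint32_t low = units[i + 1] - 0xDC00u;
        return 0x10000u + (high << 10) + low;   /* combine the two halves */
    }
    return units[i];
}
```

With the code units 0x0041 0xD801 0xDC00, asking for index 1 or index 2 yields U+10400 in both cases, with *start set to 1.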
On 10 Sep 2007, at 15:36, Uli Kusterer wrote:
> On 10.09.2007, at 13:38, Gerriet M. Denkmann wrote:
>> One might add that -[NSString length], which the documentation
>> says "Returns the number of Unicode characters in the receiver."
>> does nothing like this, but returns the number of shorts used with
>> NSUnicodeStringEncoding (aka Utf-16).
>> For example: [[NSString stringWithUTF8String: "𐐀" ] length] = 2
>> (if someone cannot handle Unicode (like the mail digest software
>> at Apple) : this is a DESERET CAPITAL LETTER LONG I) - although
>> the string clearly contains one character.
>>
>> And one should also note that "characterAtIndex:" does not do what
>> the name indicates, but returns the short at the index in utf-16.
>>
>> getCharacters: "Returns by reference the characters from the
>> receiver." - the documentation really should mention in which
>> encoding these characters will be copied.
>>
>> Maybe the documentation could be slightly improved: it is
>> confusing if it says "character" when it means "unsigned short int
>> in a specific (but unspecified) encoding".
>
> Well, it *is* a character: It is the UTF16 character that would be
> at the specified index in the normalized form, I guess...?
Well, when I hear "character" I think of something like what the
LayoutManager calls a "glyph": something that can be seen, as opposed
to bits representing such a "character" according to some encoding.
So I would say that the strings "𐐀" and "∂" both have one
character, even if one is encoded in UTF-16 as 2 shorts and the other as
1 short.
Maybe this is the reason for my confusion.
Kind regards,
Gerriet. -
On 9/10/07, Uli Kusterer <witness.of.teachtext...> wrote:
> On 10.09.2007, at 13:38, Gerriet M. Denkmann wrote:
>> One might add that -[NSString length], which the documentation says
>> "Returns the number of Unicode characters in the receiver." does
>> nothing like this, but returns the number of shorts used with
>> NSUnicodeStringEncoding (aka Utf-16).
>> For example: [[NSString stringWithUTF8String: "𐐀" ] length] = 2
>> (if someone cannot handle Unicode (like the mail digest software at
>> Apple) : this is a DESERET CAPITAL LETTER LONG I) - although the
>> string clearly contains one character.
>>
>> And one should also note that "characterAtIndex:" does not do what
>> the name indicates, but returns the short at the index in utf-16.
>>
>> getCharacters: "Returns by reference the characters from the
>> receiver." - the documentation really should mention in which
>> encoding these characters will be copied.
>>
>> Maybe the documentation could be slightly improved: it is confusing
>> if it says "character" when it means "unsigned short int in a
>> specific (but unspecified) encoding".
>
> Well, it *is* a character: It is the UTF16 character that would be
> at the specified index in the normalized form, I guess...?
Ah, but UTF-16 code units are not characters; the term "UTF-16
character" is meaningless. For the BMP, there *is* a one-to-one
correspondence between UTF-16 code units and Unicode code points, but
this is not true in the general case. Outside of the BMP, it takes two
UTF-16 code units to represent a single Unicode code point.
--
Clark S. Cox III
<clarkcox3...> -
On Sep 10, 2007, at 8:21 AM, Clark Cox wrote:
> Ah, but UTF-16 code units are not characters; the term "UTF-16
> character" is meaningless. For the BMP, there *is* a one-to-one
> correspondence between UTF-16 code units and Unicode code points, but
> this is not true in the general case. Outside of the BMP, it takes two
> UTF-16 code units to represent a single Unicode code point.
We have this terminology problem for historical reasons;
characterAtIndex: antedates the introduction of surrogate pairs.
Whatever the terminology, NSStrings are conceptually UTF-16, and the
-length et al. methods reflect that. (This is a common practice in
other frameworks as well.)
Fortunately, as I mentioned, most developers should not have to worry
about this. If you work with ranges and substrings rather than with
individual characters, and use the NSString methods that deal with
ranges, they should automatically handle not only most issues with
surrogate pairs, but also the more common cases of combining
characters, Hangul, etc.
Chapter 2 of the Unicode 5 book has a very good discussion of "text
elements", which explains in great detail why it is that the elements
that are important for most text processes are in general sequences
of characters rather than single characters. Single characters are
important for the fundamental definitional purposes of the standard,
but in practice what one wishes to deal with for text processing is a
sequence of characters constituting a cluster or larger unit.
Douglas Davidson


