How to count composed characters in NSString?

  • Hi,

    I have been trying to find this in the documentation and list archives
    but without success so far. What is the best way to count the number
    of characters in an NSString, taking into account the fact that some
    characters may take up multiple 16-bit slots? Using
    "- (NSUInteger)length" is thus not the right way. Using a series of
    calls to "rangeOfComposedCharacterSequenceAtIndex:" seems like a
    possibility, but I am not sure this would be the most efficient way.
    Is there a simple and straightforward solution? I would like to be
    able to display the number of characters in a string and not report
    the wrong results for foreign languages (which I would get if I simply
    took the length of the string). I need a solution that does not work
    only on Leopard (i.e. CFStringTokenizer is not an option) and that
    does not require using the lower-level UCFindTextBreak.

    Thanks,

    david.
  • On Sep 27, 2008, at 2:23 PM, David Niemeijer wrote:

    > Hi,
    >
    > I have been trying to find this in the documentation and list
    > archives but without success so far. What is the best way to count
    > the number of characters in an NSString taking account of the fact
    > that some characters may take up multiple 16 bit slots. Using "-
    > (NSUInteger)length" is thus not the right way.

    If I am reading you right, you are saying that -length will give you
    the wrong results because some characters in Unicode are represented
    by multibyte sequences. This is incorrect: -length will give you the
    number of Unicode characters in a string, not the number of bytes.

    However, there are characters like "combining grave accent" (U+0300)
    that will usually not be displayed as a separate character, so there
    is a potential problem if you want to know how many characters will
    actually be displayed. The solution is to put the string into one of
    the composed Normalization Forms with either
    -precomposedStringWithCanonicalMapping (NFC) or
    -precomposedStringWithCompatibilityMapping (NFKC), depending on your
    needs. Then calling -length should give you the result you are looking
    for.

    For information on Unicode Normalization Forms, see
    http://unicode.org/reports/tr15/.

    -Michael
  • On Sun, 28 Sep 2008 03:27:48 -0500, Michael Gardner
    <gardnermj...> wrote:
    >
    > On Sep 27, 2008, at 2:23 PM, David Niemeijer wrote:
    >
    >> Hi,
    >>
    >> I have been trying to find this in the documentation and list
    >> archives but without success so far. What is the best way to count
    >> the number of characters in an NSString taking account of the fact
    >> that some characters may take up multiple 16 bit slots. Using "-
    >> (NSUInteger)length" is thus not the right way.
    >
    > If I am reading you right, you are saying that -length will give you
    > the wrong results because some characters in Unicode are represented
    > by multibyte sequences. This is incorrect: -length will give you the
    > number of Unicode characters in a string [...].

    This surprises me. I always thought that "length" gives you the
    number of shorts in the Utf-16 encoding of the string, which - as I
    used to think - is not the same as the number of Unicode code points
    in this string.

    But maybe you are right and I am confused.

    Kind regards,

    Gerriet.
  • On Sep 28, 2008, at 5:53 AM, Gerriet M. Denkmann wrote:

    >
    > On Sun, 28 Sep 2008 03:27:48 -0500, Michael Gardner <gardnermj...>
    >> wrote:
    >>
    >> On Sep 27, 2008, at 2:23 PM, David Niemeijer wrote:
    >>
    >>> Hi,
    >>>
    >>> I have been trying to find this in the documentation and list
    >>> archives but without success so far. What is the best way to count
    >>> the number of characters in an NSString taking account of the fact
    >>> that some characters may take up multiple 16 bit slots. Using "-
    >>> (NSUInteger)length" is thus not the right way.
    >>
    >> If I am reading you right, you are saying that -length will give you
    >> the wrong results because some characters in Unicode are represented
    >> by multibyte sequences. This is incorrect: -length will give you the
    >> number of Unicode characters in a string [...].
    >
    > This surprises me. I always thought that "length" gives you the
    > number of shorts in the Utf-16 encoding of the string, which - as I
    > used to think - is not the same as the number of Unicode code points
    > in this string.
    >
    > But maybe you are right and I am confused.

    Upon further investigation, I may be wrong. I based my assertion upon
    Apple's NSString documentation ("Returns the number of Unicode
    characters in the receiver"), and upon some quick tests I ran. But
    this reply made me look into the issue in greater depth.

    I re-did my tests more thoroughly, and it does appear that -length
    returns the number of 16-bit words (code units), not the number of
    Unicode characters (code points), in the string. If this is true, I
    would call it a bug either in the code or in the documentation, which
    David should submit to Apple.

    I apologize for the apparent misinformation in my previous, hasty reply.

    In the meantime, David, perhaps you can find a library that can work
    with UTF-8 strings. What are you using the length values for?

    -Michael
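
The observation above (that -length counts 16-bit code units) can be checked with a few lines. This is a minimal sketch, assuming a Foundation environment; the helper names are made up for illustration. The strings are built from explicit UTF-16 code units so that nothing depends on the encoding of the source file.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: wraps +stringWithCharacters:length: so the
// UTF-16 code units of each sample string are spelled out explicitly.
static NSString *StringFromCodeUnits(const unichar *units, NSUInteger count) {
    return [NSString stringWithCharacters:units length:count];
}

// U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual
// Plane, so UTF-16 stores it as the surrogate pair D834 DD1E.
static NSUInteger LengthOfGClef(void) {
    const unichar clef[] = { 0xD834, 0xDD1E };
    return [StringFromCodeUnits(clef, 2) length];    // one code point, two code units
}

// U+00E9 (precomposed e with acute) fits in a single code unit.
static NSUInteger LengthOfPrecomposedEAcute(void) {
    const unichar eAcute[] = { 0x00E9 };
    return [StringFromCodeUnits(eAcute, 1) length];
}
```

LengthOfGClef() returns 2 even though the string holds a single Unicode code point, while LengthOfPrecomposedEAcute() returns 1.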
  • On Sat, Sep 27, 2008 at 3:23 PM, David Niemeijer <lists...> wrote:
    > Hi,
    >
    > I have been trying to find this in the documentation and list archives but
    > without success so far. What is the best way to count the number of
    > characters in an NSString taking account of the fact that some characters
    > may take up multiple 16 bit slots. Using "- (NSUInteger)length" is thus not
    > the right way. Using a series of calls to
    > "rangeOfComposedCharacterSequenceAtIndex:" seems like a possibility, but I
    > am not sure this would be the most efficient way. Is there a simple and
    > straightforward solution? I would like to be able to display the number of
    > characters in a string and not report the wrong results for foreign
    > languages (which I would get if I simply took the length of the string). I
    > need a solution that does not only work in Leopard (i.e. CFStringTokenizer
    > is not an option) and that does not require using the lower level
    > UCFindTextBreak.

    First I recommend you simply give up on the concept. You've stumbled
    into a tough problem, one which is not all that useful, and it may be
    better to skip it. Of course I don't know what you're using it for,
    but in general counting the number of characters in a string is not a
    useful thing to do.

    That said, if you want to continue, I'd suggest that you first figure
    out what you mean by a "character". You mention composed character
    sequences, but conceptually those are separate characters which happen
    to display as a single unit. Your description sounds like you want to
    catch UTF-16 surrogate pairs. Do you also want to catch things like é
    (accented e), which can be encoded as two separate Unicode code
    points? Is space a character? Is a ligature like the "fi" glyph found
    in many fonts one character or many? Note that these are not
    rhetorical questions, and there is more than one right answer to each
    depending on what you want to do.

    Mike
  • Michael,

    On 28 sep 2008, at 14:41, Michael Gardner wrote:
    > Upon further investigation, I may be wrong. I based my assertion
    > upon Apple's NSString documentation ("Returns the number of Unicode
    > characters in the receiver"), and upon some quick tests I ran. But
    > this reply made me look into the issue in greater depth.
    >
    > I re-did my tests more thoroughly, and it does appear that -length
    > returns the number of 16-bit words (code units), not the number of
    > Unicode characters (code points), in the string. If this is true, I
    > would call it a bug either in the code or in the documentation,
    > which David should submit to Apple.

    I think the docs are clear. In the discussion section for "length" it
    says: "The number returned includes the individual characters of
    composed character sequences, so you cannot use this method to
    determine if a string will be visible when printed or how long it will
    appear."

    I did file a bug (ID 6253075) as you suggested, because I think there
    should be a simple API for this.

    > I apologize for the apparent misinformation in my previous, hasty
    > reply.

    Well, I made an error too. I suggested that on 10.5 the
    CFStringTokenizer could be used, but only now noticed that it only
    supports larger units (words and up). Thus there is no easy API to
    count the number of characters in a way that surrogate pairs or other
    "long" Unicode characters are treated as a single character.

    > In the meanwhile, David, perhaps you can find a library that can
    > work with UTF-8 strings. What are you using the length values for?

    I need to be able to display the number of characters to the user in a
    way that makes sense to them. If they see 3 I should report 3. I also
    need it to cut-off certain input to the number of "real" characters
    and should not generate results that only make sense for a language
    like English where each 16 bits equals a single character.

    Using some kind of UTF-8 library may be possible, but that would
    require converting all the time between UTF-16 and UTF-8, which is not
    efficient for a program that has to do a lot of these kinds of
    calculations.

    david.
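
The cut-off requirement described above (limit input to a number of "real" characters without splitting one in half) can be sketched with the NSString method mentioned at the start of the thread. This is only an illustration, assuming that composed character sequences are an acceptable stand-in for "real" characters; the function name is hypothetical.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the longest prefix of `string` that
// contains at most `maxCount` composed character sequences. The cut
// always falls on a sequence boundary, never inside a surrogate pair
// or between a base character and its combining marks.
static NSString *PrefixWithComposedCharacters(NSString *string, NSUInteger maxCount) {
    NSUInteger index = 0;
    NSUInteger length = [string length];
    NSUInteger count = 0;
    while (index < length && count < maxCount) {
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        index = NSMaxRange(range);   // jump to the start of the next sequence
        count++;
    }
    return [string substringToIndex:index];
}
```

For a string made of e + U+0301, U+1D11E and "x" (five code units), asking for two characters keeps the first four code units, whereas a plain substringToIndex:2 would drop the clef entirely and substringToIndex:3 would split its surrogate pair.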
  • On Sun, Sep 28, 2008 at 2:17 PM, David Niemeijer <lists...> wrote:
    > I need to be able to display the number of characters to the user in a way
    > that makes sense to them. If they see 3 I should report 3. I also need it to
    > cut-off certain input to the number of "real" characters and should not
    > generate results that only make sense for a language like English where each
    > 16 bits equals a single character.

    Perhaps more information on why this is a requirement would be
    helpful.  Since it's apparent that you're going to be dealing with
    languages other than English, there won't be only one set of rules for
    you to follow.  For example, in Dutch, IJ is one letter.  In Spanish,
    you might treat ll and ch as one letter or not, depending on which
    region you're using and whether you're performing collation or just
    counting the number of letters.  If you can explain why counting
    characters is important to your app, we might be able to help you
    better.

    --Kyle Sluder
  • Users don't see characters, they see glyphs. If you want your count to
    maximally agree with user perception, you need to be counting glyphs,
    not characters.

    See NSLayoutManager, esp:

    - (NSRange)glyphRangeForCharacterRange:(NSRange)charRange

      -- and friends.

    If you are showing strings to the user, do so in an NSTextView, and
    then query the NSLayoutManager associated with that view.

  • On Sep 28, 2008, at 12:02 PM, <cocoa-dev-request...> wrote:

    > Michael,
    >
    > On 28 sep 2008, at 14:41, Michael Gardner wrote:
    >> Upon further investigation, I may be wrong. I based my assertion
    >> upon Apple's NSString documentation ("Returns the number of Unicode
    >> characters in the receiver"), and upon some quick tests I ran. But
    >> this reply made me look into the issue in greater depth.
    >>
    >> I re-did my tests more thoroughly, and it does appear that -length
    >> returns the number of 16-bit words (code units), not the number of
    >> Unicode characters (code points), in the string. If this is true, I
    >> would call it a bug either in the code or in the documentation,
    >> which David should submit to Apple.
    >
    > I think the docs are clear. In the discussion section for "length" it
    > says: "The number returned includes the individual characters of
    > composed character sequences, so you cannot use this method to
    > determine if a string will be visible when printed or how long it will
    > appear."
    >
    > I did file a bug (ID 6253075) as you suggested, because I think there
    > should be a simple API for this.
    >
    >> I apologize for the apparent misinformation in my previous, hasty
    >> reply.
    >
    > Well, I made an error too. I suggested that on 10.5 the
    > CFStringTokenizer could be used, but only now noticed that it only
    > supports larger units (words and up). Thus there is no easy API to
    > count the number of characters in a way that surrogate pairs or other
    > "long" unicode characters are treated as a single character.

    David,
    Check out CFStringGetRangeOfComposedCharactersAtIndex. It finds the
    kinds of text boundaries that I think you are interested in. You would
    just need to iterate over the string calling this for each iteration
    to find the next boundary.

    -Peter Edberg, Apple
  • On Sep 28, 2008, at 3:05 PM, Peter Edberg wrote:
    >
    > David,
    > Check out CFStringGetRangeOfComposedCharactersAtIndex. It finds the
    > kinds of text boundaries that I think you are interested in. You
    > would just need to iterate over the string calling this for each
    > iteration  to find the next boundary.

    Apologies, I see now that in your original post you already
    mentioned rangeOfComposedCharacterSequenceAtIndex. That would be
    preferred :-)
    -Peter
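
The iteration suggested above can be sketched as follows. This is a minimal illustration rather than a tuned implementation; the function name is hypothetical, and only -length and -rangeOfComposedCharacterSequenceAtIndex: are used.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts composed character sequences by hopping
// from one sequence boundary to the next.
static NSUInteger ComposedCharacterCount(NSString *string) {
    NSUInteger count = 0;
    NSUInteger index = 0;
    NSUInteger length = [string length];
    while (index < length) {
        // The returned range covers the whole sequence containing `index`.
        NSRange range = [string rangeOfComposedCharacterSequenceAtIndex:index];
        index = NSMaxRange(range);   // first code unit after that sequence
        count++;
    }
    return count;
}
```

For a string consisting of e + U+0301 followed by U+1D11E, -length reports 4 while ComposedCharacterCount reports 2. Each call advances by at least one code unit, so the loop makes at most -length calls.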
  • On Sep 28, 2008, at 1:17 PM, David Niemeijer wrote:

    > Michael,
    >
    > On 28 sep 2008, at 14:41, Michael Gardner wrote:
    >> Upon further investigation, I may be wrong. I based my assertion
    >> upon Apple's NSString documentation ("Returns the number of Unicode
    >> characters in the receiver"), and upon some quick tests I ran. But
    >> this reply made me look into the issue in greater depth.
    >>
    >> I re-did my tests more thoroughly, and it does appear that -length
    >> returns the number of 16-bit words (code units), not the number of
    >> Unicode characters (code points), in the string. If this is true, I
    >> would call it a bug either in the code or in the documentation,
    >> which David should submit to Apple.
    >
    > I think the docs are clear. In the discussion section for "length"
    > it says: "The number returned includes the individual characters of
    > composed character sequences, so you cannot use this method to
    > determine if a string will be visible when printed or how long it
    > will appear."

    But composed character sequences aren't the problem; surrogate pairs
    are. Composed character sequences can be taken care of by using either
    -precomposedStringWithCanonicalMapping or
    -precomposedStringWithCompatibilityMapping. In my opinion, -length
    should take surrogate pairs into account, which is what the docs seem
    to imply.

    -Michael
  • On Sep 28, 2008, at 21:52, Michael Gardner <gardnermj...> wrote:

    > On Sep 28, 2008, at 1:17 PM, David Niemeijer wrote:
    >
    >> Michael,
    >>
    >> On 28 sep 2008, at 14:41, Michael Gardner wrote:
    >>> Upon further investigation, I may be wrong. I based my assertion
    >>> upon Apple's NSString documentation ("Returns the number of
    >>> Unicode characters in the receiver"), and upon some quick tests I
    >>> ran. But this reply made me look into the issue in greater depth.
    >>>
    >>> I re-did my tests more thoroughly, and it does appear that -length
    >>> returns the number of 16-bit words (code units), not the number of
    >>> Unicode characters (code points), in the string. If this is true,
    >>> I would call it a bug either in the code or in the documentation,
    >>> which David should submit to Apple.
    >>
    >> I think the docs are clear. In the discussion section for "length"
    >> it says: "The number returned includes the individual characters of
    >> composed character sequences, so you cannot use this method to
    >> determine if a string will be visible when printed or how long it
    >> will appear."
    >
    > But composed character sequences aren't the problem; surrogate pairs
    > are. Composed character sequences can be taken care of by using
    > either -precomposedStringWithCanonicalMapping or
    > -precomposedStringWithCompatibilityMapping.

    Not true. Not all possible combinations of base characters followed by
    combining characters even have a mapping to a single precomposed
    character.

    Essentially, what one wants to do is count all of the characters with
    a combining class of zero; however, even this isn't without issues.

    > In my opinion, -length should take surrogate pairs into account,
    > which is what the docs seem to imply.
    >
    > -Michael
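
The limits of the normalization approach can be seen with three short inputs. This is a sketch, assuming a Foundation environment; the helper name is hypothetical. It relies on the fact that Unicode has a precomposed form for e with acute (U+00E9) but none for q with tilde.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: builds a string from explicit UTF-16 code units,
// converts it to NFC, and reports the resulting -length.
static NSUInteger LengthAfterNFC(const unichar *units, NSUInteger count) {
    NSString *string = [NSString stringWithCharacters:units length:count];
    return [[string precomposedStringWithCanonicalMapping] length];
}

static void LogNormalizedLengths(void) {
    const unichar eAcute[] = { 'e', 0x0301 };      // composes to U+00E9
    const unichar qTilde[] = { 'q', 0x0303 };      // no precomposed form exists
    const unichar gClef[]  = { 0xD834, 0xDD1E };   // surrogate pair for U+1D11E

    NSLog(@"e + U+0301 -> %lu", (unsigned long)LengthAfterNFC(eAcute, 2));
    NSLog(@"q + U+0303 -> %lu", (unsigned long)LengthAfterNFC(qTilde, 2));
    NSLog(@"U+1D11E    -> %lu", (unsigned long)LengthAfterNFC(gClef, 2));
}
```

Only the first input shrinks to a length of 1. The other two still report 2 after normalization, so NFC followed by -length undercounts neither combining sequences without a precomposed form nor surrogate pairs correctly.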
  • On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:

    > I need to be able to display the number of characters to the user in
    > a way that makes sense to them. If they see 3 I should report 3. I
    > also need it to cut-off certain input to the number of "real"
    > characters and should not generate results that only make sense for
    > a language like English where each 16 bits equals a single character.

    What you are describing is the notion that Unicode sometimes refers to
    as a "user-perceived character", which in general can be somewhat
    ambiguous, since different users may have different perceptions, and
    since there are writing systems in which character boundaries are not
    at all similar to those in English.  To handle this sort of issue
    programmatically, Unicode defines what are known as "grapheme
    clusters", but there is not a single notion of grapheme cluster; there
    are several such notions, depending on precisely what it is you want.

    These issues are covered in detail in Unicode Standard Annex #29,
    <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
    which gives a number of examples and some algorithms for
    determining grapheme cluster boundaries.  Grapheme clusters are
    similar to but not quite identical to composed character sequences.
    For some purposes composed character sequences may be sufficient;
    NSString gives prominence to the notion of composed character
    sequence, because that is the most important concept for arbitrary
    text processing, but if you are really interested in user-perceived
    characters you may wish to use something else.

    The most problematic scripts for this sort of determination include:
    handwriting-based scripts such as Arabic, in which (depending on the
    ligatures used in a particular font) character boundaries may not be
    readily perceptible; composed scripts such as Hangul, in which the
    script elements are in turn composed of smaller, individually
    meaningful graphic elements; and scripts involving reordering and
    combining, such as Devanagari and other Indic or Indic-influenced
    scripts.

    There is still another similar but not quite identical notion, which
    is used for determining the number and position of insertion points
    during editing.  In Leopard, NSLayoutManager has API support for
    determining insertion point positions within a line of text as it is
    laid out.  Note that insertion point boundaries are not identical to
    glyph boundaries; a ligature glyph in some cases, such as an "fi"
    ligature in Latin script, may require an internal insertion point on a
    user-perceived character boundary.

    Douglas Davidson
  • On Mon, Sep 29, 2008 at 12:52 AM, Michael Gardner <gardnermj...> wrote:
    > But composed character sequences aren't the problem; surrogate pairs are.
    > Composed character sequences can be taken care of by using either
    > -precomposedStringWithCanonicalMapping or
    > -precomposedStringWithCompatibilityMapping. In my opinion, -length should
    > take surrogate pairs into account, which is what the docs seem to imply.

    The NSString API is inherently either UCS-2 or UTF-16. As UCS-2
    doesn't cover all of Unicode, it ends up being UTF-16.

    The API defines NSString as an ordered collection of 16-bit unichars.
    The length is necessarily the number of 16-bit unichars in the string,
    nothing else would really make sense. Short of creating a new API that
    works on pure Unicode code points, the only thing to do is to document
    the fact that -length gives you the number of UTF-16 code units, not
    the number of Unicode characters.

    (As an aside, changing the API to work with Unicode code points is
    something I don't think is really worthwhile. Aside from having to
    support the old API which would no doubt be a great deal of hassle,
    Unicode code points are pretty useless on their own anyway. You always
    end up having to convert and deal with precomposed characters and all
    the rest of the Unicode mess regardless. Adding surrogate pairs to all
    of that really doesn't increase the burden any further.)

    Mike
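
For comparison, counting Unicode code points on top of the 16-bit unichar API only requires treating each surrogate pair as one unit. A minimal sketch, with a hypothetical function name; the surrogate ranges (D800-DBFF high, DC00-DFFF low) come from the UTF-16 definition.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts Unicode code points by walking the
// UTF-16 code units and treating each well-formed surrogate pair as
// a single code point. An unpaired surrogate is counted on its own.
static NSUInteger CodePointCount(NSString *string) {
    NSUInteger count = 0;
    NSUInteger length = [string length];
    NSUInteger i = 0;
    while (i < length) {
        unichar c = [string characterAtIndex:i];
        i++;
        count++;
        BOOL isHigh = (c >= 0xD800 && c <= 0xDBFF);
        if (isHigh && i < length) {
            unichar next = [string characterAtIndex:i];
            if (next >= 0xDC00 && next <= 0xDFFF) {
                i++;   // the low surrogate belongs to the same code point
            }
        }
    }
    return count;
}
```

As the message above points out, this number is rarely the one a user cares about: e + U+0301 still counts as two code points even though it displays as one character.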
  • Hi Douglas and Peter,

    On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:
    > On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:
    >
    >> I need to be able to display the number of characters to the user
    >> in a way that makes sense to them. If they see 3 I should report 3.
    >> I also need it to cut-off certain input to the number of "real"
    >> characters and should not generate results that only make sense for
    >> a language like English where each 16 bits equals a single character.
    >
    > What you are describing is the notion that Unicode sometimes refers
    > to as a "user-perceived character", which in general can be somewhat
    > ambiguous, since different users may have different perceptions, and
    > since there are writing systems in which character boundaries are
    > not at all similar to those in English.  To handle this sort of
    > issue programmatically, Unicode defines what are known as "grapheme
    > clusters", but there is not a single notion of grapheme cluster;
    > there are several such notions, depending on precisely what it is
    > you want.
    >
    > These issues are covered in detail in Unicode Standard Annex #29,
    > <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
    > which gives a number of examples and some algorithms for
    > determining grapheme cluster boundaries.  Grapheme clusters are
    > similar to but not quite identical to composed character sequences.
    > For some purposes composed character sequences may be sufficient;
    > NSString gives prominence to the notion of composed character
    > sequence, because that is the most important concept for arbitrary
    > text processing, but if you are really interested in user-perceived
    > characters you may wish to use something else.

    Thanks for your clarification. It is indeed the "grapheme clusters"
    that I am after. I need to be able to do things such as capitalize the
    first letter of a string and in doing statistical text analysis
    determine the number of "characters" of a text string. This
    description from the URL you pointed at fits my use quite well:
    "Grapheme cluster boundaries are important for collation, regular
    expressions, UI interactions (such as mouse selection, arrow key
    movement, backspacing), segmentation for vertical text, identification
    of boundaries for first-letter styling, and counting “character”
    positions within text." Using glyphs in this case is not appropriate
    as in text analysis the text itself is not displayed, nor is using
    [aString length] because it just reports the number of UTF-16 code
    units. I realize there is no perfect approach, but I am just trying to
    do something that brings me closest to what a user would expect.

    Peter confirmed earlier that
    CFStringGetRangeOfComposedCharactersAtIndex would be the way to go for
    me. But, if I read Douglas' comment then I am beginning to wonder
    whether this is the equivalent of UCFindTextBreak's
    kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past I
    used to use UCFindTextBreak with kUCTextBreakClusterMask, but unlike
    NSString, UCFindTextBreak is not available on one of the platforms I
    need to support, so what would be the right way to get at the cluster
    breaks using the NSString API? (Please contact me off list if you need
    further clarification.)

    Cheers,

    david.
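
The "capitalize the first letter" use case mentioned above could be approached along these lines. This is a sketch under the assumption that the first composed character sequence is the right unit to uppercase; the function name is hypothetical, and -uppercaseString applies the default (non-localized) case mapping.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: uppercases only the first composed character
// sequence, leaving the rest of the string untouched. Working on the
// whole sequence keeps a base letter together with its combining marks.
static NSString *StringByUppercasingFirstCluster(NSString *string) {
    if ([string length] == 0) {
        return string;
    }
    NSRange first = [string rangeOfComposedCharacterSequenceAtIndex:0];
    NSString *head = [[string substringWithRange:first] uppercaseString];
    NSString *tail = [string substringFromIndex:NSMaxRange(first)];
    return [head stringByAppendingString:tail];
}
```

Given "école" spelled with e + U+0301, the result starts with E + U+0301, so the accent stays attached to the capitalized letter.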
  • On Sep 29, 2008, at 9:27 PM, David Niemeijer wrote:

    > Hi Douglas and Peter,
    >
    > On Sep 29, 2008, at 6:39 PM, Douglas Davidson wrote:
    >> On Sep 28, 2008, at 11:17 AM, David Niemeijer wrote:
    >>
    >>> I need to be able to display the number of characters to the user
    >>> in a way that makes sense to them. If they see 3 I should report
    >>> 3. I also need it to cut-off certain input to the number of "real"
    >>> characters and should not generate results that only make sense
    >>> for a language like English where each 16 bits equals a single
    >>> character.
    >>
    >> What you are describing is the notion that Unicode sometimes refers
    >> to as a "user-perceived character", which in general can be
    >> somewhat ambiguous, since different users may have different
    >> perceptions, and since there are writing systems in which character
    >> boundaries are not at all similar to those in English.  To handle
    >> this sort of issue programmatically, Unicode defines what are known
    >> as "grapheme clusters", but there is not a single notion of
    >> grapheme cluster; there are several such notions, depending on
    >> precisely what it is you want.
    >>
    >> These issues are covered in detail in Unicode Standard Annex #29,
    >> <http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries>,
    >> which gives a number of examples and some algorithms for
    >> determining grapheme cluster boundaries.  Grapheme clusters are
    >> similar to but not quite identical to composed character
    >> sequences.  For some purposes composed character sequences may be
    >> sufficient; NSString gives prominence to the notion of composed
    >> character sequence, because that is the most important concept for
    >> arbitrary text processing, but if you are really interested in user-
    >> perceived characters you may wish to use something else.
    >
    > Thanks for your clarification. It is indeed the "grapheme clusters"
    > that I am after. I need to be able to do things such as capitalize
    > the first letter of a string and in doing statistical text analysis
    > determine the number of "characters" of a text string. This
    > description from the URL you pointed at fits my use quite well:
    > "Grapheme cluster boundaries are important for collation, regular
    > expressions, UI interactions (such as mouse selection, arrow key
    > movement, backspacing), segmentation for vertical text,
    > identification of boundaries for first-letter styling, and counting
    > “character” positions within text." Using glyphs in this case is not
    > appropriate as in text analysis the text itself is not displayed,
    > nor is using [aString length] because it just reports the number of
    > UTF-16 code units. I realize there is no perfect approach, but I am
    > just trying to do something that brings me closest to what a user
    > would expect.
    >
    > Peter confirmed earlier that
    > CFStringGetRangeOfComposedCharactersAtIndex would be the way to go
    > for me. But, if I read Douglas' comment then I am beginning to
    > wonder whether this is the equivalent of UCFindTextBreak's
    > kUCTextBreakCharMask and not of kUCTextBreakClusterMask. In the past
    > I used to use UCFindTextBreak with kUCTextBreakClusterMask, but
    > unlike NSString, UCFindTextBreak is not available on one of the
    > platforms I need to support, so what would be the right way to get
    > at the cluster breaks using the NSString API? (Please contact me off
    > list if you need further clarification.)
    >
    > Cheers,
    >
    > david.

    David,
    CFStringGetRangeOfComposedCharactersAtIndex and -[NSString
    rangeOfComposedCharacterSequenceAtIndex:] are the modern replacements
    for UCFindTextBreak with kUCTextBreakClusterMask and indeed they now
    are closer to the original intent of kUCTextBreakClusterMask than the
    current implementation of kUCTextBreakClusterMask is (since
    UCFindTextBreak was converted to follow Unicode/ICU default text
    segmentation rules).

    The modern functions treat all of the following as a cluster:
    - A surrogate pair (of course, since it is a single character);
    - A base character followed by a sequence of combining marks (whether
    or not this is something that would be composed under NFC);
    - A Hangul syllable expressed as a sequence of conjoining jamo;
    - An Indic consonant cluster such as consonant + virama + consonant +
    vowel matra. It is this last cluster that is no longer treated as a
    single entity by UCFindTextBreak with kUCTextBreakClusterMask.

    -Peter
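
One way to try the four cases listed above on a given system is to ask whether the first composed character sequence spans the entire string. This is a sketch with hypothetical helper names and sample code points chosen for illustration; the results for the Hangul and Indic samples are logged rather than assumed, since they depend on the Foundation version in use.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if the whole string is reported as one
// composed character sequence.
static BOOL IsSingleComposedSequence(const unichar *units, NSUInteger count) {
    NSString *string = [NSString stringWithCharacters:units length:count];
    if ([string length] == 0) {
        return NO;
    }
    NSRange first = [string rangeOfComposedCharacterSequenceAtIndex:0];
    return NSEqualRanges(first, NSMakeRange(0, [string length]));
}

static void LogClusterExamples(void) {
    const unichar surrogatePair[] = { 0xD834, 0xDD1E };              // U+1D11E
    const unichar baseAndMarks[]  = { 'q', 0x0303, 0x0323 };         // q + tilde + dot below
    const unichar hangulJamo[]    = { 0x1112, 0x1161, 0x11AB };      // conjoining jamo
    const unichar indicCluster[]  = { 0x0915, 0x094D, 0x0937, 0x093F }; // consonant + virama + consonant + matra

    NSLog(@"surrogate pair:  %d", (int)IsSingleComposedSequence(surrogatePair, 2));
    NSLog(@"base + marks:    %d", (int)IsSingleComposedSequence(baseAndMarks, 3));
    NSLog(@"Hangul jamo:     %d", (int)IsSingleComposedSequence(hangulJamo, 3));
    NSLog(@"Indic cluster:   %d", (int)IsSingleComposedSequence(indicCluster, 4));
}
```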
  • Hi Peter,

    On Sep 30, 2008, at 7:58 AM, Peter Edberg wrote:
    > CFStringGetRangeOfComposedCharactersAtIndex and -[NSString
    > rangeOfComposedCharacterSequenceAtIndex:] are the modern
    > replacements for UCFindTextBreak with kUCTextBreakClusterMask and
    > indeed they now are closer to the original intent of
    > kUCTextBreakClusterMask than the current implementation of
    > kUCTextBreakClusterMask is (since UCFindTextBreak was converted to
    > follow Unicode/ICU default text segmentation rules).
    >
    > The modern functions treat all of the following as a cluster:
    > - A surrogate pair (of course, since it is a single character);
    > - A base character followed by a sequence of combining marks
    > (whether or not this is something that would be composed under NFC);
    > - A Hangul syllable expressed as a sequence of conjoining jamo;
    > - An Indic consonant cluster such as consonant + virama + consonant
    > + vowel matra. It is this last cluster that is no longer treated
    > as a single entity by UCFindTextBreak with kUCTextBreakClusterMask.

    Ok, understood. This looks good. Based on the discussion I have
    updated my bug report 6253075. I think a "convenience" method that
    returns the cluster count would be very useful as it is probably
    faster than if we manually roll a counter method using repeated calls
    to rangeOfComposedCharacterSequenceAtIndex and because it will, by its
    simple availability, reduce some of the confusion that I sense on this
    list as to what the most appropriate way is to count "characters".
    There would be "length" to count the number of UTF-16 units and a
    "numberOfCharacters" to count the clusters that are closest to the
    human conception of characters.

    Thanks,

    david.
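
A convenience method of the kind proposed above could be sketched as a category. This is not an Apple API: the category name and the method name `numberOfCharacters` are hypothetical (the latter taken from the suggestion in this message), and a shipping category on a framework class would normally carry a prefix. The sketch assumes a later system than the ones discussed in this thread, since it uses -enumerateSubstringsInRange:options:usingBlock:, which was introduced in Mac OS X 10.6.

```objc
#import <Foundation/Foundation.h>

// Hypothetical category: not part of Foundation.
@interface NSString (ComposedCharacterCounting)
- (NSUInteger)numberOfCharacters;
@end

@implementation NSString (ComposedCharacterCounting)

// Counts composed character sequences. Requires Mac OS X 10.6 or
// later because of the block-based enumeration API.
- (NSUInteger)numberOfCharacters {
    __block NSUInteger count = 0;
    NSStringEnumerationOptions options =
        NSStringEnumerationByComposedCharacterSequences |
        NSStringEnumerationSubstringNotRequired;   // skip creating substrings
    [self enumerateSubstringsInRange:NSMakeRange(0, [self length])
                             options:options
                          usingBlock:^(NSString *substring, NSRange substringRange,
                                       NSRange enclosingRange, BOOL *stop) {
        count++;
    }];
    return count;
}

@end
```

With this in place, -length continues to report UTF-16 code units while -numberOfCharacters reports composed character sequences, which mirrors the split suggested above.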