display Unicode string one character at a time, not as simple as it seems

  • Hello,

    I need to break a string down into individual characters.

    In English that's pretty easy.

    But in some languages what a user perceives as a single block is
    actually a base character plus  accents plus vowel markers plus tone
    markers plus...

    eg:     เก

    is made of

    U+0E40 ( เ ) thai character sara e
    U+0E01 ( ก ) thai character ko kai

    To help with this NSString has the methods:

    rangeOfComposedCharacterSequencesForRange:
    rangeOfComposedCharacterSequenceAtIndex:

    and CFString has:

    CFStringGetRangeOfComposedCharactersAtIndex.

    But then some languages, like German, will sometimes combine certain
    blocks together,

    so SS becomes ß.

    The document http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
    seems to have some good information about this, so I'm not completely
    lost as to how to proceed. But this strikes me as one of those problems
    that other people have struck many times before.

    Any suggestions would be deeply appreciated.

    thank you

    mathew
  • On Tue, Feb 17, 2009 at 2:13 PM, mathew davis <compoundeye.dev...> wrote:
    > To help with this NSString has the methods:
    >
    > rangeOfComposedCharacterSequencesForRange:
    > rangeOfComposedCharacterSequenceAtIndex:
    >
    > and CFString has:
    >
    > CFStringGetRangeOfComposedCharactersAtIndex.
    >
    >
    >
    > but then some languages -  like german, will sometimes combine certain
    > blocks together
    >
    > so SS becomes ß

    How, *exactly*, are the aforementioned methods/functions not working for you?

    --
    Clark S. Cox III
    <clarkcox3...>
  • Hi Clark,

    It turns out I had really misunderstood something about how some
    characters, such as the German ß, are stored.

    I thought it was much more complex than it really is.

    I thought the single character ß was composed of two grapheme clusters.

    Actually:

          rangeOfComposedCharacterSequencesForRange:
          rangeOfComposedCharacterSequenceAtIndex:

    will do their job just fine

    but I wouldn't have figured that out if you hadn't asked me, so thank
    you.

    cheers

    mathew
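
A minimal sketch of the point above, written in Python rather than Cocoa so it can be run anywhere (the Cocoa methods themselves are not exercised here). It uses only the standard `unicodedata` module to show that ß is a single code point with no decomposition, and that the link between ß and "SS" is a case mapping rather than a composition. The variable names are illustrative only.

```python
import unicodedata

sharp_s = "\u00df"  # ß

# ß is one code point, not a base character plus combining marks.
print(len(sharp_s), unicodedata.name(sharp_s))

# Neither canonical (NFD) nor compatibility (NFKD) decomposition changes it.
nfd_unchanged = unicodedata.normalize("NFD", sharp_s) == sharp_s
nfkd_unchanged = unicodedata.normalize("NFKD", sharp_s) == sharp_s
print(nfd_unchanged, nfkd_unchanged)

# "SS" only appears when the string is uppercased (a case mapping).
uppercased = sharp_s.upper()
print(uppercased)  # SS

# Contrast: é really can be stored as a base letter plus a combining mark.
e_acute = "\u00e9"
decomposed = unicodedata.normalize("NFD", e_acute)
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

So a splitter that works on composed character sequences never has to special-case ß; it is the é-style sequences that need grouping.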

  • On Tue, Feb 17, 2009 at 5:13 PM, mathew davis <compoundeye.dev...> wrote:
    > Hello,
    >
    > I need to break a string down into individual characters.
    >
    > In English that's pretty easy.
    >
    > But in some languages what a user perceives as a single block is actually a
    > base character plus  accents plus vowel markers plus tone markers plus...

    First, define what you mean by "character", because it's not at all obvious.

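    In Unicode terms, dividing a string into characters is easy. The
    characterAtIndex: call will give you that (strictly speaking it
    returns UTF-16 code units, so a character outside the Basic
    Multilingual Plane comes back as a surrogate pair). However, as
    you're discovering, individual Unicode characters are not
    particularly useful.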

    You may be after composed character sequences, as you mentioned. You
    may be after glyphs, which are individual displayable units, but note
    that glyphs can often be larger than you might expect and can depend
    on the font being used. You may be after something else altogether.
    Figure out exactly what you desire, and why you desire it, and then
    tell us and we might be able to figure out how you can get it.

    Mike
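
To make the different meanings of "character" concrete, here is a rough Python sketch (not Cocoa code) that counts the same strings three ways: UTF-16 code units, which is the unit NSString indexes by; Unicode code points; and code points that are not combining marks, as a crude stand-in for user-perceived units. The function names are hypothetical, and the third count is a simplification of the rules in UAX #29, not an implementation of them.

```python
import unicodedata

def utf16_code_units(text):
    """Count UTF-16 code units (two bytes each in UTF-16-LE)."""
    return len(text.encode("utf-16-le")) // 2

def code_points(text):
    """Count Unicode code points, which is what len() reports for str."""
    return len(text)

def base_characters(text):
    """Count code points whose general category is not a mark (M*).

    Assumption: every non-mark code point starts a new visible unit.
    """
    return sum(1 for ch in text
               if not unicodedata.category(ch).startswith("M"))

samples = {
    "uppercase W": "W",
    "g + combining diaeresis": "g\u0308",
    "musical G clef (outside the BMP)": "\U0001D11E",
    "Thai ko kai + sara ue": "\u0e01\u0e36",
}

for label, text in samples.items():
    print(label, utf16_code_units(text), code_points(text),
          base_characters(text))
```

The three counts agree for "W" and diverge for the other samples, which is why the question of what counts as a "character" has to be settled first.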
  • Hi Michael,

    Thanks for writing.

    I need to write text onto tiles which will be laid out horizontally.

    Ideally these tiles shouldn't be much wider than an uppercase W. The
    tiles cannot overlap.

    Ideally I'd like to break the string into the smallest possible
    segments which do not need to horizontally overlap when displayed in
    order to make sense to a reader.

    so the composed character sequence:

    g̈

    U+0067 ( g ) latin small letter g
    U+0308 ( ̈ ) combining diaeresis

    would need to be one substring because the glyphs overlap horizontally.

    The composed character sequence in Thai:

    เก

    U+0E40 ( เ ) thai character sara e
    U+0E01 ( ก ) thai character ko kai

    would ideally be broken into two substrings, but:

        กึ

    would have to be one substring.

    (my apologies to thai readers if I'm matching vowels with consonants
    which can't go together or accidentally writing obscenities)

    In Thai you can get composed character sequences that look a bit like

    เกอะ

    So to fit that on a tile not much wider than 'W'
    I'd need to make it in a really small font.
    It would be much better if I could break it into four separate pieces.

    I'm not worried about letters which normally overlap slightly due to
    kerning; what I'm worried about is being legible.

    I'm also not worried about using special ligatures, such as can be
    used for joining f and l, unless due to a special feature of the
    language it needs to be preserved to be grammatically correct.

    At the moment I'm using rangeOfComposedCharacterSequenceAtIndex:.

    And if a sequence is too wide for my tile I scale it down.

    Does that make sense?

    cheers

    Mathew
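
One possible reading of the tile rule described above, sketched in Python rather than Cocoa: start a new segment at every code point that is not a combining mark, and attach marks (general category M*) to the segment before them. Both `tile_segments` and `scale_to_fit` are hypothetical helpers, the widths would have to come from real font metrics, and the rule ignores cases such as Hangul jamo or joiner sequences, so treat it as a starting point rather than a finished answer.

```python
import unicodedata

def tile_segments(text):
    """Split text into a base character plus any marks that follow it.

    Assumption: only combining marks (category M*) need to overlap
    their base; every other code point can sit on its own tile.
    """
    segments = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and segments:
            segments[-1] += ch   # attach the mark to the previous base
        else:
            segments.append(ch)  # start a new tile
    return segments

def scale_to_fit(segment_width, tile_width):
    """Return the factor (at most 1.0) that fits a segment on a tile."""
    if segment_width <= tile_width:
        return 1.0
    return tile_width / segment_width

print(tile_segments("g\u0308"))                   # g + diaeresis stays together
print(tile_segments("\u0e40\u0e01"))              # sara e, ko kai: two tiles
print(tile_segments("\u0e01\u0e36"))              # ko kai + sara ue: one tile
print(len(tile_segments("\u0e40\u0e01\u0e2d\u0e30")))  # 4
```

With this rule the four-part Thai example splits into four tiles, and scaling down is only needed for a segment that is still too wide on its own.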