Display Unicode string one character at a time, not as simple as it seems
-
Hello,
I need to break a string down into individual characters.
In English that's pretty easy.
But in some languages what a user perceives as a single block is
actually a base character plus accents plus vowel markers plus tone
markers plus...
eg: เก
is made of
U+0E40 ( เ ) thai character sara e
U+0E01 ( ก ) thai character ko kai
To help with this NSString has the methods:
rangeOfComposedCharacterSequencesForRange:
rangeOfComposedCharacterSequenceAtIndex:
and CFString has:
CFStringGetRangeOfComposedCharactersAtIndex.
But then some languages, like German, will sometimes combine certain
blocks together,
so SS becomes ß.
the document http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
seems to have some good information about this,
so I'm not completely lost as to how to proceed. But this strikes me
as one of those problems that other people have struck many times
before,
any suggestions would be deeply appreciated.
thank you
mathew -
On Tue, Feb 17, 2009 at 2:13 PM, mathew davis <compoundeye.dev...> wrote:
> Hello,
>
> I need to break a string down into individual characters.
>
> In English that's pretty easy.
>
> But in some languages what a user perceives as a single block is actually a
> base character plus accents plus vowel markers plus tone markers plus...
>
>
> eg: เก
>
> is made of
>
> U+0E40 ( เ ) thai character sara e
> U+0E01 ( ก ) thai character ko kai
>
>
> To help with this NSString has the methods:
>
> rangeOfComposedCharacterSequencesForRange:
> rangeOfComposedCharacterSequenceAtIndex:
>
> and CFString has:
>
> CFStringGetRangeOfComposedCharactersAtIndex.
>
>
>
> but then some languages - like german, will sometimes combine certain
> blocks together
>
> so SS becomes ß
How, *exactly*, are the aforementioned methods/functions not working for you?
--
Clark S. Cox III
<clarkcox3...> -
Hi Clark,
turns out I had really misunderstood something about how some
characters such as the German ß were stored.
I thought it was much more complex than it really is.
I thought the single character ß was composed of two grapheme clusters.
Actually:
rangeOfComposedCharacterSequencesForRange:
rangeOfComposedCharacterSequenceAtIndex:
will do their job just fine
but I wouldn't have figured that out if you hadn't asked me, so thank
you.
cheers
mathew
-
On Tue, Feb 17, 2009 at 5:13 PM, mathew davis <compoundeye.dev...> wrote:
> Hello,
>
> I need to break a string down into individual characters.
>
> In English that's pretty easy.
>
> But in some languages what a user perceives as a single block is actually a
> base character plus accents plus vowel markers plus tone markers plus...
First, define what you mean by "character", because it's not at all obvious.
In Unicode terms, dividing a string into characters is easy. The
characterAtIndex: call will give you that. However as you're
discovering, individual Unicode characters are not particularly
useful.
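To make the "individual Unicode characters" level concrete, here is a
minimal sketch in Python (not Cocoa) that lists the code points in a string
with their official names. The function name describe_code_points is made up
for illustration. Note that Python iterates over code points, whereas
characterAtIndex: returns UTF-16 code units, so the two only agree for
characters inside the Basic Multilingual Plane.

```python
import unicodedata

def describe_code_points(text):
    """Return one 'U+XXXX NAME' string per Unicode code point in text."""
    return [
        # unicodedata.name() falls back to the given default for unnamed code points
        "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# A 'g' followed by a combining diaeresis is two code points,
# even though a reader perceives a single letter.
for line in describe_code_points("g\u0308"):
    print(line)
```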
You may be after composed character sequences, as you mentioned. You
may be after glyphs, which are individual displayable units, but note
that glyphs can often be larger than you might expect and can depend
on the font being used. You may be after something else altogether.
Figure out exactly what you desire, and why you desire it, and then
tell us and we might be able to figure out how you can get it.
Mike -
Hi Michael,
thanks for writing
I need to write text onto tiles which will be laid out horizontally.
Ideally these tiles shouldn't be much wider than an uppercase W. The
tiles cannot overlap.
Ideally I'd like to break the string into:
the smallest possible segments which do not need to horizontally
overlap when displayed in order to make sense to a reader.
so the composed character sequence:
g̈
U+0067 ( g ) latin small letter g
U+0308 ( ̈ ) combining diaeresis
would need to be one substring because the glyphs overlap horizontally
the composed character sequence in Thai:
เก
U+0E40 ( เ ) thai character sara e
U+0E01 ( ก ) thai character ko kai
would ideally be broken into two substrings, but:
กึ
would have to be one substring
(my apologies to thai readers if I'm matching vowels with consonants
which can't go together or accidentally writing obscenities)
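The splitting rule described by these examples can be roughly approximated
by attaching every combining mark to the character before it. The sketch
below is in Python and is only an assumption about what might work; it is
not what rangeOfComposedCharacterSequenceAtIndex: does, and the name
split_for_tiles is hypothetical. It treats any character whose Unicode
general category starts with "M" (Mn, Mc, Me) as overlapping its base, and
everything else as starting a new piece.

```python
import unicodedata

def split_for_tiles(text):
    """Split text into pieces, keeping combining marks with their base.

    Simplification: only the Unicode general category is consulted, so
    script-specific rules (conjuncts, joiners, etc.) are not handled.
    """
    segments = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and segments:
            # Mark: it is drawn over/under the previous character, so keep them together.
            segments[-1] += ch
        else:
            # Base character (or a stray leading mark): start a new piece.
            segments.append(ch)
    return segments

print(split_for_tiles("g\u0308"))       # g + combining diaeresis
print(split_for_tiles("\u0e40\u0e01"))  # sara e + ko kai
print(split_for_tiles("\u0e01\u0e36"))  # ko kai + sara ue
```

With this rule the g plus diaeresis stays together, sara e and ko kai come
apart, and ko kai plus the vowel written above it stays together. Whether
the category test is sufficient for every script is an open question.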
In Thai you can get composed character sequences that look a bit like
เà¸?à¸à¸°
So to fit that on a tile not much wider than 'W'
I'd need to make it in a really small font.
It would be much better if I could break it into four separate pieces
I'm not worried about letters which normally overlap slightly due to
kerning,
what I'm worried about is being legible
I'm also not worried about using special ligatures, such as can be
used for joining f and l, unless due to a special feature of the
language it needs to be preserved to be grammatically correct.
At the moment I'm using rangeOfComposedCharacterSequenceAtIndex.
And if a sequence is too wide for my tile I scale it down.
does that make sense?
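The scale-down step could be sketched like this, in Python for brevity. It
assumes the rendered width of a piece grows linearly with font size, which
is only approximately true, and that the width at the base size has already
been measured by whatever text system is in use. The function and parameter
names are hypothetical.

```python
def fitted_font_size(base_size, width_at_base, tile_width):
    """Return a font size at which a piece should fit on a tile.

    base_size:     the normal font size for tiles
    width_at_base: measured width of the piece at base_size
    tile_width:    maximum width available on one tile
    """
    if width_at_base <= tile_width:
        # Already fits: never scale up, so tiles keep a uniform look.
        return base_size
    # Too wide: shrink proportionally (assumes width scales linearly with size).
    return base_size * tile_width / width_at_base

print(fitted_font_size(24.0, 40.0, 20.0))
```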
cheers
Mathew


