Re: Unicode canonical decomposed form and text encoding
FROM : Aki Inoue
DATE : Wed Jan 15 01:39:06 2003

Renaud,

I think we're talking along the same lines.

\\300 is the octal escape for 0x00C0, which is "A grave".
It is usually called the precomposed form.

And "A \U0300" is the decomposed form.
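
To make the two forms concrete, here is a minimal sketch in plain C (it also compiles as Objective-C). The names `Decomposition`, `kTable` and `decompose_one` are hypothetical, and the three-entry table is only an illustration; it is not part of TEC or Foundation, and real code would need the complete Unicode decomposition data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-entry table: precomposed character -> base + combining
   mark. Real code would use the complete Unicode decomposition data. */
typedef struct {
    uint16_t precomposed;
    uint16_t base;
    uint16_t mark;
} Decomposition;

static const Decomposition kTable[] = {
    {0x00C0, 0x0041, 0x0300}, /* A WITH GRAVE  -> A + COMBINING GRAVE ACCENT */
    {0x00C1, 0x0041, 0x0301}, /* A WITH ACUTE  -> A + COMBINING ACUTE ACCENT */
    {0x0104, 0x0041, 0x0328}, /* A WITH OGONEK -> A + COMBINING OGONEK */
};

/* Writes the decomposed form of `c` into out[0..1] and returns the number of
   UTF-16 units written: 2 if `c` is in the table, otherwise 1 (unchanged). */
static size_t decompose_one(uint16_t c, uint16_t out[2]) {
    assert(out != NULL);
    for (size_t i = 0; i < sizeof kTable / sizeof kTable[0]; i++) {
        if (kTable[i].precomposed == c) {
            out[0] = kTable[i].base;
            out[1] = kTable[i].mark;
            return 2;
        }
    }
    out[0] = c;
    return 1;
}
```

Passing 0x00C0 (the precomposed form) yields the two units 0x0041 0x0300 (the decomposed form); a plain 0x0041 comes back unchanged.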

> So I used getCharacters but something isn't working still. I think I
> may have asked part of my question backwards. Boy, Unicode is not too
> simple! Perhaps with an example.

What exactly is not working?

Aki

On 2003.1.14, at 04:11  PM, Renaud Boisjoly wrote:

> Hi again
>
> So I used getCharacters but something isn't working still. I think I
> may have asked part of my question backwards. Boy, Unicode is not too
> simple! Perhaps with an example.
>
> Say the string I need to convert is "A acute". It first looks like:
> \\300
>
> But what I need is:
> A\\u0300
>
> I'm not sure yet what each is supposed to be called.
>
> I get the feeling that the routine you so kindly put together actually
> does the opposite... is this possible? If so, I tried inverting some
> of the parameters in CreateTextConverter, but it fails to convert
> anything... any clues?
>
> Thanks again to all for helping out!
>
> Renaud
>
> On Tuesday, January 14, 2003, at 05:44  PM, Aki Inoue wrote:
>

>> Renaud,
>>
>> You can use getCharacters: to bulk-get characters from NSString.
>>
>> One thing to note if you're using a stack buffer in a loop, as in
>> your original example:
>>
>> Depending on what you need from the decomposed format, you have to be
>> a little bit more careful at the end of each buffer run.
>>
>> For example, let's assume your source NSString contains the following
>> character sequence "U0104 U0300", LATIN CAPITAL LETTER A WITH OGONEK
>> and COMBINING GRAVE ACCENT: "Ą̀" (this should display correctly in
>> Mail.app).
>> When decomposed, it can be either "U0041 U0328 U0300" or "U0041
>> U0300 U0328".  Both are perfectly legal Unicode character
>> sequences, but only the former is in canonically decomposed format,
>> because canonical ordering puts the ogonek (combining class 202)
>> before the grave accent (combining class 230).
>> Back to the NSString: in general, you won't get the canonical
>> format if your working buffer ends between a base character and a
>> combining mark that follows it, since TEC cannot know the next
>> character in that case.
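
The ordering rule behind the canonically decomposed form can be sketched in plain C. This is only an illustration, not what TEC does internally: `combining_class` and `canonical_reorder` are hypothetical names, and the class table covers just the three marks mentioned in this thread (class values taken from the Unicode Character Database). Within a run of combining marks, marks are sorted by ascending combining class and never move across a base character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Canonical combining class for the few marks used here; everything else is
   treated as class 0 (a base character). Hypothetical helper: real code
   needs the full Unicode table. */
static int combining_class(uint16_t c) {
    switch (c) {
    case 0x0300: return 230; /* COMBINING GRAVE ACCENT */
    case 0x0301: return 230; /* COMBINING ACUTE ACCENT */
    case 0x0328: return 202; /* COMBINING OGONEK */
    default:     return 0;
    }
}

/* Reorders `text` in place so that, within each run of combining marks,
   combining classes are non-decreasing. This is an insertion sort that only
   swaps adjacent marks whose classes are out of order, so marks of equal
   class keep their relative order and nothing moves past a class-0 unit. */
static void canonical_reorder(uint16_t *text, size_t length) {
    assert(text != NULL || length == 0);
    for (size_t i = 1; i < length; i++) {
        size_t j = i;
        while (j > 0) {
            int prev = combining_class(text[j - 1]);
            int cur = combining_class(text[j]);
            if (cur == 0 || prev <= cur) {
                break; /* already in order, or a base character blocks */
            }
            uint16_t tmp = text[j - 1];
            text[j - 1] = text[j];
            text[j] = tmp;
            j--;
        }
    }
}
```

Applied to 0x0041 0x0300 0x0328, the sketch produces 0x0041 0x0328 0x0300, because the ogonek's class (202) is lower than the grave accent's (230).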
>>
>> So, if you want to have canonically decomposed format (not just
>> decomposed), you need to make sure your working buffer ends BEFORE a
>> base character (![[NSCharacterSet nonBaseCharacterSet]
>> characterIsMember:theChar]).  You don't have to worry about
>> surrogates since pre-Jaguar TEC doesn't recognize them.
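
The buffer-boundary rule above could be sketched in plain C roughly as follows. Both function names are hypothetical, and `is_non_base` is a deliberately crude stand-in for the nonBaseCharacterSet membership test: it only recognizes the Combining Diacritical Marks block (U+0300 to U+036F). The idea is to shorten each chunk until the unit that would start the next chunk is a base character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the nonBaseCharacterSet test: only covers the
   Combining Diacritical Marks block, U+0300..U+036F. */
static int is_non_base(uint16_t c) {
    return c >= 0x0300 && c <= 0x036F;
}

/* Returns how many units, starting at `start`, to place in the next chunk.
   The result is at most `max_len` and is shortened so that the unit right
   after the chunk is a base character (or the end of the text). If a base
   character plus its marks does not fit in `max_len` units, no safe split
   exists and the function falls back to `max_len`. */
static size_t safe_chunk_length(const uint16_t *text, size_t length,
                                size_t start, size_t max_len) {
    assert(text != NULL && start <= length && max_len > 0);
    size_t remaining = length - start;
    if (remaining <= max_len) {
        return remaining; /* the rest of the text fits in one chunk */
    }
    size_t n = max_len;
    /* text[start + n] would begin the next chunk; back up while it is a
       combining mark so the mark stays with its base character. */
    while (n > 0 && is_non_base(text[start + n])) {
        n--;
    }
    return n > 0 ? n : max_len;
}
```

A caller would loop, converting `safe_chunk_length(...)` units at a time and advancing `start` by the returned value until it reaches `length`.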
>>
>> Aki
>>
>> On 2003.1.14, at 01:08  PM, Renaud Boisjoly wrote:
>>

>>> Hi again
>>>
>>> Ok, I think it will work, but I do have a last newbie question to
>>> ask if I can...
>>>
>>> I've managed to convert from the UniChar result to an NSString, but
>>> I'm not clear on how to efficiently do the reverse. My original
>>> string is in an NSString and I guess I need to convert it to
>>> UniChar... but being pretty inexperienced, this looks like a mystery
>>> to me. Do I need to iterate through each character using
>>> characterAtIndex and add them to characters[] one by one? Should I
>>> use an NSScanner? Is there an immensely obvious way to do this that
>>> I'm just not seeing (probably)? I know it's probably something I
>>> should know, but considering I've only been programming for a year
>>> or so except for stuff like AppleScript, I miss a lot of things.
>>>
>>> My current idea is a for loop using characterAtIndex to add each
>>> character...
>>>
>>> Thanks for your time if you can afford it.
>>>
>>> Renaud
>>>
>>> On Tuesday, January 14, 2003, at 02:39  PM, Aki Inoue wrote:
>>>

>>>> #import <Foundation/Foundation.h>
>>>>
>>>> // LATIN CAPITAL LETTER A WITH GRAVE (precomposed form)
>>>> static UniChar characters[] = {0x00C0};
>>>>
>>>> #define MAX_BUFFER_LENGTH (100)
>>>>
>>>> int main (int argc, const char * argv[]) {
>>>>    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
>>>>    UnicodeToTextInfo textInfo;
>>>>    // Map default 16-bit Unicode to its canonically decomposed variant.
>>>>    UnicodeMapping mapping = {
>>>>        CreateTextEncoding(kTextEncodingUnicodeDefault,
>>>>            kTextEncodingDefaultVariant, kUnicode16BitFormat),
>>>>        CreateTextEncoding(kTextEncodingUnicodeDefault,
>>>>            kUnicodeCanonicalDecompVariant, kUnicode16BitFormat),
>>>>        kUnicodeUseLatestMapping};
>>>>    UniChar buffer[MAX_BUFFER_LENGTH];
>>>>    ByteCount inputRead, outputLen;
>>>>    OSStatus status;
>>>>
>>>>    status = CreateUnicodeToTextInfo(&mapping, &textInfo);
>>>>    if (noErr != status) {
>>>>        NSLog(@"Failed to create UnicodeToTextInfo");
>>>>        exit(1);
>>>>    }
>>>>
>>>>    // Input and output lengths are in bytes, not UniChar counts.
>>>>    status = ConvertFromUnicodeToText(textInfo, sizeof(characters),
>>>>        characters, kTECKeepInfoFixMask, 0, NULL, NULL, NULL,
>>>>        MAX_BUFFER_LENGTH * sizeof(UniChar), &inputRead, &outputLen,
>>>>        buffer);
>>>>    if (noErr != status) {
>>>>        NSLog(@"Failed to convert string");
>>>>        exit(1);
>>>>    }
>>>>
>>>>    DisposeUnicodeToTextInfo(&textInfo);
>>>>
>>>>    [pool release];
>>>>    return 0;
>>>> }

>> _______________________________________________
>> cocoa-dev mailing list | <email_removed>
>> Help/Unsubscribe/Archives:
>> http://www.lists.apple.com/mailman/listinfo/cocoa-dev
>> Do not post admin requests to the list. They will be ignored.

