Re: Unicode canonical decomposed form and text encoding
FROM : Aki Inoue
DATE : Tue Jan 14 23:44:37 2003

Renaud,

You can use getCharacters: (or getCharacters:range: for one chunk at a
time) to copy characters out of an NSString in bulk.

One thing to note if you're using a stack buffer in a loop, as in your
original example: depending on what you need from the decomposed
format, you have to be a little more careful at the end of each buffer
run.

For example, let's assume your source NSString contains the character
sequence "U00C0 U0328", LATIN CAPITAL LETTER A WITH GRAVE followed by
COMBINING OGONEK: "Ą̀" (this should display correctly in Mail.app).
When decomposed, it can be either "U0041 U0300 U0328" or "U0041 U0328
U0300".  Both are perfectly legal Unicode character sequences, but only
the latter is in canonically decomposed format, because canonical
ordering sorts combining marks by combining class and the ogonek (202)
sorts before the grave accent (230).
Back to the NSString: you won't get the canonical format if your
working buffer ends between U00C0 and U0328, since TEC cannot know
about the next character in that case and emits "U0041 U0300" before
it ever sees the ogonek.
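
To make the ordering rule concrete, here is a minimal sketch in plain C
(which also compiles as Objective-C). It is not what TEC does
internally; it only illustrates canonical ordering, where each run of
combining marks is sorted by combining class. The names
combining_class and canonical_reorder are hypothetical, and the class
table is an assumption that covers only the two marks discussed here
(COMBINING OGONEK is class 202, COMBINING GRAVE ACCENT is class 230).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UniChar; /* stand-in for the system UniChar typedef */

/* Hypothetical lookup: knows only the two marks used in this thread.
   Everything else is treated as a starter (class 0). */
int combining_class(UniChar c) {
    switch (c) {
    case 0x0328: return 202; /* COMBINING OGONEK */
    case 0x0300: return 230; /* COMBINING GRAVE ACCENT */
    default:     return 0;   /* base character */
    }
}

/* Put an already decomposed sequence into canonical order: a stable
   insertion sort that moves a mark left past marks with a higher
   combining class, and never past a starter. */
void canonical_reorder(UniChar *s, size_t n) {
    assert(s != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        UniChar c = s[i];
        int cc = combining_class(c);
        if (cc == 0) continue;          /* starters stay where they are */
        size_t j = i;
        while (j > 0 && combining_class(s[j - 1]) > cc) {
            s[j] = s[j - 1];            /* shift the higher-class mark right */
            j--;
        }
        s[j] = c;
    }
}
```

Applied to the three code units U0041 U0300 U0328, the sort swaps the
two marks, which is only possible when both marks are in the same
buffer as their base character.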

So, if you want canonically decomposed format (not just decomposed),
you need to make sure each working buffer ends BEFORE a base character,
that is, the character that would start the next buffer satisfies
![[NSCharacterSet nonBaseCharacterSet] characterIsMember:theChar].
You don't have to worry about surrogates since pre-Jaguar TEC doesn't
recognize them.
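
The boundary rule can be sketched in plain C (which also compiles as
Objective-C), assuming the text has already been copied into a UniChar
array, for instance with getCharacters:. The names chunk_length and
is_non_base are hypothetical. is_non_base is a deliberately crude
stand-in for the nonBaseCharacterSet test: it assumes that only the
Combining Diacritical Marks block (U0300 to U036F) is non-base, which
is far narrower than the real character set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UniChar; /* stand-in for the system UniChar typedef */

/* Assumption: treats only U+0300..U+036F as non-base characters. */
int is_non_base(UniChar c) {
    return c >= 0x0300 && c <= 0x036F;
}

/* Return how many UniChars to convert in the run that begins at
   `start`, using at most `max_len`, so that the NEXT run begins with
   a base character. Requires start < total and max_len > 0. */
size_t chunk_length(const UniChar *text, size_t total,
                    size_t start, size_t max_len) {
    assert(text != NULL && start < total && max_len > 0);
    size_t end = start + max_len;
    if (end >= total)
        return total - start;           /* last run: take what is left */
    /* text[end] would start the next run; back up while it is a mark. */
    while (end > start && is_non_base(text[end]))
        end--;
    if (end == start)
        return max_len;                 /* marks fill the window: split anyway */
    return end - start;
}
```

A conversion loop would call chunk_length before each conversion and
advance start by the returned value. Since decomposition can produce
more code units than it consumes, it seems prudent to keep max_len
smaller than the output buffer.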

Aki

On 2003.1.14, at 01:08  PM, Renaud Boisjoly wrote:

> Hi again
>
> Ok, I think it will work, but I do have a last newbie question to ask
> if I can...
>
> I've managed to convert from the UniChar result to an NSString, but
> I'm not clear on how to efficiently do the reverse. My original string
> is in an NSString and I guess I need to convert it to UniChar... but
> being pretty inexperienced, this looks like a mystery to me. Do I need
> to iterate through each character using characterAtIndex: and add them
> to characters[] one by one? Should I use an NSScanner? Is there an
> immensely obvious way to do this and I'm just not seeing it
> (probably)? I know it's probably something I should know, but
> considering I've only been programming for a year or so except for
> stuff like AppleScript, I miss a lot of things.
>
> My current idea is a for loop using characterAtIndex: to add each
> character...
>
> Thanks for your time if you can afford it.
>
> Renaud
>
> On Tuesday, January 14, 2003, at 02:39  PM, Aki Inoue wrote:
>

>> #import <Foundation/Foundation.h>
>>
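>> // U+00C0 LATIN CAPITAL LETTER A WITH GRAVE (value inferred from the
>> // original comment; the initializer is empty in the archived copy)
>> static UniChar characters[] = {0x00C0};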
>>
>> #define MAX_BUFFER_LENGTH (100)
>>
>> int main (int argc, const char * argv[]) {
>>    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
>>    UnicodeToTextInfo textInfo;
>>    UnicodeMapping mapping = {
>>        CreateTextEncoding(kTextEncodingUnicodeDefault,
>>            kTextEncodingDefaultVariant, kUnicode16BitFormat),
>>        CreateTextEncoding(kTextEncodingUnicodeDefault,
>>            kUnicodeCanonicalDecompVariant, kUnicode16BitFormat),
>>        kUnicodeUseLatestMapping
>>    };
>>    UniChar buffer[MAX_BUFFER_LENGTH];
>>    ByteCount inputRead, outputLen;
>>    OSStatus status;
>>
>>    status = CreateUnicodeToTextInfo(&mapping, &textInfo);
>>    if (noErr != status) {
>>        NSLog(@"Failed to create UnicodeToTextInfo");
>>        exit(1);
>>    }
>>
>>    status = ConvertFromUnicodeToText(textInfo, sizeof(characters),
>>        characters, kTECKeepInfoFixMask, 0, NULL, NULL, NULL,
>>        MAX_BUFFER_LENGTH * sizeof(UniChar), &inputRead, &outputLen,
>>        buffer);
>>    if (noErr != status) {
>>        NSLog(@"Failed to convert string");
>>        exit(1);
>>    }
>>
>>    DisposeUnicodeToTextInfo(&textInfo);
>>
>>    [pool release];
>>    return 0;
>> }
