[Q] UTF-8 stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding weirdness

  • Hello, all.

    I found some interesting and strange behaviour in
    -[NSString stringByAddingPercentEscapesUsingEncoding:].

    I got a UTF-8 string from a Final Cut Pro project file that was
    exported as XML.
    There is a video clip named "자연", which means "Nature" in Korean.
    Its pathurl is
    file://localhost/Users/young/Movies/%E1%84%8C%E1%85%A1%E1%84%8B%E1%85%A7%E1%86%AB.mov
    The "자연" part is %E1%84%8C%E1%85%A1%E1%84%8B%E1%85%A7%E1%86%AB,
    so it is a percent-escaped string.

    So I tried getting a UTF-8 version of "자연" in either of two ways:

    1. NSString *anUTF16String = [NSString stringWithString:@"자연"];
    // -UTF8String returns a C string (const char *), so wrap it back into an NSString
    NSString *anUTF8String = [NSString stringWithUTF8String:[anUTF16String UTF8String]];

    or
    2. NSString *anUTF8String = [NSString stringWithUTF8String:"자연"];

    Both returned the same data.

    And, I tried making a percent escaped string by calling :
    NSString *anUTF8PercentEscapedString = [anUTF8String
    stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];

    And tried reverting back to original string by calling :
    NSString *revertedUTF8String = [anUTF8PercentEscapedString
    stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding];

    This gave me back the same data as in 1 and 2 above.
    I then checked which bytes it contains by calling:

    3. char *revertedCStringOne = (char *)[revertedUTF8String
    cStringUsingEncoding:NSUTF8StringEncoding];

    It was : EC 9E 90 EC 97 B0

    As I mentioned above, the pathurl string in the FCP project looks
    different from result 3.
    So I tried converting the Korean part of the pathurl by calling:

    char test[] ={ 0xE1, 0x84, 0x8C, 0xE1, 0x85, 0xA1, 0xE1, 0x84, 0x8B,
    0xE1, 0x85, 0xA7, 0xE1, 0x86, 0xAB, 0};
    length = strlen( test );
    for( i = 0; i < length; i++ )
    {
    // cast so a signed char does not sign-extend (FFFFFFE1 instead of E1)
    NSLog(@"%X", (unsigned char)test[i] );
    }
    printf("\n");

    // 4. It prints the same "자연"
    NSString *questionedString = [NSString stringWithUTF8String:test];
    NSLog(@"Questioned String = %@", questionedString );

    and.. when the questionedString is converted to a percent escaped string
    by calling :
    NSString *questionedPercentEscapedString = [questionedString
    stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
    NSLog(@"%@", questionedPercentEscapedString);

    It was the same as the one in the FCP project pathurl, i.e. the one
    beginning with %E1%84%8C.

    Can anyone tell me why the two different data sources are displayed as
    the same "자연" while the bytes they contain are different?
    I would like to send an Apple event to Final Cut Pro, but I'm not
    sure whether to send the percent-escaped string from 1 or 2, or the one
    in the FCP project. (I don't know how to generate the one in the
    FCP project XML file.)

    I also tried a Java applet,
    http://www.profitcode.net/resources/tools/utf8_encoder_applet.html,
    and its result is the same as in 1 and 2 above. It is different
    from the one in the FCP project.

    I would appreciate any help.

    Thank you.

    P.S. My whole code is here, just in case.

    -----------------------------------------------------------------------------------------------------------------------------------------------

    NSString *anUTF16String = [NSString stringWithString:@"자연"];
    //NSString *anUTF16PercentEscapedString = [anUTF16String
    //stringByAddingPercentEscapesUsingEncoding:NSUTF16StringEncoding];
    char *UTF16CString = (char *)[anUTF16String
    cStringUsingEncoding:NSUTF16StringEncoding];

    // 1. Making an NSString object with a UTF8 encoding
    NSString *anUTF8String = [NSString stringWithUTF8String:"자연"];
    NSString *anUTF8PercentEscapedString = [anUTF8String
    stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
    NSString *revertedUTF8String = [anUTF8PercentEscapedString
    stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
    char *revertedCStringOne = (char *)[revertedUTF8String
    cStringUsingEncoding:NSUTF8StringEncoding];

    NSLog(@"Unicode 16 : %@", anUTF16String );
    //NSLog(@"Unicode 16 Percent Escaped : %@", anUTF16PercentEscapedString );

    NSLog(@"Unicode 8 : %@", anUTF8String );
    NSLog(@"Unicode 8 Percent Escaped : %@", anUTF8PercentEscapedString );
    NSLog(@"Reverted from Unicode 8 Percent Escaped : %@", revertedUTF8String );
    NSLog(@"bytes : %s", revertedCStringOne );

    // 2. The data : EC 9E 90 EC 97 B0
    int length = strlen( revertedCStringOne );
    int i;
    for( i = 0; i < length; i++ )
    {
    NSLog(@"%X", (unsigned char)revertedCStringOne[i] ); // cast avoids sign extension
    }
    printf("\n");

    // 3. Data from a Final Cut Pro XML project file which is the same as "자연"
    // This looks very different from what you can see in // 2.
    char test[] ={ 0xE1, 0x84, 0x8C, 0xE1, 0x85, 0xA1, 0xE1, 0x84, 0x8B,
    0xE1, 0x85, 0xA7, 0xE1, 0x86, 0xAB, 0};
    length = strlen( test );
    for( i = 0; i < length; i++ )
    {
    NSLog(@"%X", (unsigned char)test[i] ); // cast avoids sign extension
    }
    printf("\n");

    // 4. It prints the same "자연"
    NSString *questionedString = [NSString stringWithUTF8String:test];
    NSLog(@"Questioned String = %@", questionedString );

    // 5. Its percent-escaped representation is the same as that of // 3, not // 2
    NSString *questionedPercentEscapedString = [questionedString
    stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
    NSLog(@"%@", questionedPercentEscapedString);
  • On Jun 18, 2008, at 1:49 PM, JongAm Park wrote:

    > Can anyone tell me why the two different data source are displayed
    > as same "자연", while what it contains are different?

    I haven't looked into the specific character sequences in-depth, but
    I suspect the difference is in Normalization Forms.  Specifically,
    form C vs. D.

    http://unicode.org/reports/tr15/

    The idea is that the same character can be obtained from a single
    code point or by several combining code points.

    In Cocoa, see -precomposedStringWithCanonicalMapping and
    -decomposedStringWithCanonicalMapping.

    Cheers,
    Ken
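
    A minimal sketch of the form C vs. form D idea, assuming a Foundation
    environment. The helper HexOfUTF8 is a hypothetical name used only for
    this illustration, and the \u escapes spell the clip name from the
    original post in precomposed form (U+C790 U+C5F0).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: the UTF-8 bytes of a string as space-separated hex.
static NSString *HexOfUTF8(NSString *s) {
    NSData *data = [s dataUsingEncoding:NSUTF8StringEncoding];
    const unsigned char *bytes = data.bytes;
    NSMutableArray *parts = [NSMutableArray array];
    for (NSUInteger i = 0; i < data.length; i++) {
        [parts addObject:[NSString stringWithFormat:@"%02X", bytes[i]]];
    }
    return [parts componentsJoinedByString:@" "];
}

int main(void) {
    @autoreleasepool {
        // The clip name as two precomposed Hangul syllables.
        NSString *typed = @"\uC790\uC5F0";

        // Same text, normalized to form C (composed) and form D (decomposed).
        NSString *nfc = [typed precomposedStringWithCanonicalMapping];
        NSString *nfd = [typed decomposedStringWithCanonicalMapping];

        NSLog(@"NFC: %lu UTF-16 units, bytes %@",
              (unsigned long)nfc.length, HexOfUTF8(nfc));
        NSLog(@"NFD: %lu UTF-16 units, bytes %@",
              (unsigned long)nfd.length, HexOfUTF8(nfd));
    }
    return 0;
}
```

    If the suspicion about normalization is right, the NFC line should show
    the bytes EC 9E 90 EC 97 B0 from the original post, and the NFD line the
    15 bytes that appear percent-escaped in the FCP pathurl.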
  • On Jun 18, 2008, at 12:24 PM, Ken Thomases wrote:

    > On Jun 18, 2008, at 1:49 PM, JongAm Park wrote:
    >
    >> Can anyone tell me why the two different data source are displayed
    >> as same "자연", while what it contains are different?
    >
    > I haven't looked into the specific character sequences in-depth, but
    > I suspect the difference is in Normalization Forms.  Specifically,
    > form C vs. D.
    >
    > http://unicode.org/reports/tr15/
    >
    > The idea is that the same character can be obtained from a single
    > code point or by several combining code points.
    >
    > In Cocoa, see -precomposedStringWithCanonicalMapping and
    > -decomposedStringWithCanonicalMapping.

    Sure looks like it, based on the data.  EC 9E 90 is U+C790, "자"; E1
    84 8C E1 85 A1 is U+110C "ᄌ", U+1161 "ᅡ", which is the decomposed
    version of the same thing.  -[NSString fileSystemRepresentation] may
    also be of use here, given that this is really a file path -- the
    normalization form used for file names is dictated by the file system.

    --Chris Nebel
    AppleScript Engineering
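
    One possible way to produce an escaped name like the one in the FCP
    pathurl, sketched under assumptions: EscapedDecomposedName is a
    hypothetical helper, and it uses the newer
    -stringByAddingPercentEncodingWithAllowedCharacters: call (which encodes
    as UTF-8) instead of the method discussed in this thread. Whether FCP
    itself builds its pathurl this way is not known here. The second part
    prints the bytes of -fileSystemRepresentation without assuming their
    form, since that is dictated by the file system.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: force form D, then percent-escape for a URL path.
static NSString *EscapedDecomposedName(NSString *name) {
    NSString *nfd = [name decomposedStringWithCanonicalMapping];
    return [nfd stringByAddingPercentEncodingWithAllowedCharacters:
                [NSCharacterSet URLPathAllowedCharacterSet]];
}

int main(void) {
    @autoreleasepool {
        // Precomposed input, as typed or as returned by most converters.
        NSString *name = @"\uC790\uC5F0.mov";
        NSLog(@"escaped: %@", EscapedDecomposedName(name));

        // The bytes the file system layer would use for this name.
        // No particular normalization form is assumed for them.
        const char *fsr = [name fileSystemRepresentation];
        for (const char *p = fsr; *p != '\0'; p++) {
            printf("%02X ", (unsigned char)*p);
        }
        printf("\n");
    }
    return 0;
}
```

    The design choice is to normalize explicitly before escaping, so the
    result does not depend on which form the input string happened to be in.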
  • Thank you very much for the information.

    I hadn't even thought about normalization. Wow, it is quite complicated.
    I tried the 4 methods,
    -precomposedStringWith[Canonical/Compatibility]Mapping and
    -decomposedStringWith[Canonical/Compatibility]Mapping.

    The result was that [NSString UTF8String] returns the "precomposed" version,
    while the one used in FCP was "decomposed".

    Thank you again.

  • On Jun 18, 2008, at 3:47 PM, JongAm Park wrote:

    > Thank you very much for the information.

    You're welcome.

    > I even didn't think about the normalization. Wow.. it is quite
    > complicated.
    > I tried the 4 methods, -precomposedStringWith[Canonical/
    > Compatibility]Mapping and -decomposedStringWith[Canonical/
    > Compatibility]Mapping.
    >
    > The result was that [NSString UTF8String] returns "precomposed"
    > version

    That's not quite accurate.  Any given string will be in precomposed
    or decomposed form (or it might not be normalized to either form, and
    have a mix).  Whatever form that string is in, -UTF8String will
    maintain it.  So, -UTF8String doesn't necessarily return
    "precomposed" form, it just so happens that the string you got was
    already in precomposed form.

    > , while the one used in the FCP was "decomposed".

    The low-level file-system APIs on Mac OS X use what Apple calls "file-
    system representation", which is mostly decomposed (NFD) with some
    specific exceptions.  So, any time you obtain a file name from the
    file-system -- by enumerating a directory or from an NSOpenPanel, for
    example -- it's likely to be mostly decomposed.  This is true even if
    the name originally used to create the file was passed in precomposed
    form.

    If you want the string in a specific normalization form for some
    reason, you need to transform it using the above methods.  Don't rely
    on "file-system representation" being in any particular form.  You
    can compare strings without regard for normalization form using one
    of the -compare:... methods and _not_ specifying NSLiteralSearch.
    Note that isEqual: and isEqualToString: _do_ specify NSLiteralSearch
    (or the equivalent) and so can report NO for two strings which
    display identically.

    Cheers,
    Ken
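
    The comparison behaviour described above can be tried with a sketch
    like the following, assuming a Foundation environment. The variable
    names are illustrative only. Going by the description in this message,
    the literal comparisons would be expected to report NO and the
    non-literal one YES, but the code just logs whatever the runtime
    reports.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        // Two strings that display identically but differ in normalization form.
        NSString *nfc = [@"\uC790\uC5F0" precomposedStringWithCanonicalMapping];
        NSString *nfd = [nfc decomposedStringWithCanonicalMapping];

        // Literal comparison of the stored code units.
        BOOL literalEqual = [nfc isEqualToString:nfd];

        // Non-literal comparison (NSLiteralSearch not specified).
        BOOL compareEqual = ([nfc compare:nfd] == NSOrderedSame);

        // Explicitly literal comparison, for contrast.
        BOOL literalCompare =
            ([nfc compare:nfd options:NSLiteralSearch] == NSOrderedSame);

        // Normalizing both sides to one form makes a literal test meaningful.
        BOOL normalizedEqual =
            [nfc isEqualToString:[nfd precomposedStringWithCanonicalMapping]];

        NSLog(@"isEqualToString: %d", literalEqual);
        NSLog(@"compare: %d", compareEqual);
        NSLog(@"compare:options:NSLiteralSearch %d", literalCompare);
        NSLog(@"isEqualToString: after normalizing %d", normalizedEqual);
    }
    return 0;
}
```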
  • >> I even didn't think about the normalization. Wow.. it is quite
    >> complicated.
    >> I tried the 4 methods,
    >> -precomposedStringWith[Canonical/Compatibility]Mapping and
    >> -decomposedStringWith[Canonical/Compatibility]Mapping.
    >>
    >> The result was that [NSString UTF8String] returns "precomposed" version
    >
    > That's not quite accurate.  Any given string will be in precomposed or
    > decomposed form (or it might not be normalized to either form, and
    > have a mix).  Whatever form that string is in, -UTF8String will
    > maintain it.  So, -UTF8String doesn't necessarily return "precomposed"
    > form, it just so happens that the string you got was already in
    > precomposed form.
    >
    >
    You are right. It depends on the original string. NSString is quite
    smart...

    >> , while the one used in the FCP was "decomposed".
    >
    > The low-level file-system APIs on Mac OS X use what Apple calls
    > "file-system representation", which is mostly decomposed (NFD) with
    > some specific exceptions.  So, any time you obtain a file name from
    > the file-system -- by enumerating a directory or from an NSOpenPanel,
    > for example -- it's likely to be mostly decomposed.  This is true even
    > if the name originally used to create the file was passed in
    > precomposed form.
    >
    > If you want the string in a specific normalization form for some
    > reason, you need to transform it using the above methods.  Don't rely
    > on "file-system representation" being in any particular form.  You can
    > compare strings without regard for normalization form using one of the
    > -compare:... methods and _not_ specifying NSLiteralSearch.  Note that
    > isEqual: and isEqualToString: _do_ specify NSLiteralSearch (or the
    > equivalent) and so can report NO for two strings which display
    > identically.
    >
    > Cheers,
    > Ken
    >
    I tested the compare: method. It reported "Same" when a
    decomposed string was compared with a precomposed string.
    So when handling Unicode, it is safer to use compare:
    instead of isEqual:.

    ( NSString even provides comparison with localized strings. I'm
    impressed!!! )

    Thank you for the good information. Although I have used NSString, I
    didn't know what those methods really meant. But now my eyes are open!