[Q] UTF-8 stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding weirdness
-
Hello, all.
I found a very interesting and strange behaviour of
-[NSString stringByAddingPercentEscapesUsingEncoding:].
I got a UTF-8 string from a Final Cut Pro project file that was
exported as XML.
There is a video clip named "자연", which means "Nature" in Korean.
Its pathurl is
file://localhost/Users/young/Movies/%E1%84%8C%E1%85%A1%E1%84%8B%E1%85%A7%E1%86%AB.mov
The "자연" part is %E1%84%8C%E1%85%A1%E1%84%8B%E1%85%A7%E1%86%AB,
so it is a percent-escaped string.
I then tried getting a UTF-8 version of "자연" with either of:
1. NSString *anUTF16String = [NSString stringWithString:@"자연"];
const char *anUTF8CString = [anUTF16String UTF8String]; // -UTF8String returns a C string, not an NSString
or
2. NSString *anUTF8String = [NSString stringWithUTF8String:"자연"];
Both gave the same data.
Then I made a percent-escaped string by calling:
NSString *anUTF8PercentEscapedString = [anUTF8String
stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
and converted it back to the original string by calling:
NSString *revertedUTF8String = [anUTF8PercentEscapedString
stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
This gave me the same data as the original string from 1 or 2 above.
I checked the bytes it contains by calling:
3. char *revertedCStringOne = (char *)[revertedUTF8String
cStringUsingEncoding:NSUTF8StringEncoding];
The bytes were: EC 9E 90 EC 97 B0
As I mentioned above, the pathurl string in the FCP project looks different
from result 3.
So I tried converting the Korean part of the pathurl by calling:
char test[] ={ 0xE1, 0x84, 0x8C, 0xE1, 0x85, 0xA1, 0xE1, 0x84, 0x8B,
0xE1, 0x85, 0xA7, 0xE1, 0x86, 0xAB, 0};
length = strlen( test );
for( i = 0; i < length; i++ )
{
NSLog(@"%X", (unsigned char)test[i] ); // cast avoids sign extension (e.g. FFFFFFE1)
}
printf("\n");
// 4. It prints the same "자연"
NSString *questionedString = [NSString stringWithUTF8String:test];
NSLog(@"Questioned String = %@", questionedString );
When questionedString is converted to a percent-escaped string
by calling:
NSString *questionedPercentEscapedString = [questionedString
stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
NSLog(@"%@", questionedPercentEscapedString);
The result was the same as the one in the FCP project pathurl, i.e. %E1%84%8C
Can anyone tell me why the two different data sources are displayed as
the same "자연" while the bytes they contain are different?
I would like to send an Apple event to Final Cut Pro, but I'm not
sure whether I should send the percent-escaped string from 1 or 2, or the one
from the FCP project. (I don't know how to generate a string like the one in
the FCP project XML file.)
I also tried a Java applet,
http://www.profitcode.net/resources/tools/utf8_encoder_applet.html,
and its result is the same as the one from 1 or 2 above. It is different
from the one in the FCP project.
I would appreciate any help.
Thank you.
P.S. My whole code is here, just in case.
-----------------------------------------------------------------------------------------------------------------------------------------------
NSString *anUTF16String = [NSString stringWithString:@"자연"];
//NSString *anUTF16PercentEscapedString = [anUTF16String
//stringByAddingPercentEscapesUsingEncoding:NSUTF16StringEncoding];
char *UTF16CString = (char *)[anUTF16String
cStringUsingEncoding:NSUTF16StringEncoding];
// 1. Making an NSString object with a UTF8 encoding
NSString *anUTF8String = [NSString stringWithUTF8String:"자연"];
NSString *anUTF8PercentEscapedString = [anUTF8String
stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
NSString *revertedUTF8String = [anUTF8PercentEscapedString
stringByReplacingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
char *revertedCStringOne = (char *)[revertedUTF8String
cStringUsingEncoding:NSUTF8StringEncoding];
NSLog(@"Unicode 16 : %@", anUTF16String );
//NSLog(@"Unicode 16 Percent Escaped : %@", anUTF16PercentEscapedString );
NSLog(@"Unicode 8 : %@", anUTF8String );
NSLog(@"Unicode 8 Percent Escaped : %@", anUTF8PercentEscapedString );
NSLog(@"Reverted from Unicode 8 Percent Escaped : %@", revertedUTF8String );
NSLog(@"bytes : %s", revertedCStringOne );
// 2. The data : EC 9E 90 EC 97 B0
int length = strlen( revertedCStringOne );
int i;
for( i = 0; i < length; i++ )
{
NSLog(@"%X", (unsigned char)revertedCStringOne[i] ); // cast avoids sign extension
}
printf("\n");
// 3. Data from a Final Cut Pro XML project file, which also displays as "자연"
// This looks very different from what you can see from // 2.
char test[] ={ 0xE1, 0x84, 0x8C, 0xE1, 0x85, 0xA1, 0xE1, 0x84, 0x8B,
0xE1, 0x85, 0xA7, 0xE1, 0x86, 0xAB, 0};
length = strlen( test );
for( i = 0; i < length; i++ )
{
NSLog(@"%X", (unsigned char)test[i] ); // cast avoids sign extension
}
printf("\n");
// 4. It prints the same "자연"
NSString *questionedString = [NSString stringWithUTF8String:test];
NSLog(@"Questioned String = %@", questionedString );
// 5. Its percent-escaped representation is the same as that of // 3, not // 2
NSString *questionedPercentEscapedString = [questionedString
stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
NSLog(@"%@", questionedPercentEscapedString);
-
On Jun 18, 2008, at 1:49 PM, JongAm Park wrote:
> Can anyone tell me why the two different data source are displayed
> as same "자연", while what it contains are different?
I haven't looked into the specific character sequences in-depth, but
I suspect the difference is in Normalization Forms. Specifically,
form C vs. D.
http://unicode.org/reports/tr15/
The idea is that the same character can be obtained from a single
code point or by several combining code points.
In Cocoa, see -precomposedStringWithCanonicalMapping and
-decomposedStringWithCanonicalMapping.
Cheers,
Ken
-
On Jun 18, 2008, at 12:24 PM, Ken Thomases wrote:
> On Jun 18, 2008, at 1:49 PM, JongAm Park wrote:
>
>> Can anyone tell me why the two different data source are displayed
>> as same "자연", while what it contains are different?
>
> I haven't looked into the specific character sequences in-depth, but
> I suspect the difference is in Normalization Forms. Specifically,
> form C vs. D.
>
> http://unicode.org/reports/tr15/
>
> The idea is that the same character can be obtained from a single
> code point or by several combining code points.
>
> In Cocoa, see -precomposedStringWithCanonicalMapping and
> -decomposedStringWithCanonicalMapping.
Sure looks like it, based on the data. EC 9E 90 is U+C790, "자"; E1
84 8C E1 85 A1 is U+110C "ᄌ", U+1161 "ᅡ", which is the decomposed
version of the same thing. -[NSString fileSystemRepresentation] may
also be of use here, given that this is really a file path -- the
normalization form used for file names is dictated by the file system.
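The byte sequences discussed above can be reproduced with a minimal sketch like the following. It is only an illustration: HexOfUTF8Bytes and LogNormalizationForms are hypothetical names that do not appear in the thread, and the sketch assumes a compiler that supports @autoreleasepool and \u escapes in string literals.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: formats the UTF-8 bytes of a string as
// space-separated hex, e.g. "EC 9E 90".
static NSString *HexOfUTF8Bytes(NSString *string) {
    NSData *data = [string dataUsingEncoding:NSUTF8StringEncoding];
    const unsigned char *bytes = [data bytes];
    NSMutableArray *parts = [NSMutableArray array];
    for (NSUInteger i = 0; i < [data length]; i++) {
        [parts addObject:[NSString stringWithFormat:@"%02X", bytes[i]]];
    }
    return [parts componentsJoinedByString:@" "];
}

// Hypothetical demo: logs the bytes of both normalization forms of the clip name.
static void LogNormalizationForms(void) {
    @autoreleasepool {
        // U+C790 U+C5F0: the clip name as two precomposed Hangul syllables.
        NSString *precomposed = @"\uC790\uC5F0";
        // Canonical decomposition (form D) splits each syllable into jamo.
        NSString *decomposed = [precomposed decomposedStringWithCanonicalMapping];

        // Logs EC 9E 90 EC 97 B0
        NSLog(@"precomposed: %@", HexOfUTF8Bytes(precomposed));
        // Logs E1 84 8C E1 85 A1 E1 84 8B E1 85 A7 E1 86 AB
        NSLog(@"decomposed:  %@", HexOfUTF8Bytes(decomposed));

        // Canonical composition (form C) goes back the other way.
        NSString *recomposed = [decomposed precomposedStringWithCanonicalMapping];
        NSLog(@"recomposed:  %@", HexOfUTF8Bytes(recomposed));
    }
}
```

Percent-escaping either string then simply encodes whichever of these byte sequences the string already holds, which is why the two escaped forms in the original question differ.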
--Chris Nebel
AppleScript Engineering
-
Thank you very much for the information.
I didn't even think about normalization. Wow.. it is quite complicated.
I tried the 4 methods,
-precomposedStringWith[Canonical/Compatibility]Mapping and
-decomposedStringWith[Canonical/Compatibility]Mapping.
The result was that [NSString UTF8String] returned the "precomposed" version,
while the one used in FCP was "decomposed".
Thank you again.
-
On Jun 18, 2008, at 3:47 PM, JongAm Park wrote:
> Thank you very much for the information.
You're welcome.
> I even didn't think about the normalization. Wow.. it is quite
> complicated.
> I tried the 4 methods,
> -precomposedStringWith[Canonical/Compatibility]Mapping and
> -decomposedStringWith[Canonical/Compatibility]Mapping.
>
> The result was that [NSString UTF8String] returns "precomposed"
> version
That's not quite accurate. Any given string will be in precomposed
or decomposed form (or it might not be normalized to either form, and
have a mix). Whatever form that string is in, -UTF8String will
maintain it. So, -UTF8String doesn't necessarily return
"precomposed" form, it just so happens that the string you got was
already in precomposed form.
> , while the one used in the FCP was "decomposed".
The low-level file-system APIs on Mac OS X use what Apple calls
"file-system representation", which is mostly decomposed (NFD) with some
specific exceptions. So, any time you obtain a file name from the
file-system -- by enumerating a directory or from an NSOpenPanel, for
example -- it's likely to be mostly decomposed. This is true even if
the name originally used to create the file was passed in precomposed
form.
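As a rough sketch of how one might inspect this, the helper below dumps the bytes that -[NSString fileSystemRepresentation] produces for a name. HexOfFileSystemRepresentation is a hypothetical name used only for illustration, and no particular output is promised here, since the exact form is up to the system.

```objective-c
#import <Foundation/Foundation.h>
#include <string.h>

// Hypothetical helper: returns the bytes of -fileSystemRepresentation
// as hex, each byte followed by a space.
static NSString *HexOfFileSystemRepresentation(NSString *name) {
    // The returned C string is NUL-terminated, so strlen() gives its length.
    const char *fsBytes = [name fileSystemRepresentation];
    size_t length = strlen(fsBytes);
    NSMutableString *hex = [NSMutableString string];
    for (size_t i = 0; i < length; i++) {
        // Cast to unsigned char so bytes >= 0x80 are not sign-extended.
        [hex appendFormat:@"%02X ", (unsigned char)fsBytes[i]];
    }
    return hex;
}
```

Calling it with a precomposed name, for example HexOfFileSystemRepresentation(@"\uC790\uC5F0.mov"), and comparing the result with the bytes from -UTF8String shows whether the name was converted to a different normalization form along the way.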
If you want the string in a specific normalization form for some
reason, you need to transform it using the above methods. Don't rely
on "file-system representation" being in any particular form. You
can compare strings without regard for normalization form using one
of the -compare:... methods and _not_ specifying NSLiteralSearch.
Note that isEqual: and isEqualToString: _do_ specify NSLiteralSearch
(or the equivalent) and so can report NO for two strings which
display identically.
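A minimal sketch of that difference, using the clip name from this thread. SameIgnoringNormalization and LogComparisons are hypothetical names chosen for illustration, not part of Cocoa.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: YES when two strings compare as the same with
// -compare:, which does not specify NSLiteralSearch.
static BOOL SameIgnoringNormalization(NSString *a, NSString *b) {
    return [a compare:b] == NSOrderedSame;
}

// Hypothetical demo: compares the precomposed and decomposed clip names.
static void LogComparisons(void) {
    @autoreleasepool {
        NSString *precomposed = @"\uC790\uC5F0";
        NSString *decomposed = [precomposed decomposedStringWithCanonicalMapping];

        // Literal comparison: the code units differ, so this logs 0 (NO).
        NSLog(@"isEqualToString: %d", [precomposed isEqualToString:decomposed]);

        // Non-literal comparison treats the two forms as the same string.
        NSLog(@"compare: same = %d",
              SameIgnoringNormalization(precomposed, decomposed));

        // Asking for NSLiteralSearch explicitly behaves like isEqualToString:.
        NSLog(@"literal compare same = %d",
              [precomposed compare:decomposed options:NSLiteralSearch] == NSOrderedSame);
    }
}
```

An alternative with the same effect is to normalize both strings to one form first (for example with -precomposedStringWithCanonicalMapping) and then use isEqualToString:.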
Cheers,
Ken
-
>> I even didn't think about the normalization. Wow.. it is quite
>> complicated.
>> I tried the 4 methods,
>> -precomposedStringWith[Canonical/Compatibility]Mapping and
>> -decomposedStringWith[Canonical/Compatibility]Mapping.
>>
>> The result was that [NSString UTF8String] returns "precomposed" version
>
> That's not quite accurate. Any given string will be in precomposed or
> decomposed form (or it might not be normalized to either form, and
> have a mix). Whatever form that string is in, -UTF8String will
> maintain it. So, -UTF8String doesn't necessarily return "precomposed"
> form, it just so happens that the string you got was already in
> precomposed form.
>
You are right. It depends on the original string. NSString is quite
smart...
>> , while the one used in the FCP was "decomposed".
>
> The low-level file-system APIs on Mac OS X use what Apple calls
> "file-system representation", which is mostly decomposed (NFD) with
> some specific exceptions. So, any time you obtain a file name from
> the file-system -- by enumerating a directory or from an NSOpenPanel,
> for example -- it's likely to be mostly decomposed. This is true even
> if the name originally used to create the file was passed in
> precomposed form.
>
> If you want the string in a specific normalization form for some
> reason, you need to transform it using the above methods. Don't rely
> on "file-system representation" being in any particular form. You can
> compare strings without regard for normalization form using one of the
> -compare:... methods and _not_ specifying NSLiteralSearch. Note that
> isEqual: and isEqualToString: _do_ specify NSLiteralSearch (or the
> equivalent) and so can report NO for two strings which display
> identically.
>
> Cheers,
> Ken
>
I tested with the compare: method. It returned "Same" when a
decomposed string was compared with a composed string.
So, when handling Unicode text, it is safer to use the compare:
method instead of isEqual:.
( NSString even provides comparison with localized strings. I'm
impressed!!! )
Thank you for the good information. Although I have used NSString, I
didn't know what those methods really meant. But now my eyes are open!!!


