FROM : Jean-Daniel Dupas
DATE : Wed May 07 21:36:49 2008
What makes you think this method assumes an exact encoding? It is
not the same as +[NSString
stringWithContentsOfFile:encoding:error:].
The method +stringWithContentsOfFile:usedEncoding:error: returns the
sniffed encoding by reference through its second argument. At least
that's what the documentation says: “This method attempts to
determine the encoding of the file at path.”
This method was introduced in Tiger, which may be why you have never
seen it before.
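
For reference, here is a minimal sketch of how the call might be used.
The helper name ReadTextSniffingEncoding and the sample path are made
up for illustration; only +stringWithContentsOfFile:usedEncoding:error:
and +localizedNameOfStringEncoding: are Foundation API. The sketch
assumes ARC, and which encodings Foundation actually manages to detect
depends on the OS version.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): reads a text file and
// lets NSString guess the encoding. On success the guessed encoding is
// written to *outEncoding. Returns nil and fills *outError if the file
// cannot be read or no encoding could be determined.
static NSString *ReadTextSniffingEncoding(NSString *path,
                                          NSStringEncoding *outEncoding,
                                          NSError **outError)
{
    NSStringEncoding used = 0;
    NSString *text = [NSString stringWithContentsOfFile:path
                                           usedEncoding:&used
                                                  error:outError];
    if (text != nil && outEncoding != NULL) {
        *outEncoding = used;   // only meaningful when the read succeeded
    }
    return text;
}

// Example call site; the path is an assumption for illustration.
static void LogSniffedEncoding(void)
{
    NSStringEncoding enc = 0;
    NSError *err = nil;
    NSString *s = ReadTextSniffingEncoding(@"/tmp/sample.txt", &enc, &err);
    if (s != nil) {
        NSLog(@"Decoded as %@",
              [NSString localizedNameOfStringEncoding:enc]);
    } else {
        NSLog(@"Could not read file: %@", err);
    }
}
```

The encoding comes back through the pointer argument, which is the
difference from the encoding: variant, where the caller has to supply
the encoding up front.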
On 7 May 08, at 21:27, Gary L. Wade wrote:
> No, that's not the same thing. The method you suggest assumes an
> exact encoding; the sniffer functions from TextEncodingConverter
> look at the data to see if it follows the patterns appropriate for a
> suggested set of encodings and let you know which one would be the
> best match. Typically, such sniffers are best for differentiating
> DBCS-based characters where there's a sequence like you'd find in
> Shift-JIS and the like. Let me know when you find the "Cocoa" way
> to do this.
>
>> More modern and more Cocoa way? You mean something like
>> +[NSString stringWithContentsOfFile:usedEncoding:error:] ;-)
>>
>> «Discussion
>> This method attempts to determine the encoding of the file at path.»
>>
>> On 7 May 08, at 19:33, Gary L. Wade wrote:
>>
>>> If you're interested in determining the best encoding match for
>>> text, look at the TextEncodingConverter.h header, which has
>>> functions related to encoding sniffing. There may be more modern
>>> techniques available, but I had used that almost a decade ago in a
>>> formerly major web browser. It's not perfect, of course, but it
>>> might be the best solution for your problem.
>>>
>>>>
>>>> On May 6, 2008, at 9:22 PM, Jens Alfke wrote:
>>>>
>>>>>
>>>>> On 6 May '08, at 10:45 AM, Aki Inoue wrote:
>>>>>
>>>>>> Actually, I don't recommend using CP1252 as the generic fallback
>>>>>> encoding like this.
>>>>>> The encoding does have gaps, and the handling of those gaps
>>>>>> varies between conversion engines. CF/NSString treat the invalid
>>>>>> bytes strictly and return nil on encountering them.
>>>>>
>>>>> I wasn't aware it had gaps — I've never run into them. Where are
>>>>> they?
>>>>
>>>> <http://en.wikipedia.org/wiki/Windows-1252>
>>>>
>>>> 5 undefined bytes in the 0x80..0x9F range (0x81, 0x8D, 0x8F, 0x90,
>>>> 0x9D).
>>>>
>>>>>> So, our recommendation now is to try UTF-8 first; then, try some
>>>>>> other encoding deduced from the context (user's localization,
>>>>>> intended source/destination of the data, etc). If all of those
>>>>>> fail, try MacRoman as the ultimate fallback (the encoding has no
>>>>>> gaps, so it never fails).
>>>>>
>>>>> In the contexts I've been dealing with, data fetched over HTTP
>>>>> from random websites, there hasn't been anything deducible from
>>>>> the context (assuming the HTTP Content-Type already failed). In
>>>>> that situation MacRoman is not at all a good fallback as almost
>>>>> no Web content uses it; CP-1252 or ISO-Latin-1 are the most
>>>>> likely fallbacks after UTF-8.
>>>>
>>>>
>>>> I will agree with this if it's web content you're dealing with.
>>>> Although, just fall back to Windows-1252. Lots of site content was
>>>> authored with that encoding and mistakenly marked as ISO-8859-1.
>>>> But that's a topic for another forum.
>>>>
>>
>
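
The fallback order discussed further down the thread (UTF-8 first,
then an encoding deduced from context, then MacRoman as the last
resort) could be sketched roughly as follows. DecodeWithFallbacks is a
hypothetical helper, not an Apple API; it relies only on
-[NSString initWithData:encoding:] returning nil when the bytes are not
valid in the requested encoding, and it assumes ARC.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: tries each candidate encoding in order and
// returns the first successful decoding. contextEncoding is whatever
// the caller deduced from context (for web content the thread suggests
// NSWindowsCP1252StringEncoding). Returns nil only if every candidate
// rejects the bytes.
static NSString *DecodeWithFallbacks(NSData *data,
                                     NSStringEncoding contextEncoding,
                                     NSStringEncoding *outEncoding)
{
    const NSStringEncoding candidates[] = {
        NSUTF8StringEncoding,        // strict: malformed sequences give nil
        contextEncoding,             // caller-supplied guess
        NSMacOSRomanStringEncoding   // per the thread: no gaps, tried last
    };
    const size_t count = sizeof(candidates) / sizeof(candidates[0]);

    for (size_t i = 0; i < count; i++) {
        NSString *decoded = [[NSString alloc] initWithData:data
                                                  encoding:candidates[i]];
        if (decoded != nil) {
            if (outEncoding != NULL) {
                *outEncoding = candidates[i];  // report which one matched
            }
            return decoded;
        }
    }
    return nil;
}
```

Trying UTF-8 first works because text in a legacy 8-bit encoding that
contains non-ASCII bytes is rarely also valid UTF-8, so a successful
UTF-8 decode is a strong signal. The order of the remaining candidates
is a judgment call, as the disagreement in the thread shows.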