FROM : Ricky Sharp
DATE : Wed May 07 04:53:34 2008
On May 6, 2008, at 9:22 PM, Jens Alfke wrote:
>
> On 6 May '08, at 10:45 AM, Aki Inoue wrote:
>
>> Actually, I don't recommend using CP1252 as the generic fallback
>> encoding like this.
>> The encoding does have gaps, and the handling of those invalid gaps
>> varies between conversion engines. CF/NSString treat the invalid
>> bytes strictly and return nil encountering those.
>
> I wasn't aware it had gaps — I've never run into them. Where are they?
<http://en.wikipedia.org/wiki/Windows-1252>
Five byte values in the 0x80..0x9F range are unassigned: 0x81, 0x8D,
0x8F, 0x90, and 0x9D.
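As a rough illustration of where the gaps sit, here is a minimal Python sketch that probes every byte value against a codec. It uses Python's built-in `cp1252` and `mac_roman` codecs as stand-ins, not CF/NSString, so it only shows the same kind of strict behavior; the helper name `undefined_bytes` is made up for this example.

```python
def undefined_bytes(encoding):
    """Return the byte values that the named single-byte codec refuses to decode."""
    gaps = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            # A strict decoder rejects bytes with no assigned character.
            gaps.append(value)
    return gaps

cp1252_gaps = undefined_bytes("cp1252")
mac_roman_gaps = undefined_bytes("mac_roman")

print("cp1252 gaps:", [hex(b) for b in cp1252_gaps])
print("mac_roman gaps:", [hex(b) for b in mac_roman_gaps])
```

Other conversion engines may map these bytes to C1 control characters instead of failing, which is the inconsistency mentioned in the quoted text.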
>> So, our recommendation now is to try UTF-8 first; then, try some
>> other encoding deduced from the context (user's localization,
>> intended source/destination of the data, etc). If all failed,
>> should try MacRoman as the ultimate fallback (the encoding has no
>> gap so never fails).
>
> In the contexts I've been dealing with — data fetched over HTTP from
> random websites — there hasn't been anything deducible from the
> context (assuming the HTTP Content-Type already failed.) In that
> situation MacRoman is not at all a good fallback as almost no Web
> content uses it; CP-1252 or ISO-Latin-1 are the most likely
> fallbacks after UTF-8.
I agree with this if it's web content you're dealing with. In that
case, though, just fall back to Windows-1252. Lots of site content was
authored in that encoding and mistakenly marked as ISO-8859-1. But
that's a topic for another forum.
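The fallback order discussed in this thread could be sketched roughly as follows. This is Python rather than Cocoa, the function name `decode_with_fallback` is hypothetical, and putting MacRoman last is simply one way to combine the suggestions above: UTF-8 first, then Windows-1252 for web content, then an encoding with no gaps so the call always returns something.

```python
def decode_with_fallback(data):
    """Try UTF-8, then Windows-1252, then MacRoman; return (text, encoding_used)."""
    for encoding in ("utf-8", "cp1252"):
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # Strict decoding failed; move on to the next candidate.
            continue
    # Python's mac_roman codec assigns a character to every byte value,
    # so this last step is not expected to raise.
    return data.decode("mac_roman"), "mac_roman"

print(decode_with_fallback(b"caf\xc3\xa9"))   # valid UTF-8
print(decode_with_fallback(b"\x93hi\x94"))    # Windows-1252 curly quotes
print(decode_with_fallback(b"\x81"))          # a Windows-1252 gap byte
```

Note that the last-resort step guarantees a string, not a correct one: a byte that falls into a Windows-1252 gap will be shown as whatever MacRoman assigns to it.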
___________________________________________________________
Ricky A. Sharp mailto:<email_removed>
Instant Interactive(tm) http://www.instantinteractive.com







