FROM : Jens Alfke
DATE : Wed May 07 04:22:15 2008
On 6 May '08, at 10:45 AM, Aki Inoue wrote:
> Actually, I don't recommend using CP1252 as the generic fallback
> encoding like this.
> The encoding does have gaps, and the handling of those invalid gaps
> varies between conversion engines. CF/NSString treat the invalid
> bytes strictly and return nil encountering those.
I wasn't aware it had gaps — I've never run into them. Where are they?
> So, our recommendation now is to try UTF-8 first; then, try some
> other encoding deduced from the context (user's localization,
> intended source/destination of the data, etc). If all failed,
> should try MacRoman as the ultimate fallback (the encoding has no
> gap so never fails).
In the contexts I've been dealing with — data fetched over HTTP from
random websites — there hasn't been anything deducible from the
context (assuming the HTTP Content-Type already failed). In that
situation MacRoman is not at all a good fallback as almost no Web
content uses it; CP-1252 or ISO-Latin-1 are the most likely fallbacks
after UTF-8.
—Jens
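
For readers who want to see the two points from this exchange concretely, here is a minimal sketch in Python rather than Cocoa. It is an illustration under assumptions, not the CF/NSString behaviour discussed above: the helper names `undefined_bytes` and `decode_with_fallbacks` are hypothetical, and the gaps it reports are those of Python's own `cp1252` codec. As Aki notes, other conversion engines may treat those bytes differently. The fallback order follows the web-content case described in the reply: UTF-8 first, then an optional encoding deduced from context, then CP1252, with ISO-Latin-1 last because it maps every byte.

```python
from typing import List, Optional, Tuple


def undefined_bytes(encoding: str) -> List[int]:
    """Return the single-byte values that the given codec refuses to decode."""
    gaps = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            gaps.append(value)
    return gaps


def decode_with_fallbacks(
    data: bytes, context_encoding: Optional[str] = None
) -> Tuple[str, str]:
    """Try UTF-8, then a context-derived encoding, then CP1252, then Latin-1.

    Returns the decoded text and the name of the encoding that succeeded.
    """
    candidates = ["utf-8"]
    if context_encoding:
        candidates.append(context_encoding)
    # Assumption: web content, so CP1252 and Latin-1 are the last resorts.
    candidates += ["cp1252", "latin-1"]
    for name in candidates:
        try:
            return data.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # invalid bytes or unknown codec name: try the next one
    # Not reachable in practice: Latin-1 assigns a character to every byte.
    raise ValueError("no candidate encoding could decode the data")


if __name__ == "__main__":
    print([hex(b) for b in undefined_bytes("cp1252")])
    print(decode_with_fallbacks(b"caf\xe9"))
    print(decode_with_fallbacks(b"\x81abc"))
```

In Python's tables the gaps in CP1252 are five bytes in the 0x80 to 0x9F range, while `mac_roman` and `latin-1` have none. That is why a strict decoder can fail on CP1252 but never on the other two, whichever of them suits the data better.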