FROM : James Montgomerie
DATE : Fri Jun 06 20:19:50 2008
On 6 Jun 2008, at 08:03, Jens Alfke wrote:
>
> On 6 Jun '08, at 3:23 AM, Jason Stephenson wrote:
>
>> As a long time UNIX programmer, I'll suggest looking into the
>> regexp library that already comes with OS X.
>> man regcomp on the command line to find out how to use.
>
> It doesn't look as though this library is Unicode-aware. The strings
> it takes are C string (char*) with no indication of what encoding is
> used, and Unicode or UTF-8 aren't mentioned in the man page. From
> that, I'd guess that this library only works with single-byte
> encodings (like ISO-Latin-1 or CP-1252, not UTF-8 or the various non-
> Roman encodings) and that it will treat all non-ascii characters as
> being not spaces and not letters.
>
> In short, I think it only works correctly with plain ascii. IMHO
> that's much too limited for most purposes nowadays. Even if you
> don't touch user-visible text with it, it's still pretty common to
> find non-ascii characters in HTML, XML, even source code.
>
> Of the regex libraries mentioned so far, I recommend RegexKitLite.
> It's based on ICU, which is Unicode-savvy, already built into the
> OS, and used by lots of Apple apps.
You are correct, but in my casual usage, feeding UTF-8 to the POSIX
regex routines works just fine if you take into account that the
defined character classes are ASCII-aware only, and that the
results you get back are byte offsets, not character offsets - i.e.
don't convert them to NSRanges and expect them to be correct against
the NSString you got the UTF-8 from. Similar caveats apply to match
counts etc. - e.g. "..." will happily match two characters if together
they take up three bytes.
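To illustrate the byte-offset caveat, here is a minimal sketch in C. It assumes the input is valid UTF-8 and that the match starts and ends on character boundaries (which holds for an all-ASCII pattern). The helper names first_match and utf16_units_in_prefix are made up for this example; only regcomp, regexec and regfree come from <regex.h>. NSRange counts UTF-16 code units, so that is what the conversion helper counts.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: number of UTF-16 code units represented by the
   first `nbytes` bytes of the UTF-8 string `s`. Assumes valid UTF-8 and
   that `nbytes` falls on a character boundary. */
static size_t utf16_units_in_prefix(const char *s, size_t nbytes)
{
    size_t units = 0;
    for (size_t i = 0; i < nbytes; i++) {
        unsigned char c = (unsigned char)s[i];
        if ((c & 0xC0) == 0x80)
            continue;                  /* continuation byte, not a new character */
        units += (c >= 0xF0) ? 2 : 1;  /* 4-byte sequences need a surrogate pair */
    }
    return units;
}

/* Hypothetical helper: find the first match of `pattern` in `utf8`.
   On success returns 0 and stores BYTE offsets in *so and *eo.
   Returns 1 if there is no match, -1 if the pattern fails to compile. */
static int first_match(const char *pattern, const char *utf8,
                       regoff_t *so, regoff_t *eo)
{
    regex_t re;
    regmatch_t m[1];

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, utf8, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 1;

    *so = m[0].rm_so;   /* byte offset of match start */
    *eo = m[0].rm_eo;   /* byte offset just past match end */
    return 0;
}
```

For the UTF-8 string "caf\xC3\xA9 bar" ("café bar") and the pattern "bar", first_match reports byte offsets 6 and 9, because the "é" occupies two bytes. Passing those offsets through utf16_units_in_prefix gives a location of 5 and a length of 3, which is what an NSRange against the original NSString would need.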
I wouldn't want to present the regexes to the user, of course, but for
pre-defined regexes in code, it's okay (not great with those caveats
obviously, but alright).
My main complaint about it is that it's /extremely slow/ compared to
most modern regex libraries, but for casual usage, you at least don't
have to link any extra libraries to use it.
I do think that good regex additions to NSString, or an NSRegex class,
are highly overdue in Cocoa.
Jamie.

Cocoa mail archive

