FROM : Jens Alfke
DATE : Tue Mar 04 22:08:29 2008
On 4 Mar '08, at 10:19 AM, Jonathan Dann wrote:
> I'm most-likely going to have to support many text-encodings. Say
> if I'm writing a document in Japanese (Mac OS), will I have to
> convert that to UTF-8 before the methods of something like RegexKit
> would work? Any caveats you know of that I need to be aware of? I'm
> learning by doing.
It's not the encoding that's an issue, at least not at the point
you're running a regex. Presumably you had to deal with encodings just
to get the data into an NSString in the first place.
The limitation of PCRE is in its handling of character classes. IIRC,
PCRE doesn't consider any character above 0x7F to be alphanumeric, so
regex character types like "\w" won't match non-ASCII letters. Worse,
it detects word boundaries ("\b") by looking for a transition between
word and non-word characters. Here the problem isn't just that it
doesn't know about non-ASCII word characters; it's that some languages
have more complex rules for detecting word breaks. In Japanese and
Thai, for example, words are often written without spaces in between
them, and you have to use linguistic rules to determine where the
breaks go. ICU knows how to do this.
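
To make the "\w" limitation concrete, here is a minimal sketch. It uses Python's built-in re module rather than PCRE or RegexKit, and its re.ASCII flag only stands in for the ASCII-only character classes described above; it is an illustration of the effect, not a test of PCRE itself.

```python
import re

# "café naïve", written with escapes so the accented letters are unambiguous.
text = "caf\u00e9 na\u00efve"

# ASCII-only classes: "\w" stops at every accented letter,
# so each word is chopped into fragments.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware classes: accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```

With ASCII-only classes the accented letters act as separators, so any "\b" in a pattern also fires in the middle of those words.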
The problem I ran into with PCRE is that I was implementing a typical
filter field (the one in Safari RSS) that needed to match word
prefixes. So the search regex began with "\b" to match the word break.
But it didn't work correctly on most kanji text.
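
A rough sketch of that kind of prefix filter, again in Python's re module as a stand-in (the helper name matches_word_prefix and the sample strings are made up for illustration, not taken from Safari RSS):

```python
import re

def matches_word_prefix(prefix, text, flags=0):
    """Return True if `prefix` occurs right after a regex word boundary."""
    return re.search(r"\b" + re.escape(prefix), text, flags) is not None

english = "I live in Tokyo"
japanese = "東京都に住んでいます"  # "I live in Tokyo", written without spaces

# Spaces separate English words, so "\b" finds the start of "Tokyo".
print(matches_word_prefix("Tok", english))

# "住" starts a word linguistically, but it sits between other word
# characters, so "\b" sees no boundary there, even in Unicode mode.
print(matches_word_prefix("住", japanese))

# With ASCII-only classes every kanji is a non-word character, so there
# is no word/non-word transition anywhere in the string.
print(matches_word_prefix("東", japanese, re.ASCII))
```

Note that Unicode-aware character classes alone do not fix the second case; finding that break needs dictionary-based segmentation of the kind ICU provides.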
(Now, this was a few years ago. It's possible that PCRE's Unicode
support has been improved since. If this is important to you, go
check the docs.)
—Jens
Related mails

| Author | Date |
|---|---|
| Jonathan Dann | Mar 3, 16:37 |
| Mike Abdullah | Mar 3, 17:16 |
| Jonathan Dann | Mar 3, 19:12 |
| Dave Camp | Mar 3, 19:23 |
| Jonathan Dann | Mar 4, 12:25 |
| Jens Alfke | Mar 4, 18:50 |
| Jonathan Dann | Mar 4, 19:19 |
| Jens Alfke | Mar 4, 22:08 |
| glenn andreas | Mar 4, 22:33 |
| John Engelhart | Mar 5, 05:55 |
| Jens Alfke | Mar 5, 07:03 |
| John Engelhart | Mar 5, 20:48 |






Cocoa mail archive

