Re: regexkit [Using NSPredicate to parse strings]
FROM : Jens Alfke
DATE : Tue Mar 04 22:08:29 2008

On 4 Mar '08, at 10:19 AM, Jonathan Dann wrote:

> I'm most-likely going to have to support many text-encodings.  Say 
> if I'm writing a document in Japanese (Mac OS), will I have to 
> convert that to UTF-8 before the methods of something like RegexKit 
> would work?  Any caveats you know of that I need to be aware of? I'm 
> learning by doing.


It's not the encoding that's an issue, at least not at the point 
you're running a regex. Presumably you had to deal with encodings just 
to get the data into an NSString in the first place.

The limitation of PCRE is in its handling of character classes. IIRC, 
PCRE doesn't consider any character above 0x7F to be alphanumeric, so 
regex character types like "\w" won't match non-ASCII letters. Worse, 
it detects word boundaries ("\b") by looking for a transition between 
word and non-word characters. Here the problem isn't just that it 
doesn't know about non-ASCII word characters; it's that some languages 
have more complex rules for detecting word breaks. In Japanese and 
Thai, for example, words are often written without spaces in between 
them, and you have to use linguistic rules to determine where the 
breaks go. ICU knows how to do this.
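
To make the "\w" limitation concrete, here is a minimal sketch in Python 
rather than PCRE. It assumes that Python's re.ASCII flag is a fair 
stand-in for the ASCII-only character classes described above; it is 
not PCRE itself, and the sample string is made up for illustration.

```python
import re

# Hypothetical sample text: "naive cafe" with two accented letters,
# written with escapes so the code points are unambiguous.
text = "na\u00efve caf\u00e9"

# ASCII-only classes: \w stops at every letter above 0x7F, so each
# word is cut where the accented letter appears.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware classes: the accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['na', 've', 'caf']
print(unicode_words)  # the two whole words
```

The ASCII-only run splits both words at the accented letter, which is 
the same kind of failure you would see from "\w" if the engine treats 
everything above 0x7F as a non-word character.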

The problem I ran into with PCRE is that I was implementing a typical 
filter field (the one in Safari RSS) that needed to match word 
prefixes. So the search regex began with "\b" to match the word break. 
But it didn't work correctly on most kanji text.
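
A rough sketch of that kind of prefix filter, again in Python rather 
than PCRE. The function name prefix_filter and the sample strings are 
hypothetical, and re.ASCII is only assumed to approximate the old 
ASCII-only behavior; this is not the Safari RSS code.

```python
import re

def prefix_filter(query, text, flags=0):
    """Return True if `query` matches at a regex word boundary in `text`."""
    pattern = r"\b" + re.escape(query)
    return re.search(pattern, text, flags) is not None

# English: works as expected, only word prefixes match.
print(prefix_filter("str", "parse strings"))  # True
print(prefix_filter("ing", "parse strings"))  # False

# Japanese sample, "live in Tokyo", written without spaces.
japanese = "東京都に住む"

# ASCII-only classes: no kanji is a word character, so "\b" finds no
# word/non-word transition anywhere, not even at the start of the text.
print(prefix_filter("東", japanese, re.ASCII))  # False

# Unicode-aware classes: the start of the text is now a boundary...
print(prefix_filter("東", japanese))  # True

# ...but the whole string is one run of word characters, so a word that
# begins in the middle of the run is still missed.
print(prefix_filter("住", japanese))  # False
```

The last case shows why better character classes alone don't fix this: 
"\b" only sees character-class transitions, so finding the start of a 
word inside unspaced text needs a linguistic word breaker such as the 
one in ICU.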

(Now, this was a few years ago. It's possible that PCRE's Unicode 
support has been improved since. If this is important to you, go 
check the docs.)

—Jens
