Re: regexkit [Using NSPredicate to parse strings]
FROM : Jens Alfke
DATE : Wed Mar 05 07:03:09 2008

On 4 Mar '08, at 8:55 PM, John Engelhart wrote:

> It's sort of ambiguous if the /usr/lib/libicucore library is 
> 'supported' or not.  I believe the general consensus is that it's 
> not really there for public use, hence the missing headers, but it's 
> also not verboten.


Yeah, this is annoying. I don't know the reason for omitting the 
headers; Deborah Goldsmith would know (she's the ICU expert at Apple) 
but I don't know whether she reads this list.

> The ICU Regex C API (the one I need to use for RegexKit, not the C++ 
> one, which I haven't really looked at) is very multi-threading 
> unfriendly.  Basically, the 'compiled' regex, the string being 
> matched, and the current match state are all wrapped up in the same 
> opaque compiled regex pointer.


Well, I'm pretty multi-threading unfriendly myself, so that hasn't 
been a concern for me ;-)
But seriously, IIRC there is a way to cheaply clone an ICU regex 
object, so you can compile it once and peel off a new copy for every 
string you need to match. (I wrote, but never finished, a Cocoa ICU 
wrapper before I left Apple, and I think that was my solution to the 
state problem.)
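
A rough sketch of that compile-once, clone-per-string idea, written in Python rather than against ICU. The StatefulMatcher class and its methods are hypothetical stand-ins for an opaque handle that bundles the compiled pattern, the subject string and the match position; they are not the ICU API (where, as far as I know, the relevant call is uregex_clone).

```python
import re

class StatefulMatcher:
    """Hypothetical handle bundling compiled pattern + text + position."""

    def __init__(self, compiled):
        self._compiled = compiled  # expensive part, shared between clones
        self._text = ""            # per-matcher state
        self._pos = 0              # per-matcher state

    def clone(self):
        # Cheap: reuses the compiled pattern, starts with fresh state.
        return StatefulMatcher(self._compiled)

    def set_text(self, text):
        self._text = text
        self._pos = 0

    def find_next(self):
        """Return the next match as a string, or None when exhausted."""
        m = self._compiled.search(self._text, self._pos)
        if m is None:
            return None
        # Step past empty matches so the search always makes progress.
        self._pos = m.end() if m.end() > m.start() else m.end() + 1
        return m.group()

# Compile once, then peel off an independent copy per string.
master = StatefulMatcher(re.compile(r"\d+"))
a, b = master.clone(), master.clone()
a.set_text("10 20")
b.set_text("7 8 9")
```

Because each clone carries its own text and position, interleaved calls on a and b (or calls from different threads, one clone per thread) do not disturb each other.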

> RegexKit spends considerable effort in trying to get access to the 
> raw NSString buffer, to avoid unnecessary creation and destruction 
> of temporary buffers to perform a match.


This is definitely a concern. I suspect this is the major reason there 
isn't an NSRegularExpression API yet; there's been talk of enhancing 
the ICU regex API to make it more flexible in how it accepts strings; 
but IMHO waiting for this is a case of "the best being the enemy of 
the good".

> PCRE only works with UTF-8 encoded strings, while ICU only works in 
> UTF-16.  [...] most NSStrings buffers tend to be in a UTF-8 
> compatible form, allowing fast access by PCRE.  Using ICU would 
> require the creation of, and conversion to UTF-16 for most strings 
> (again, usage dependent), only to be released/freed right after use.


I looked into this once. CFStrings (and NSStrings) are stored in one 
of two formats: (1) UTF-16, or (2) the "default C encoding". The 
latter varies by what your current locale is, but it defaults to ... 
MacRoman. [Yay for OS 9 compatibility! :P] This means that strings are 
*never* stored in UTF-8 form, at least not in English-speaking 
locales. (On the other hand, CFString is fairly smart about encodings, 
so if the string is all-ASCII, it realizes that's compatible with 
UTF-8 and can return the raw buffer if you ask for UTF-8.)

In my limited experiments, most strings I looked at were being stored 
in UTF-16. But it's heavily dependent on how the strings were created 
and what characters they contain, so YMMV.
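
To make the encoding point concrete, here is a small sketch using Python's codecs as a stand-in. It says nothing about CFString's internals; it only checks when a buffer stored in some encoding happens to be byte-for-byte identical to the UTF-8 form. The helper name is made up for illustration.

```python
def shares_bytes_with_utf8(text, storage_encoding):
    """True if text encoded in storage_encoding is already valid as
    the UTF-8 encoding of the same text (byte-for-byte identical)."""
    return text.encode(storage_encoding) == text.encode("utf-8")

ascii_text = "hello"
accented = "caf\u00e9"  # "café", written with an escape

# Pure ASCII is the same bytes in MacRoman and UTF-8 ...
print(shares_bytes_with_utf8(ascii_text, "mac_roman"))   # True
# ... but one accented character breaks the equivalence:
print(accented.encode("mac_roman"))                      # b'caf\x8e'
print(accented.encode("utf-8"))                          # b'caf\xc3\xa9'
# UTF-16 storage never matches UTF-8, even for ASCII text.
print(shares_bytes_with_utf8(ascii_text, "utf-16-le"))   # False
```

So a UTF-8 engine can only borrow the raw buffer in the all-ASCII, 8-bit-storage case; everything else needs a conversion.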

> For example, Safari AdBlock (http://safariadblock.sourceforge.net/)
> uses RegexKit as its regex matching engine.  This involves a list of 
> about 500 regexes (depending on which adblock lists you've 
> subscribed to) that need to be executed for every URL.


Um, can't you merge those together into a single regex by joining them 
together with "or" operators? (That's a fairly typical trick that 
lexers use.)
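
A minimal sketch of the merge, using Python's re module rather than RegexKit or PCRE. The three patterns are hypothetical stand-ins for blocklist entries, and whether this is actually faster depends on the engine.

```python
import re

# Hypothetical filter patterns standing in for blocklist entries.
patterns = [r"/ads/", r"doubleclick\.net", r"banner\d+\.gif"]

# Wrap each pattern in a non-capturing group so its own "|" operators
# cannot leak into its neighbours, then join with alternation.
combined = re.compile("|".join(f"(?:{p})" for p in patterns))

def is_blocked(url):
    """Return True if any of the original patterns matches the URL."""
    return combined.search(url) is not None

print(is_blocked("http://example.com/ads/x.js"))      # True
print(is_blocked("http://example.com/index.html"))    # False
```

The catch is that patterns using numbered backreferences or per-pattern flags cannot be pasted together blindly, and a single combined match no longer tells you which rule fired unless you add named groups.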

> My zero-order approximation read on the ICU vs. PCRE on this issue 
> leads me to think that they are essentially equal.  However, PCRE 
> and ICU define 'word' and 'non-word' (the regex escape sequence \w 
> and \W), and consequently the '(non-)word break' (escape sequence \b 
> and \B) very differently.  Specifically, PCRE defines word and non-
> word in terms of ASCII encoding ONLY, whereas ICU does not


What you're saying is that they're essentially equal, except for non-
ASCII characters :)

ICU takes Unicode very, very seriously; that's its raison d'être. It's 
the International Components for Unicode. Regexes are just one of the 
things it does.
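
The difference is easy to see with any engine that offers both behaviours. This sketch uses Python's re module as an analog (its re.ASCII flag gives an ASCII-only \w, its default a Unicode-aware one); it is not PCRE or ICU, and their exact definitions of a word character may differ in detail.

```python
import re

text = "na\u00efve caf\u00e9"  # "naïve café", written with escapes

# ASCII-only \w: accented letters count as non-word characters,
# so words get chopped at every one of them.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware \w: accented letters are word characters.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['na', 've', 'caf']
print(unicode_words)  # ['naïve', 'café']
```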

> Translated to: A positive look-behind (the character just before 
> this point in the regex) must be a Unicode Character and a positive 
> look-ahead (the next character, without 'consuming' the input, must 
> not be a unicode character).  Definitely not as elegant, but I 
> suspect passable.


Nope. As I said, several languages (including Japanese) have word-
break rules that are more complex than this. Multiple words run 
together without any non-word characters in between. You have to use 
per-language heuristics to find the breaks. (My understanding is that 
Thai is especially nasty, practically requiring the use of a 
dictionary to tweeze apart the individual words.)
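
A quick way to see the problem, again with Python's re module standing in for a generic regex engine (not ICU or PCRE). Every character in the sentence is a word character, so a purely character-class-based \b finds no boundaries inside it.

```python
import re

# "I am a student"; a reader would segment it as 私 / は / 学生 / です.
sentence = "私は学生です"

# \b only fires where a word character meets a non-word character,
# and there is no such place inside this sentence.
tokens = re.findall(r"\b\w+\b", sentence)

print(tokens)  # ['私は学生です']
```

The whole sentence comes back as a single "word". Getting the four real words out needs language-specific segmentation, which is a different kind of machinery from a character-class test.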

And as I said, this isn't just hypothetical. It became a Priority 1, 
stop-the-presses bug for my project in 2005 as soon as the Japanese 
testers started trying out the functionality that used PCRE and 
discovered that it didn't work.

—Jens
