Skip navigation.
 
mlRe: regexkit [Using NSPredicate to parse strings]
FROM : John Engelhart
DATE : Wed Mar 05 05:55:33 2008

On Mar 4, 2008, at 12:50 PM, Jens Alfke wrote:
>
> On 4 Mar '08, at 3:25 AM, Jonathan Dann wrote:
>

>> That is a seriously good framework, and the documentation is great 
>> too.

>
> My only issue with regexkit is that it uses PCRE instead of ICU.


[disclosure: I am the author of RegexKit]

Hard to make everyone happy.  :)

> PCRE has to be compiled into the library, making it larger (whereas 
> ICU is already built into the OS.)


True, and not only compiled it, but compiled in four times for the 
four 10.5 architectures (ppc, ppc-64, i386, x86-64).  The framework 
with all four architectures ends up clocking in at ~1.4MB, whereas 
just the 32 bit architectures comes in around 680KB.  That's roughly 
371KB per architecture, of which only about 35% (~130KB) is the PCRE 
library itself believe it or not.  Everything is compiled '-Oz' (sic) 
and dead code stripped to squeeze it down as much as possible.

The latest 0.6.0 release of RegexKit put in motion a few API changes, 
one in particular that's relevant to this discussion is the "library:" 
argument to the method selectors.  This is a forward looking change to 
support more than just the PCRE library, with the obvious candidate 
being the ICU library shipped with the system.  I actually have ICU 
support hacked in to the in-development version of RegexKit, but I'm 
not particularly happy with it, so I'm not sure if I'm going to 
squeeze it in to the next release.  Some of my concerns are:

o It's sort of ambiguous if the /usr/lib/libicucore library is 
'supported' or not.  I believe the general consensus is that it's not 
really there for public use, hence the missing headers, but it's also 
not verboten.  Even light weight versions of the ICU library are 
several orders of magnitude larger than PCRE, libicucore is 6.5 
megabytes for the 4 archs, or about 1.65MB per arch- more than the 
size of RegexKit for all archs.

o The ICU Regex C API (the one I need to use for RegexKit, not the C++ 
one, which I haven't really looked at) is very multi-threading 
unfriendly.  Basically, the 'compiled' regex, the string being 
matched, and the current match state are all wrapped up in the same 
opaque compiled regex pointer.  Since RegexKit is designed to be 
extremely multi-threading friendly, this presents a bit of a problem. 
Actually, quite a bit of a non-trivial problem, at least if you want 
it done in something you'd consider 'fast'.  PCRE, in contrast, 
compiles a regex in to a stateless form, which can be used by any 
thread without any special caveats.  RegexKit keeps this compiled form 
in a special multi-threading aware/safe cache, so a regex is only 
compiled once- the first time it's used, after that all threads use 
the cached form.

o RegexKit spends considerable effort in trying to get access to the 
raw NSString buffer, to avoid unnecessary creation and destruction of 
temporary buffers to perform a match.  PCRE only works with UTF-8 
encoded strings, while ICU only works in UTF-16.  Though very heavily 
usage dependent, in practice (this coming from a north american, ASCII 
and English native speaker mind you) most NSStrings buffers tend to be 
in a UTF-8 compatible form, allowing fast access by PCRE.  Using ICU 
would require the creation of, and conversion to UTF-16 for most 
strings (again, usage dependent), only to be released/freed right 
after use.

I tend to find that people typically don't use a regex as a simple, 
one shot operation.  They tend to be used repeatedly, often, and a 
lot.  Examples are text processing and performing matches on every 
line in a file, performing several regex operations per line.  For 
example, Safari AdBlock (http://safariadblock.sourceforge.net/) uses 
RegexKit as its regex matching engine.  This involves a list of about 
500 regexes (depending on which adblock lists you've subscribed to) 
that need to be executed for every URL.  A typical web page can 
contain 50 - 100 URL's, and the worst case is 'Not matched by any of 
the regexes', requiring all of them to be evaluated.  That's 25,000 to 
50,000 regex matches per web page for the worst case.  This is why 
AdBlock in Firefox can result in a performance hit, but in my testing 
Safari AdBlock actually results in pages loading faster even with all 
the extra work.

I put quite a bit of effort in to making this kind of matching 'go 
fast', which includes sorting the regexes in the list (NSArray or 
NSSet) by the number of times each one has successfully matched, 
trying the ones that have matched the most first.  This 'auto-
optimizes' the order in which the regexes are tried.  It also keeps a 
small LRU cache of 'misses', so if a URL appears more than once but 
wasn't blocked, it can bypass the whole regex evaluation process.  On 
top of all of this, in keeping with the multi-threading friendly 
nature, it automagically parallelizes the matching process- one match 
attempt per CPU since each match is independent of all the others. 
It's also simple to tap in to this high speed matching: [@"string to 
check" isMatchedByAnyRegexInArray:arrayOfRegexes], along with a few 
other methods.

> PCRE is also, last I checked, less I18N-savvy than ICU. This has 
> given me grief in the past; I used PRCE-based regex code in a 
> project 3 years ago, and as soon as the Japanese and Korean testers 
> started working with it, they found that the app's text searching 
> didn't work correctly for them. (In a nutshell, PCRE's notion of 
> "alphabetic characters" and "word breaks" only works for Roman 
> writing systems.)


Again, I'm a native English only speaker, so I will happily defer to 
just about anyone else on these points.  My zero-order approximation 
read on the ICU vs. PCRE on this issue leads me to think that they are 
essentially equal.  However, PCRE and ICU define 'word' and 'non-
word' (the regex escape sequence \w and \W), and consequently the 
'(non-)word break' (escape sequence \b and \B) very differently. 
Specifically, PCRE defines word and non-word in terms of ASCII 
encoding ONLY, whereas ICU does not (though I can't find a handy link 
to the exact definition used by ICU, it's obviously more than just 
ASCII).  I suspect the PCRE definition is rooted in compatibility 
concerns.

However, I suspect that the functional equivalent could be constructed 
using the \p{} and \P{} (\P being the "with out" version the \p "with" 
form).  I suspect an ICU \w analog in PCRE would be \p (match a 
Unicode Letter), and \b could be simulated with something like:

(?<=\p)(?=\P)

Translated to: A positive look-behind (the character just before this 
point in the regex) must be a Unicode Character and a positive look-
ahead (the next character, without 'consuming' the input, must not be 
a unicode character).  Definitely not as elegant, but I suspect 
passable.

As an aside, RegexKit includes the PCRE build time options for UTF-8 
Unicode and Unicode properties.  Since Foundation uses UTF-16 as its 
abstract representation for all strings, and thus their NSRange 
values, and PCRE uses UTF-8 as its representation, RegexKit tries to 
do 'the right thing' and automatically hide the conversion details for 
things such as NSRange where appropriate.  As a rule of thumb, if it's 
Foundation, RegexKit takes as input and provides UTF-16 based indexes 
and NSRanges, but uses UTF-8 if matching against a raw byte buffer 
(typically only the RKRegex class does this).  So, all the NSString 
category additions use Foundations native UTF-16 representation, 
allowing seamless interoperability with other NSString methods that 
use NSRange values.

Also along the i18n train of thought, I started a big push for 
localization for RegexKit 0.6.0.  This centered around adding "error:" 
arguments to methods so they could return standard NSError objects, 
and localizing all the strings RegexKit uses.  In addition to this, I 
re-wrote the text of all the PCRE error strings so they were up to 
"Aqua UI" standards while at the same time putting them in to .strings 
localized form.  I obviously can't localize the strings, but it should 
make it easier for anyone who does have to use RegexKit and requires 
localized regex error strings.  The goal is to make it a simple matter 
of handing off the returned NSError object to whatever alert display 
mechanism you want and get a high quality error dialog.  This is a 
work in progress right now, however, as it required updating a lot of 
the API to return NSError objects (which, in hindsight, I should have 
done in the first place.  Live and learn.)

> Unfortunately I don't know of a comparable Cocoa regex library that 
> uses ICU. (NSPredicate does, but its support for regexes is very 
> limited, as already discussed in this thread.)


RegexKit will likely include this in the future, though I won't 
promise the next release.  Right now I'm not happy with the hackery 
required to wedge it in to RegexKit.  For those interested, Oniguruma 
is probably right out.  It does not seem to be terribly multi-
threading friendly, with the easiest solution to "giant lock" all 
calls to the library.  Not exactly compatible with the goals of 
RegexKit.

Related mailsAuthorDate
mlUsing NSPredicate to parse strings Jonathan Dann Mar 3, 16:37
mlRe: Using NSPredicate to parse strings Mike Abdullah Mar 3, 17:16
mlRe: Using NSPredicate to parse strings Jonathan Dann Mar 3, 19:12
mlRe: Using NSPredicate to parse strings Dave Camp Mar 3, 19:23
mlRe: Using NSPredicate to parse strings Jonathan Dann Mar 4, 12:25
mlRe: regexkit [Using NSPredicate to parse strings] Jens Alfke Mar 4, 18:50
mlRe: regexkit [Using NSPredicate to parse strings] Jonathan Dann Mar 4, 19:19
mlRe: regexkit [Using NSPredicate to parse strings] Jens Alfke Mar 4, 22:08
mlRe: regexkit [Using NSPredicate to parse strings] glenn andreas Mar 4, 22:33
mlRe: regexkit [Using NSPredicate to parse strings] John Engelhart Mar 5, 05:55
mlRe: regexkit [Using NSPredicate to parse strings] Jens Alfke Mar 5, 07:03
mlRe: regexkit [Using NSPredicate to parse strings] John Engelhart Mar 5, 20:48