FROM : Jens Alfke
DATE : Wed Mar 05 07:03:09 2008
On 4 Mar '08, at 8:55 PM, John Engelhart wrote:
> It's sort of ambiguous if the /usr/lib/libicucore library is
> 'supported' or not. I believe the general consensus is that it's
> not really there for public use, hence the missing headers, but it's
> also not verboten.
Yeah, this is annoying. I don't know the reason for omitting the
headers; Deborah Goldsmith would know (she's the ICU expert at Apple)
but I don't know whether she reads this list.
> The ICU Regex C API (the one I need to use for RegexKit, not the C++
> one, which I haven't really looked at) is very multi-threading
> unfriendly. Basically, the 'compiled' regex, the string being
> matched, and the current match state are all wrapped up in the same
> opaque compiled regex pointer.
Well, I'm pretty multi-threading unfriendly myself, so that hasn't
been a concern for me ;-)
But seriously, IIRC there is a way to cheaply clone an ICU regex
object, so you can compile it once and peel off a new copy for every
string you need to match. (I wrote, but never finished, a Cocoa ICU
wrapper before I left Apple, and I think that was my solution to the
state problem.)
> RegexKit spends considerable effort in trying to get access to the
> raw NSString buffer, to avoid unnecessary creation and destruction
> of temporary buffers to perform a match.
This is definitely a concern. I suspect this is the major reason there
isn't an NSRegularExpression API yet; there's been talk of enhancing
the ICU regex API to make it more flexible in how it accepts strings;
but IMHO waiting for this is a case of "the best being the enemy of
the good".
> PCRE only works with UTF-8 encoded strings, while ICU only works in
> UTF-16. [...] most NSStrings buffers tend to be in a UTF-8
> compatible form, allowing fast access by PCRE. Using ICU would
> require the creation of, and conversion to UTF-16 for most strings
> (again, usage dependent), only to be released/freed right after use.
I looked into this once. CFStrings (and NSStrings) are stored in one
of two formats: (1) UTF-16, or (2) the "default C encoding". The
latter varies by what your current locale is, but it defaults to ...
MacRoman. [Yay for OS 9 compatibility! :P] This means that strings are
*never* stored in UTF-8 form, at least not in English-speaking
locales. (On the other hand, CFString is fairly smart about encodings,
so if the string is all-ASCII, it realizes that's compatible with
UTF-8 and can return the raw buffer if you ask for UTF-8.)
In my limited experiments, most strings I looked at were being stored
in UTF-16. But it's heavily dependent on how the strings were created
and what characters they contain, so YMMV.
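
A minimal sketch of the encoding point above, using Python's codecs as a stand-in for CFString's internal storage (it does not touch CFString itself, and the helper name `can_reuse_as_utf8` is hypothetical). It only shows why an all-ASCII buffer can be handed out as UTF-8 unchanged while a MacRoman buffer with accented characters cannot:

```python
# Illustration only: Python codecs stand in for CFString's internal buffers.

def can_reuse_as_utf8(text: str) -> bool:
    """True if the MacRoman bytes of `text` are byte-for-byte valid as UTF-8."""
    return text.encode("mac_roman") == text.encode("utf-8")

# All-ASCII text has identical bytes in MacRoman and UTF-8.
ascii_ok = can_reuse_as_utf8("hello")

# An accented character is one byte in MacRoman but two bytes in UTF-8,
# so the raw buffer would need converting before a UTF-8 consumer could use it.
accent_ok = can_reuse_as_utf8("caf\u00e9")

print(ascii_ok, accent_ok)
```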
> For example, Safari AdBlock (http://safariadblock.sourceforge.net/)
> uses RegexKit as its regex matching engine. This involves a list of
> about 500 regexes (depending on which adblock lists you've
> subscribed to) that need to be executed for every URL.
Um, can't you merge those together into a single regex by joining them
together with "or" operators? (That's a fairly typical trick that
lexers use.)
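
A rough sketch of the merging trick, written with Python's `re` module rather than PCRE or ICU. The three patterns and the `is_blocked` helper are made up for illustration; real adblock lists are much larger, and whether one big alternation actually beats 500 separate matches depends on the engine:

```python
import re

# Hypothetical filter patterns, not taken from any real adblock list.
patterns = [r"/ads/", r"doubleclick\.net", r"banner\d+\.gif"]

# Wrap each pattern in a non-capturing group so that its own '|'
# operators cannot leak into the neighbouring alternatives.
merged = re.compile("|".join(f"(?:{p})" for p in patterns))

def is_blocked(url: str) -> bool:
    """Return True if any of the original patterns matches the URL."""
    return merged.search(url) is not None

print(is_blocked("http://example.com/ads/top.png"))
print(is_blocked("http://example.com/index.html"))
```

One caveat with this approach: patterns that use numbered capture groups or backreferences get renumbered once they are joined, so they would need rewriting first.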
> My zero-order approximation read on the ICU vs. PCRE on this issue
> leads me to think that they are essentially equal. However, PCRE
> and ICU define 'word' and 'non-word' (the regex escape sequence \w
> and \W), and consequently the '(non-)word break' (escape sequence \b
> and \B) very differently. Specifically, PCRE defines word and non-
> word in terms of ASCII encoding ONLY, whereas ICU does not
What you're saying is that they're essentially equal, except for
non-ASCII characters :)
ICU takes Unicode very, very seriously; that's its raison d'être. It's
the International Components for Unicode. Regexes are just one of the
things it does.
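
To make the difference concrete, here is a small sketch using Python's `re` module as a stand-in for both engines: the `re.ASCII` flag restricts `\w` to ASCII letters, digits and underscore (roughly the PCRE behaviour described above), while the default for `str` patterns is Unicode-aware (closer to ICU). This is an analogy, not a test of PCRE or ICU themselves:

```python
import re

text = "na\u00efve caf\u00e9"  # "naïve café"

# ASCII-only \w: accented letters count as non-word characters,
# so words are cut wherever one appears.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware \w: accented letters are word characters.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```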
> Translated to: A positive look-behind (the character just before
> this point in the regex) must be a Unicode Character and a positive
> look-ahead (the next character, without 'consuming' the input, must
> not be a unicode character). Definitely not as elegant, but I
> suspect passable.
Nope. As I said, several languages (including Japanese) have word-
break rules that are more complex than this. Multiple words run
together without any non-word characters in between. You have to use
per-language heuristics to find the breaks. (My understanding is that
Thai is especially nasty, practically requiring the use of a
dictionary to tweeze apart the individual words.)
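
A short sketch of why `\b` alone is not enough, again using Python's Unicode-aware `re` as a stand-in (not ICU or PCRE). The sample sentence and its segmentation in the comment are my own illustration. Every character in it is a "word" character, so the only boundaries a regex sees are the start and end of the string:

```python
import re

# "I am a student"; a Japanese reader would segment it as 私 / は / 学生 / です.
sentence = "\u79c1\u306f\u5b66\u751f\u3067\u3059"

# \b only fires between a word character and a non-word character,
# and there are no non-word characters between these words.
tokens = re.findall(r"\b\w+\b", sentence)

print(len(tokens), tokens)
```

The whole sentence comes back as a single token, which is the failure mode described above: finding the real breaks needs language-specific rules or a dictionary.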
And as I said, this isn't just hypothetical. It became a Priority 1,
stop-the-presses bug for my project in 2005 as soon as the Japanese
testers started trying out the functionality that used PCRE and
discovered that it didn't work.
—Jens
Related mails

| Author | Date |
|---|---|
| Jonathan Dann | Mar 3, 16:37 |
| Mike Abdullah | Mar 3, 17:16 |
| Jonathan Dann | Mar 3, 19:12 |
| Dave Camp | Mar 3, 19:23 |
| Jonathan Dann | Mar 4, 12:25 |
| Jens Alfke | Mar 4, 18:50 |
| Jonathan Dann | Mar 4, 19:19 |
| Jens Alfke | Mar 4, 22:08 |
| glenn andreas | Mar 4, 22:33 |
| John Engelhart | Mar 5, 05:55 |
| Jens Alfke | Mar 5, 07:03 |
| John Engelhart | Mar 5, 20:48 |






Cocoa mail archive

