Using NSPredicate to parse strings

  • Hi Guys,

    I'm trying to find a way to use NSPredicate to search an NSString for
    all occurrences of a string and return them to me. Ideally I need the
    returned strings' ranges too.

    Is this possible? I can get it to tell me that a regex match is found
    using a predicate with the format @"SELF MATCHES %@",myRegex and
    evaluating my plain text document string, but now I'm stuck.

    Any help would be grand, thanks.

    Jonathan Dann
  • Jonathan, you'll have much better luck with NSScanner. It's designed
    for exactly what you want.

    Mike.
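
For readers following along, here is a minimal sketch of what "all occurrences plus their ranges" looks like in code. It is written in Python rather than Cocoa, so it is neither NSScanner nor NSPredicate, and the name `find_all_ranges` is made up for illustration. It returns (location, length) pairs in the spirit of an NSRange, but counted in Python string indexes rather than NSString's UTF-16 units.

```python
import re

def find_all_ranges(text, needle):
    """Return (location, length) for every non-overlapping occurrence of needle."""
    if not needle:
        return []  # an empty needle would otherwise "match" at every position
    # re.escape makes the needle match literally instead of as a regex.
    pattern = re.compile(re.escape(needle))
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(text)]

ranges = find_all_ranges("the cat sat on the mat near the door", "the")
print(ranges)  # [(0, 3), (15, 3), (28, 3)]
```

Dropping the `re.escape` call turns the same loop into a "find every regex match and its range" routine, which is the part a plain SELF MATCHES predicate cannot provide.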

  • On 3 Mar 2008, at 16:16, Mike Abdullah <cocoadev...>
    wrote:

    > Jonathan, you'll have much better luck with NSScanner. It's designed
    > for exactly what you want.
    >
    > Mike.

    Thanks Mike, just tried it and it works quite well.  Any way of using
    NSScanner directly with regex? Not sure if it's really necessary for my
    code but would probably be very useful, alternatively I'll just use
    NSPredicate to avoid scanning for stuff that isn't there.

    Much appreciated,

    Jon
  • On Mar 3, 2008, at 10:12 AM, Jonathan Dann wrote:

    > On 3 Mar 2008, at 16:16, Mike Abdullah <cocoadev...>
    > wrote:
    >
    >> Jonathan, you'll have much better luck with NSScanner. It's
    >> designed for exactly what you want.
    >>
    >> Mike.
    >>
    >>
    > Thanks Mike, just tried it and it works quite well.  Any way of
    > using NSScanner directly with regex? Not sure if it's really
    > necessary for my code but would probably be very useful,
    > alternatively I'll just use NSPredicate to avoid scanning for stuff
    > that isn't there.

    I'd highly recommend <https://sourceforge.net/projects/regexkit/>

    Dave
  • On 3 Mar 2008, at 18:23, Dave Camp <dave...> wrote:

    > On Mar 3, 2008, at 10:12 AM, Jonathan Dann wrote:
    >
    >> On 3 Mar 2008, at 16:16, Mike Abdullah <cocoadev...>
    >> wrote:
    >>
    >>> Jonathan, you'll have much better luck with NSScanner. It's
    >>> designed for exactly what you want.
    >>>
    >>> Mike.
    >>>
    >>>
    >> Thanks Mike, just tried it and it works quite well.  Any way of
    >> using NSScanner directly with regex? Not sure if it's really
    >> necessary for my code but would probably be very useful,
    >> alternatively I'll just use NSPredicate to avoid scanning for stuff
    >> that isn't there.
    >
    > I'd highly recommend <https://sourceforge.net/projects/regexkit/>
    >
    > Dave
    >

    Dave, you're a legend!

    That is a seriously good framework, and the documentation is great too.

    Much appreciated,

    Jon
  • On 4 Mar '08, at 3:25 AM, Jonathan Dann wrote:

    > That is a seriously good framework, and the documentation is great
    > too.

    My only issue with regexkit is that it uses PCRE instead of ICU.

    PCRE has to be compiled into the library, making it larger (whereas
    ICU is already built into the OS.)

    PCRE is also, last I checked, less I18N-savvy than ICU. This has given
    me grief in the past; I used PCRE-based regex code in a project 3
    years ago, and as soon as the Japanese and Korean testers started
    working with it, they found that the app's text searching didn't work
    correctly for them. (In a nutshell, PCRE's notion of "alphabetic
    characters" and "word breaks" only works for Roman writing systems.)

    Unfortunately I don't know of a comparable Cocoa regex library that
    uses ICU. (NSPredicate does, but its support for regexes is very
    limited, as already discussed in this thread.)

    —Jens
  • On 4 Mar 2008, at 17:50, Jens Alfke wrote:

    >
    > On 4 Mar '08, at 3:25 AM, Jonathan Dann wrote:
    >
    >> That is a seriously good framework, and the documentation is great
    >> too.
    >
    > My only issue with regexkit is that it uses PCRE instead of ICU.
    >
    > PCRE has to be compiled into the library, making it larger (whereas
    > ICU is already built into the OS.)
    >
    > PCRE is also, last I checked, less I18N-savvy than ICU. This has
    > given me grief in the past; I used PCRE-based regex code in a
    > project 3 years ago, and as soon as the Japanese and Korean testers
    > started working with it, they found that the app's text searching
    > didn't work correctly for them. (In a nutshell, PCRE's notion of
    > "alphabetic characters" and "word breaks" only works for Roman
    > writing systems.)
    >
    > Unfortunately I don't know of a comparable Cocoa regex library that
    > uses ICU. (NSPredicate does, but its support for regexes is very
    > limited, as already discussed in this thread.)
    >
    > —Jens

    Thanks for the heads-up, Jens. That will probably be an issue for me
    in the future: I'm going to localise my app for as many languages as
    I can, so I'll have to find out how a few non-Roman-typing users will
    be using it and whether it will affect them.

    I'm most likely going to have to support many text encodings.  Say if
    I'm writing a document in Japanese (Mac OS), will I have to convert
    that to UTF-8 before the methods of something like RegexKit would
    work?  Any caveats you know of that I need to be aware of? I'm
    learning by doing.

    Thanks for taking the time to mention the PCRE thing, I appreciate it,

    Jon
  • On 4 Mar '08, at 10:19 AM, Jonathan Dann wrote:

    > I'm most likely going to have to support many text encodings.  Say
    > if I'm writing a document in Japanese (Mac OS), will I have to
    > convert that to UTF-8 before the methods of something like RegexKit
    > would work?  Any caveats you know of that I need to be aware of? I'm
    > learning by doing.

    It's not the encoding that's an issue, at least not at the point
    you're running a regex. Presumably you had to deal with encodings just
    to get the data into an NSString in the first place.

    The limitation of PCRE is in its handling of character classes. IIRC,
    PCRE doesn't consider any character above 0x7F to be alphanumeric, so
    regex character types like "\w" won't match non-ascii letters. Worse,
    it detects word boundaries ("\b") by looking for a transition between
    word and non-word characters. Here the problem isn't just that it
    doesn't know about non-ascii word characters; it's that some languages
    have more complex rules for detecting word breaks. In Japanese and
    Thai, for example, words are often written without spaces in between
    them, and you have to use linguistic rules to determine where the
    breaks go. ICU knows how to do this.

    The problem I ran into with PCRE is that I was implementing a typical
    filter field (the one in Safari RSS) that needed to match word
    prefixes. So the search regex began with "\b" to match the word break.
    But it didn't work correctly on most kanji text.

    (Now, this was a few years ago. It's possible that PCRE's Unicode
    support has been improved since. If  this is important to you, go
    check the docs.)

    —Jens
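
The character-class limitation described above can be reproduced with a small sketch. It uses Python's `re` module, which is neither PCRE nor ICU; the assumption is only that its `re.ASCII` flag (which restricts \w to [a-zA-Z0-9_]) behaves like the ASCII-only classes Jens describes, while its default mode uses Unicode properties. The sample text is invented.

```python
import re

text = "na\u00efve caf\u00e9 日本語のテキスト"

# ASCII-only classes: accented letters and kanji count as non-word characters.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)
print(ascii_words)    # ['na', 've', 'caf']

# Unicode-aware classes (Python's default for str patterns).
unicode_words = re.findall(r"\w+", text)
print(unicode_words)  # ['naïve', 'café', '日本語のテキスト']

# A word-prefix filter of the form \b + prefix, as in the filter-field example.
print(re.search(r"\b日本", text, flags=re.ASCII))  # None: no ASCII word char nearby
print(re.search(r"\b日本", text).start())          # 11: the space before it is a boundary

# Even with Unicode classes, \b sees no break inside the unspaced Japanese run,
# so a prefix search for a word in the middle of it finds nothing.
print(re.search(r"\bテキ", text))                  # None
```

The last line is the harder half of the problem: making \w Unicode-aware does not help where words are written without separators.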
  • On Mar 4, 2008, at 11:50 AM, Jens Alfke wrote:

    >
    > On 4 Mar '08, at 3:25 AM, Jonathan Dann wrote:
    >
    >> That is a seriously good framework, and the documentation is great
    >> too.
    >
    > My only issue with regexkit is that it uses PCRE instead of ICU.
    >
    > PCRE has to be compiled into the library, making it larger (whereas
    > ICU is already built into the OS.)
    >
    > PCRE is also, last I checked, less I18N-savvy than ICU. This has
    > given me grief in the past; I used PCRE-based regex code in a
    > project 3 years ago, and as soon as the Japanese and Korean testers
    > started working with it, they found that the app's text searching
    > didn't work correctly for them. (In a nutshell, PCRE's notion of
    > "alphabetic characters" and "word breaks" only works for Roman
    > writing systems.)
    >
    > Unfortunately I don't know of a comparable Cocoa regex library that
    > uses ICU. (NSPredicate does, but its support for regexes is very
    > limited, as already discussed in this thread.)

    Have you seen <http://aarone.org/cocoaicu/>?

    I ran across it a while ago, but haven't had a chance to try it out
    personally (one should read <http://aarone.org/2006/12/10/libicucore-on-mac-os-x/> first as well)

    Glenn Andreas                      <gandreas...>
      <http://www.gandreas.com/> wicked fun!
    Cardographer | the custom playing card designer
  • On Mar 4, 2008, at 12:50 PM, Jens Alfke wrote:
    >
    > On 4 Mar '08, at 3:25 AM, Jonathan Dann wrote:
    >
    >> That is a seriously good framework, and the documentation is great
    >> too.
    >
    > My only issue with regexkit is that it uses PCRE instead of ICU.

    [disclosure: I am the author of RegexKit]

    Hard to make everyone happy.  :)

    > PCRE has to be compiled into the library, making it larger (whereas
    > ICU is already built into the OS.)

    True, and not only compiled in, but compiled in four times for the
    four 10.5 architectures (ppc, ppc-64, i386, x86-64).  The framework
    with all four architectures ends up clocking in at ~1.4MB, whereas
    just the 32 bit architectures comes in around 680KB.  That's roughly
    371KB per architecture, of which only about 35% (~130KB) is the PCRE
    library itself, believe it or not.  Everything is compiled '-Oz' (sic)
    and dead code stripped to squeeze it down as much as possible.

    The latest 0.6.0 release of RegexKit put in motion a few API changes,
    one in particular that's relevant to this discussion is the "library:"
    argument to the method selectors.  This is a forward looking change to
    support more than just the PCRE library, with the obvious candidate
    being the ICU library shipped with the system.  I actually have ICU
    support hacked in to the in-development version of RegexKit, but I'm
    not particularly happy with it, so I'm not sure if I'm going to
    squeeze it in to the next release.  Some of my concerns are:

    o It's sort of ambiguous if the /usr/lib/libicucore library is
    'supported' or not.  I believe the general consensus is that it's not
    really there for public use, hence the missing headers, but it's also
    not verboten.  Even lightweight versions of the ICU library are
    roughly an order of magnitude larger than PCRE: libicucore is 6.5
    megabytes for the 4 archs, or about 1.65MB per arch, more than the
    size of RegexKit for all archs.

    o The ICU Regex C API (the one I need to use for RegexKit, not the C++
    one, which I haven't really looked at) is very multi-threading
    unfriendly.  Basically, the 'compiled' regex, the string being
    matched, and the current match state are all wrapped up in the same
    opaque compiled regex pointer.  Since RegexKit is designed to be
    extremely multi-threading friendly, this presents a bit of a problem.
    Actually, quite a bit of a non-trivial problem, at least if you want
    it done in something you'd consider 'fast'.  PCRE, in contrast,
    compiles a regex in to a stateless form, which can be used by any
    thread without any special caveats.  RegexKit keeps this compiled form
    in a special multi-threading aware/safe cache, so a regex is only
    compiled once, the first time it's used; after that, all threads use
    the cached form.

    o RegexKit spends considerable effort in trying to get access to the
    raw NSString buffer, to avoid unnecessary creation and destruction of
    temporary buffers to perform a match.  PCRE only works with UTF-8
    encoded strings, while ICU only works in UTF-16.  Though very heavily
    usage dependent, in practice (this coming from a north american, ASCII
    and English native speaker mind you) most NSString buffers tend to be
    in a UTF-8 compatible form, allowing fast access by PCRE.  Using ICU
    would require the creation of, and conversion to UTF-16 for most
    strings (again, usage dependent), only to be released/freed right
    after use.

    I tend to find that people typically don't use a regex as a simple,
    one shot operation.  They tend to be used repeatedly, often, and a
    lot.  Examples are text processing and performing matches on every
    line in a file, performing several regex operations per line.  For
    example, Safari AdBlock (http://safariadblock.sourceforge.net/) uses
    RegexKit as its regex matching engine.  This involves a list of about
    500 regexes (depending on which adblock lists you've subscribed to)
    that need to be executed for every URL.  A typical web page can
    contain 50 - 100 URLs, and the worst case is 'Not matched by any of
    the regexes', requiring all of them to be evaluated.  That's 25,000 to
    50,000 regex matches per web page for the worst case.  This is why
    AdBlock in Firefox can result in a performance hit, but in my testing
    Safari AdBlock actually results in pages loading faster even with all
    the extra work.

    I put quite a bit of effort in to making this kind of matching 'go
    fast', which includes sorting the regexes in the list (NSArray or
    NSSet) by the number of times each one has successfully matched,
    trying the ones that have matched the most first.  This 'auto-
    optimizes' the order in which the regexes are tried.  It also keeps a
    small LRU cache of 'misses', so if a URL appears more than once but
    wasn't blocked, it can bypass the whole regex evaluation process.  On
    top of all of this, in keeping with the multi-threading friendly
    nature, it automagically parallelizes the matching process: one match
    attempt per CPU since each match is independent of all the others.
    It's also simple to tap in to this high speed matching: [@"string to
    check" isMatchedByAnyRegexInArray:arrayOfRegexes], along with a few
    other methods.
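
The two optimizations described above (trying the most frequently matching regexes first, and remembering recent misses) can be sketched as follows. This is a single-threaded Python illustration, not RegexKit's implementation; the class name, method names, cache size and sample patterns are all invented, and the parallel matching step is left out.

```python
import re
from collections import OrderedDict

class AutoSortingMatcher:
    """Hypothetical sketch: hit-count ordering plus a small LRU cache of misses."""

    def __init__(self, patterns, miss_cache_size=128):
        # Each entry is [compiled_regex, hit_count].
        self._entries = [[re.compile(p), 0] for p in patterns]
        self._misses = OrderedDict()          # insertion order doubles as LRU order
        self._miss_cache_size = miss_cache_size

    def is_matched_by_any(self, s):
        if s in self._misses:
            self._misses.move_to_end(s)       # mark as most recently used
            return False                      # known miss: skip every regex
        for entry in self._entries:
            if entry[0].search(s):
                entry[1] += 1
                # Stable sort: frequent matchers move to the front, ties keep order.
                self._entries.sort(key=lambda e: e[1], reverse=True)
                return True
        self._misses[s] = True
        if len(self._misses) > self._miss_cache_size:
            self._misses.popitem(last=False)  # evict the least recently used miss
        return False

    def patterns_in_order(self):
        return [entry[0].pattern for entry in self._entries]

matcher = AutoSortingMatcher([r"/ads/", r"/banner/", r"tracker"])
print(matcher.is_matched_by_any("http://example.com/banner/1.gif"))  # True
print(matcher.patterns_in_order())  # ['/banner/', '/ads/', 'tracker']
print(matcher.is_matched_by_any("http://example.com/index.html"))    # False
```

Re-sorting on every hit keeps the sketch short; a real implementation would presumably reorder less often.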

    > PCRE is also, last I checked, less I18N-savvy than ICU. This has
    > given me grief in the past; I used PCRE-based regex code in a
    > project 3 years ago, and as soon as the Japanese and Korean testers
    > started working with it, they found that the app's text searching
    > didn't work correctly for them. (In a nutshell, PCRE's notion of
    > "alphabetic characters" and "word breaks" only works for Roman
    > writing systems.)

    Again, I'm a native English only speaker, so I will happily defer to
    just about anyone else on these points.  My zero-order approximation
    read on the ICU vs. PCRE on this issue leads me to think that they are
    essentially equal.  However, PCRE and ICU define 'word' and 'non-
    word' (the regex escape sequence \w and \W), and consequently the
    '(non-)word break' (escape sequence \b and \B) very differently.
    Specifically, PCRE defines word and non-word in terms of ASCII
    encoding ONLY, whereas ICU does not (though I can't find a handy link
    to the exact definition used by ICU, it's obviously more than just
    ASCII).  I suspect the PCRE definition is rooted in compatibility
    concerns.

    However, I suspect that the functional equivalent could be constructed
    using the \p{} and \P{} (\P being the "without" version of the \p "with"
    form).  I suspect an ICU \w analog in PCRE would be \p{L} (match a
    Unicode Letter), and \b could be simulated with something like:

    (?<=\p{L})(?=\P{L})

    Translated to: a positive look-behind (the character just before this
    point in the regex must be a Unicode letter) and a positive
    look-ahead (the next character, without 'consuming' the input, must
    not be a Unicode letter).  Definitely not as elegant, but I suspect
    passable.
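
To see what the look-around form above actually matches, here is a small Python sketch. Python's `re` has no \p{L}, so `[^\W\d_]` is used as a stand-in for "a Unicode letter" and `[\W\d_]` for "not a letter"; that substitution is an assumption of this example, not something PCRE or RegexKit does.

```python
import re

LETTER = r"[^\W\d_]"      # stand-in for \p{L}
NON_LETTER = r"[\W\d_]"   # stand-in for \P{L}

# The form quoted above: a letter behind, a non-letter character ahead.
quoted_form = re.compile("(?<=%s)(?=%s)" % (LETTER, NON_LETTER))

# A variant covering both edges of a word, including the ends of the string.
both_edges = re.compile(
    "(?<!%s)(?=%s)|(?<=%s)(?!%s)" % (LETTER, LETTER, LETTER, LETTER)
)

def positions(regex, s):
    """Offsets at which the zero-width pattern matches."""
    return [m.start() for m in regex.finditer(s)]

sample = "one two"
print(positions(quoted_form, sample))         # [3]
print(positions(both_edges, sample))          # [0, 3, 4, 7]
print(positions(re.compile(r"\b"), sample))   # [0, 3, 4, 7]
```

As written, the quoted form only finds letter-to-non-letter transitions that are followed by a character, so it misses word starts and the end of the string; the two-sided variant lines up with \b on this sample. Neither helps with text that has no separators between words.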

    As an aside, RegexKit includes the PCRE build time options for UTF-8
    Unicode and Unicode properties.  Since Foundation uses UTF-16 as its
    abstract representation for all strings, and thus their NSRange
    values, and PCRE uses UTF-8 as its representation, RegexKit tries to
    do 'the right thing' and automatically hide the conversion details for
    things such as NSRange where appropriate.  As a rule of thumb, if it's
    Foundation, RegexKit takes as input and provides UTF-16 based indexes
    and NSRanges, but uses UTF-8 if matching against a raw byte buffer
    (typically only the RKRegex class does this).  So, all the NSString
    category additions use Foundation's native UTF-16 representation,
    allowing seamless interoperability with other NSString methods that
    use NSRange values.
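
The index bookkeeping described above can be illustrated with a short Python sketch. It is not RegexKit's code: the helper names are invented, and it assumes the byte offsets fall on character boundaries. It converts a match found in a UTF-8 buffer (byte offsets) into a (location, length) pair counted in UTF-16 code units, which is what an NSRange on an NSString expects.

```python
def utf16_units(data, byte_offset):
    """Number of UTF-16 code units encoded by the first byte_offset UTF-8 bytes."""
    prefix = data[:byte_offset].decode("utf-8")
    return len(prefix.encode("utf-16-le")) // 2

def utf8_range_to_utf16(text, byte_start, byte_end):
    """Map a UTF-8 byte range to a (location, length) pair in UTF-16 units."""
    data = text.encode("utf-8")
    location = utf16_units(data, byte_start)
    return (location, utf16_units(data, byte_end) - location)

text = "caf\u00e9 \U0001F600 ok"      # one 2-byte letter and one 4-byte emoji
data = text.encode("utf-8")
byte_start = data.find(b"ok")

print(byte_start)                                              # 11
print(utf8_range_to_utf16(text, byte_start, byte_start + 2))   # (8, 2)
```

The same match sits at byte 11 in UTF-8 but at unit 8 in UTF-16, which is why handing raw byte offsets to NSString methods would point at the wrong characters.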

    Also along the i18n train of thought, I started a big push for
    localization for RegexKit 0.6.0.  This centered around adding "error:"
    arguments to methods so they could return standard NSError objects,
    and localizing all the strings RegexKit uses.  In addition to this, I
    re-wrote the text of all the PCRE error strings so they were up to
    "Aqua UI" standards while at the same time putting them in to .strings
    localized form.  I obviously can't localize the strings, but it should
    make it easier for anyone who does have to use RegexKit and requires
    localized regex error strings.  The goal is to make it a simple matter
    of handing off the returned NSError object to whatever alert display
    mechanism you want and get a high quality error dialog.  This is a
    work in progress right now, however, as it required updating a lot of
    the API to return NSError objects (which, in hindsight, I should have
    done in the first place.  Live and learn.)

    > Unfortunately I don't know of a comparable Cocoa regex library that
    > uses ICU. (NSPredicate does, but its support for regexes is very
    > limited, as already discussed in this thread.)

    RegexKit will likely include this in the future, though I won't
    promise the next release.  Right now I'm not happy with the hackery
    required to wedge it in to RegexKit.  For those interested, Oniguruma
    is probably right out.  It does not seem to be terribly
    multi-threading friendly, with the easiest solution being to "giant
    lock" all calls to the library.  Not exactly compatible with the
    goals of RegexKit.
  • On 4 Mar '08, at 8:55 PM, John Engelhart wrote:

    > It's sort of ambiguous if the /usr/lib/libicucore library is
    > 'supported' or not.  I believe the general consensus is that it's
    > not really there for public use, hence the missing headers, but it's
    > also not verboten.

    Yeah, this is annoying. I don't know the reason for omitting the
    headers; Deborah Goldsmith would know (she's the ICU expert at Apple)
    but I don't know whether she reads this list.

    > The ICU Regex C API (the one I need to use for RegexKit, not the C++
    > one, which I haven't really looked at) is very multi-threading
    > unfriendly.  Basically, the 'compiled' regex, the string being
    > matched, and the current match state are all wrapped up in the same
    > opaque compiled regex pointer.

    Well, I'm pretty multi-threading unfriendly myself, so that hasn't
    been a concern for me ;-)
    But seriously, IIRC there is a way to cheaply clone an ICU regex
    object, so you can compile it once and peel off a new copy for every
    string you need to match. (I wrote, but never finished, a Cocoa ICU
    wrapper before I left Apple, and I think that was my solution to the
    state problem.)

    > RegexKit spends considerable effort in trying to get access to the
    > raw NSString buffer, to avoid unnecessary creation and destruction
    > of temporary buffers to perform a match.

    This is definitely a concern. I suspect this is the major reason there
    isn't an NSRegularExpression API yet; there's been talk of enhancing
    the ICU regex API to make it more flexible in how it accepts strings;
    but IMHO waiting for this is a case of "the best being the enemy of
    the good".

    > PCRE only works with UTF-8 encoded strings, while ICU only works in
    > UTF-16.  [...] most NSString buffers tend to be in a UTF-8
    > compatible form, allowing fast access by PCRE.  Using ICU would
    > require the creation of, and conversion to UTF-16 for most strings
    > (again, usage dependent), only to be released/freed right after use.

    I looked into this once. CFStrings (and NSStrings) are stored in one
    of two formats: (1) UTF-16, or (2) the "default C encoding". The
    latter varies by what your current locale is, but it defaults to ...
    MacRoman. [Yay for OS 9 compatibility! :P] This means that strings are
    *never* stored in UTF-8 form, at least not in English-speaking
    locales. (On the other hand, CFString is fairly smart about encodings,
    so if the string is all-ascii, it realizes that's compatible with
    UTF-8 and can return the raw buffer if you ask for UTF-8.)

    In my limited experiments, most strings I looked at were being stored
    in UTF-16. But it's heavily dependent on how the strings were created
    and what characters they contain, so YMMV.
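
The encoding point above (all-ASCII text is byte-for-byte identical in MacRoman and UTF-8, anything else is not) can be checked with a few lines of Python. This only illustrates the encodings themselves, using Python's `mac_roman` codec; it says nothing about how CFString decides on its internal storage, and the helper name is invented.

```python
def storable_as_mac_roman(s):
    """True if every character of s has a MacRoman encoding."""
    try:
        s.encode("mac_roman")
        return True
    except UnicodeEncodeError:
        return False

plain = "hello"
accented = "caf\u00e9"
kanji = "日本語"

# All-ASCII text: the MacRoman bytes are also valid UTF-8 for the same string.
print(plain.encode("mac_roman") == plain.encode("utf-8"))   # True

# One accented letter is enough to make the two byte sequences differ.
print(accented.encode("mac_roman"))   # b'caf\x8e'
print(accented.encode("utf-8"))       # b'caf\xc3\xa9'

# Text outside MacRoman's repertoire cannot use the 8-bit form at all.
print(storable_as_mac_roman(kanji))   # False
```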

    > For example, Safari AdBlock (http://safariadblock.sourceforge.net/)
    > uses RegexKit as its regex matching engine.  This involves a list of
    > about 500 regexes (depending on which adblock lists you've
    > subscribed to) that need to be executed for every URL.

    Um, can't you merge those together into a single regex by joining them
    together with "or" operators? (That's a fairly typical trick that
    lexers use.)

    > My zero-order approximation read on the ICU vs. PCRE on this issue
    > leads me to think that they are essentially equal.  However, PCRE
    > and ICU define 'word' and 'non-word' (the regex escape sequence \w
    > and \W), and consequently the '(non-)word break' (escape sequence \b
    > and \B) very differently.  Specifically, PCRE defines word and non-
    > word in terms of ASCII encoding ONLY, whereas ICU does not

    What you're saying is that they're essentially equal, except for non-
    ascii characters :)

    ICU takes Unicode very, very seriously; that's its raison d'être. It's
    the International Components for Unicode. Regexes are just one of the
    things it does.

    > Translated to: a positive look-behind (the character just before
    > this point in the regex must be a Unicode letter) and a positive
    > look-ahead (the next character, without 'consuming' the input, must
    > not be a Unicode letter).  Definitely not as elegant, but I
    > suspect passable.

    Nope. As I said, several languages (including Japanese) have word-
    break rules that are more complex than this. Multiple words run
    together without any non-word characters in between. You have to use
    per-language heuristics to find the breaks. (My understanding is that
    Thai is especially nasty, practically requiring the use of a
    dictionary to tweeze apart the individual words.)

    And as I said, this isn't just hypothetical. It became a Priority 1,
    stop-the-presses bug for my project in 2005 as soon as the Japanese
    testers started trying out the functionality that used PCRE and
    discovered that it didn't work.

    —Jens
  • On Mar 5, 2008, at 1:03 AM, Jens Alfke wrote:
    >
    > On 4 Mar '08, at 8:55 PM, John Engelhart wrote:
    >>
    >> The ICU Regex C API (the one I need to use for RegexKit, not the C++
    >> one, which I haven't really looked at) is very multi-threading
    >> unfriendly.  Basically, the 'compiled' regex, the string being
    >> matched, and the current match state are all wrapped up in the same
    >> opaque compiled regex pointer.
    >
    > Well, I'm pretty multi-threading unfriendly myself, so that hasn't
    > been a concern for me ;-)
    > But seriously, IIRC there is a way to cheaply clone an ICU regex
    > object, so you can compile it once and peel off a new copy for every
    > string you need to match. (I wrote, but never finished, a Cocoa ICU
    > wrapper before I left Apple, and I think that was my solution to the
    > state problem.)

    Yes, the uregex_clone() function.  The way I'm probably going to
    tackle it is to create the 'master' compiled regex and then each
    thread will demand populate its thread local cache by cloning the
    master one.  Not terribly pretty.
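
The "master plus per-thread clone" plan above can be sketched in Python. Everything here is a stand-in: `StatefulMatcher` imitates a matcher that keeps its subject string and position inside the object (the property that makes sharing it between threads unsafe), `copy.copy` plays the role of uregex_clone(), and the class and method names are invented.

```python
import copy
import re
import threading

class StatefulMatcher:
    """Stand-in for a regex object that carries its own match state."""

    def __init__(self, pattern):
        self.regex = re.compile(pattern)   # the expensive part, done once
        self.subject = ""
        self.pos = 0

    def set_text(self, text):
        self.subject, self.pos = text, 0

    def find_next(self):
        if self.pos > len(self.subject):
            return None
        m = self.regex.search(self.subject, self.pos)
        if m is None:
            return None
        self.pos = max(m.end(), m.start() + 1)   # always move forward
        return m.span()

class PerThreadMatchers:
    """Compile once, then lazily hand each thread its own clone."""

    def __init__(self, pattern):
        self._master = StatefulMatcher(pattern)
        self._local = threading.local()

    def get(self):
        clone = getattr(self._local, "clone", None)
        if clone is None:
            clone = copy.copy(self._master)      # shares the compiled regex only
            self._local.clone = clone
        return clone

pool = PerThreadMatchers(r"\d+")
matcher = pool.get()
matcher.set_text("a1 b22")
print(matcher.find_next())   # (1, 2)
print(matcher.find_next())   # (4, 6)
print(matcher.find_next())   # None
```

Each thread pays for one clone the first time it asks, and after that no locking is needed because no two threads touch the same match state.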

    >> For example, Safari AdBlock (http://safariadblock.sourceforge.net/)
    >> uses RegexKit as its regex matching engine.  This involves a list
    >> of about 500 regexes (depending on which adblock lists you've
    >> subscribed to) that need to be executed for every URL.
    >
    > Um, can't you merge those together into a single regex by joining
    > them together with "or" operators? (That's a fairly typical trick
    > that lexers use.)

    In theory, yes.  In practice, no.  It all comes down to the details of
    how the regex engine performs its matching.  For regex engines that
    use an NFA matching system, your suggestion is deathly slow.  If the
    regex engine uses a DFA matching system, then your suggestion is the
    right one, allowing execution time that depends only on the length of
    the string to match, regardless of the number of individual regexes
    you've joined together.  The older AT&T `lex` and the newer `flex` use
    a DFA matching system, but as a general rule of thumb general purpose
    regex matching engines use an NFA matching system.  It's a classic
    time/space trade-off: the 'regex program' for an NFA system is easier
    to construct and much smaller than its DFA counterpart.  Most DFA
    matchers build an NFA first, then perform the expensive NFA -> DFA
    conversion step.

    I did try what you suggested just to see how PCRE would respond
    (something like a componentsJoinedByString:@"|").  After about 10-15
    seconds of waiting, it was obvious that it was a solution that was not
    going to work.  :)
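
For reference, the joining trick under discussion looks like this in a Python sketch. The patterns and URLs are invented, and Python's `re` is itself a backtracking engine, so this shows only the construction and that both forms agree; it makes no claim about which is faster.

```python
import re

patterns = [r"/ads/", r"doubleclick\.net", r"/banner\d+\."]

# Wrap each pattern in a non-capturing group so "|" cannot change its meaning.
combined = re.compile("|".join("(?:%s)" % p for p in patterns))

def blocked_individually(url):
    """Try the regexes one at a time, stopping at the first hit."""
    return any(re.search(p, url) for p in patterns)

def blocked_combined(url):
    """Run the single joined regex once."""
    return combined.search(url) is not None

urls = [
    "http://example.com/ads/top.gif",
    "http://ad.doubleclick.net/x",
    "http://example.com/img/banner12.png",
    "http://example.com/index.html",
]
for url in urls:
    print(url, blocked_individually(url), blocked_combined(url))
```

How the joined form performs is decided by the engine's matching strategy, as the message above explains.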

    >> My zero-order approximation read on the ICU vs. PCRE on this issue
    >> leads me to think that they are essentially equal.  However, PCRE
    >> and ICU define 'word' and 'non-word' (the regex escape sequence \w
    >> and \W), and consequently the '(non-)word break' (escape sequence
    >> \b and \B) very differently.  Specifically, PCRE defines word and
    >> non-word in terms of ASCII encoding ONLY, whereas ICU does not
    >
    > What you're saying is that they're essentially equal, except for non-
    > ascii characters :)
    >
    > ICU takes Unicode very, very seriously; that's its raison d'être.
    > It's the International Components for Unicode. Regexes are just one
    > of the things it does.
    >
    >> Translated to: a positive look-behind (the character just before
    >> this point in the regex must be a Unicode letter) and a positive
    >> look-ahead (the next character, without 'consuming' the input, must
    >> not be a Unicode letter).  Definitely not as elegant, but I
    >> suspect passable.
    >
    > Nope. As I said, several languages (including Japanese) have word-
    > break rules that are more complex than this. Multiple words run
    > together without any non-word characters in between. You have to use
    > per-language heuristics to find the breaks. (My understanding is
    > that Thai is especially nasty, practically requiring the use of a
    > dictionary to tweeze apart the individual words.)

    Ah, I did some digging, and I think you're referring to the ICU
    UREGEX_UWORD regex compile time option for "enhanced Unicode word
    boundaries".  From the ICU header:

          UREGEX_UWORD            = 256  // Unicode word boundaries. If set,
                                         // \b uses the Unicode TR 29
                                         // definition of word boundaries.
                                         // Warning: Unicode word boundaries
                                         // are quite different from
                                         // traditional regular expression
                                         // word boundaries.  See
                                         // http://unicode.org/reports/tr29/#Word_Boundaries

    Even this enhanced functionality seems to be a "better, but still
    crude" approximation for true word breaking.  The ICU Regex
    documentation seems to recommend that if you want proper word
    breaking, you use the actual ICU word breaking system, and not
    regexes, precisely for the reasons you outlined above.  PCRE does not
    have this optional enhanced \b behavior, though I think the 'simpler,
    traditional' \b behavior can still be approximated with \p{L}-like
    matching, but this would obviously fall far short of what's listed in
    the Unicode TR29 recommendation.  Looking at the recommendation,
    though, I can't help but think that all that extra Unicode property
    matching would make using the enhanced \b functionality a huge speed
    penalty.  Though I can easily see how it's in the "it doesn't matter
    how slow it is if you need it" variety of trade-offs.

    Obviously not a 'simple' problem.  :)

    > And as I said, this isn't just hypothetical. It became a Priority 1,
    > stop-the-presses bug for my project in 2005 as soon as the Japanese
    > testers started trying out the functionality that used PCRE and
    > discovered that it didn't work.

    Like I said, "it doesn't matter how slow it is if you need it".  :)  I
    suspect you could probably hand hack a regex to emulate the Unicode
    TR29 behavior in PCRE, but that would be one ugly regex.