[ANN]: RegexKitLite 4.0

  • RegexKitLite 4.0 has been released.  Links:

    Download: http://downloads.sourceforge.net/regexkit/RegexKitLite-4.0.tar.bz2
    (139.1K)
    Documentation: http://regexkit.sourceforge.net/RegexKitLite/index.html
    PDF Documentation:
    http://downloads.sourceforge.net/regexkit/RegexKitLite-4.0.pdf (1.1M)

    On a personal note, please remember that RegexKitLite is open source
    software distributed under the BSD License.  This means that you are
    required to acknowledge your use of RegexKitLite in your application.

    There are an awful lot of "Top 10" applications that use RegexKitLite
    that don't acknowledge their use of RegexKitLite (or, at least, no
    acknowledgement can be easily found, even when one makes an effort to
    find it).  At this point, I really don't have a problem starting a
    "Public Shaming" list of applications and companies that are
    non-compliant with the licensing terms.  It's free.  It costs you
    nothing.  Many people have asked for an "exception" to strict
    compliance with the BSD License- to not include the complete text of
    the BSD license but just a simple, one line "acknowledgement" in their
    app, which I have granted.  I doubt many people would have a whole lot
    of sympathy for a company that has just hit the jackpot by being
    bought by a large "social network" company because of its iPhone app,
    only to have the app yanked from the App Store due to licensing
    non-compliance over free, open-source software, whose only "cost" is a
    simple acknowledgement.

    RegexKitLite 4.0 is a major release that includes new features, new
    APIs, and bug fixes.  Highlights of what's new in the 4.0 release:

    PDF Documentation

    Considerable effort was put into creating high-quality PDF
    documentation.  The style of the documentation is very similar to the
    Apple ADC-produced documentation that you are already familiar with.
    Overall, I'm very happy with the final results.

    Improved Performance

    Previously, the regular expression cache worked on a "1-way set
    associative" cache, essentially '[regexString hash] % cacheSlots'.
    RegexKitLite 4.0 uses a 4-way set associative cache with a genuine
    least recently used replacement policy.  This means that strings with
    congruent hashes mod the number of cache slots (e.g., with hashes 94
    and 122 and 7 slots, 94 % 7 == 3 && 122 % 7 == 3) are part of the same set
    and can now be placed in one of four "ways" in that set.  When a set
    is full, the way in that set that was the least recently used is the
    one that is picked to be ejected from the cache and filled with the
    new entry.  This change /dramatically/ increases the odds that a
    previously used regular expression will be found in the compiled
    regular expression cache.  The LRU algorithm chosen is also
    particularly clever- updating a "way" so that it is the most recently
    used is completely branchless and compiles to about a dozen
    instructions.  This means that on a modern out-of-order superscalar
    CPU, such as the Intel Core 2, true LRU tracking is probably "free",
    or very close to it, where "free" means there are no cycles used
    exclusively to perform LRU work.
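
    As a rough illustration of the set-associative idea described above,
    here is a minimal sketch in plain C.  It is not RegexKitLite's actual
    implementation and it does not reproduce the branchless LRU update; it
    tracks recency with a simple per-cache counter instead.  The names
    cache_t, cache_init, cache_lookup, cache_insert, SETS and WAYS are
    hypothetical, and an int stands in for the compiled regular expression.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SETS 7u   /* number of sets; 7 matches the example in the text */
#define WAYS 4u   /* 4-way set associative */

typedef struct {
    unsigned long hash;      /* key: hash of the regex string */
    int           value;     /* stand-in for the compiled regex */
    unsigned long last_used; /* larger means more recently used */
    int           valid;     /* 0 while the way is still empty */
} way_t;

typedef struct {
    way_t         ways[SETS][WAYS];
    unsigned long clock;     /* monotonically increasing use counter */
} cache_t;

void cache_init(cache_t *c) {
    assert(c != NULL);
    memset(c, 0, sizeof(*c));
}

/* Returns 1 and stores the value in *out on a hit, 0 on a miss. */
int cache_lookup(cache_t *c, unsigned long hash, int *out) {
    assert(c != NULL && out != NULL);
    way_t *set = c->ways[hash % SETS];
    for (size_t w = 0; w < WAYS; w++) {
        if (set[w].valid && set[w].hash == hash) {
            set[w].last_used = ++c->clock;   /* mark as most recently used */
            *out = set[w].value;
            return 1;
        }
    }
    return 0;
}

/* Call after a miss: fills an empty way if one exists, otherwise
 * replaces the least recently used way in the set. */
void cache_insert(cache_t *c, unsigned long hash, int value) {
    assert(c != NULL);
    way_t *set = c->ways[hash % SETS];
    size_t victim = 0;
    for (size_t w = 0; w < WAYS; w++) {
        if (!set[w].valid) { victim = w; break; }   /* free way */
        if (set[w].last_used < set[victim].last_used) { victim = w; }
    }
    set[victim] = (way_t){ hash, value, ++c->clock, 1 };
}
```

    With 7 sets, the hashes 94 and 122 from the example both land in set
    3, and they can coexist there along with two more entries before
    anything has to be ejected.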

    As to why it is important to have a cache of compiled regular
    expressions, benchmarking showed that compiling a regular expression
    took ~27,560 ns (on average, based on ~25 different regular
    expressions), whereas retrieving a cached compiled regular expression
    took ~51 ns.  The time to retrieve a cached compiled regular
    expression is constant, regardless of the regular expression's
    complexity or the time it takes to actually compile the regular
    expression.  This means that it is ~541.3 times faster to retrieve a
    cached compiled regular expression than it is to actually compile it.
    In terms of operations per second, this means it is possible to
    perform 19,680,762 cache look ups per second, compared to 36,285
    regular expression compiles per second.  Timing done on a MacBook Pro
    @ 2.66GHz Core 2 on 10.6.2.

    The ICU regular expression library also requires that the text to be
    searched be encoded in UTF-16 format.  How an NSString (or CFString)
    chooses to encode the contents of the string is an "implementation
    issue", and normally not something a programmer should worry about.
    This is one of those times where it can make a difference.  The short
    version is this: Sometimes a Cocoa / Foundation / CoreFoundation
    string object stores its contents as UTF-16, in which case RegexKitLite
    takes advantage of that fact and uses that buffer directly- an
    obvious performance win.  On the other hand, sometimes it doesn't, so
    RegexKitLite needs to convert that buffer into a UTF-16 encoded
    format.  RegexKitLite also caches these conversions, and RegexKitLite
    4.0 now uses the same LRU algorithm to manage these conversions as
    well.

    "Officially Supported" In iPhone OS >= 3.2

    The iPhone OS 3.2 SDK introduced official support for ICU-based regular
    expressions, and for linking to the libicucore ICU dynamic library for
    the purpose of using its regular expression engine.  Many developers
    were concerned as to whether or not the use of RegexKitLite
    represented a violation of the iPhone SDK Agreement prohibiting the use
    of private or undocumented APIs.  As of iPhone OS 3.2, this is no
    longer a concern.  See the RegexKitLite documentation for more
    information and links to Apple's documentation about this change.

    While iPhone OS 3.2 introduced official support for ICU based regular
    expressions, at this time it is unclear if the iPhone 4.0 SDK
    Agreement "3.3.1" clause change prohibits the use of RegexKitLite in
    iPhone OS 4.0.  Technically, RegexKitLite is a "compatibility layer"
    between iPhoneOS/Cocoa/Foundation/NSString and the (now) "Documented
    APIs" of ICU regular expressions.

    ---- begin disclaimer ----
    It is not my intention to start a religious war on the merits of the
    3.3.1 change, or whether or not the change to 3.3.1 was intended to
    cover things like RegexKitLite.

    Anyone who believes that 3.3.1 does not cover RegexKitLite is simply
    kidding themselves.  I personally have no doubt whatsoever that the
    wording of 3.3.1 covers RegexKitLite.  The wording of the clause does
    not make allowances for the fact that something like RegexKitLite is
    supplied to you in "source code format" that you compile, or that the
    source code is written in Objective-C.  The wording /is/ explicit that
    whatever is being considered need only act as a "compatibility layer"
    between the "documented APIs", which is exactly what RegexKitLite
    does. QED.  Note that this is a different interpretation of 3.3.1 than
    that taken by many other open source projects, which take the position
    that because the source code, written in one of the approved
    languages, is supplied, the 3.3.1 clause does not apply, or whose
    definition of "compatibility layer" depends on some project-biased
    interpretation.

    My intent is to be brutally honest with those developers who need to
    decide whether or not the risk of using RegexKitLite and the potential
    consequences of being grounds for rejection outweigh the gains and
    cost of having to rewrite / re-architect their application without
    RegexKitLite.  This is done under the "hope for the best, plan for the
    worst" philosophy- the worst case scenario is 3.3.1 covers
    RegexKitLite, and you should plan accordingly.  Caveat Emptor.

    Currently I have no hard, empirical evidence as to whether or not
    3.3.1 applies to RegexKitLite to help you, the developer, make a
    decision one way or the other.  I would appreciate reports on whether
    or not an application that uses RegexKitLite was accepted or rejected
    for any version of iPhone OS.  This information will be incorporated
    into the documentation of later versions to help guide the decisions
    of future developers.  As it currently stands, I am not aware of a
    single iPhone application that has been rejected due to the use of
    RegexKitLite, but there are an awful lot of iPhone applications that
    have been accepted- and this was prior to iPhone OS 3.2.
    ---- end disclaimer ----

    New Features

    RegexKitLite 4.0 now supports the new Blocks feature through a number
    of new methods (these represent the shorter "convenience" form of the
    methods):

    enumerateStringsMatchedByRegex:usingBlock:
    enumerateStringsSeparatedByRegex:usingBlock:
    stringByReplacingOccurrencesOfRegex:usingBlock:
    replaceOccurrencesOfRegex:usingBlock: (NSMutableString)

    The enumerateStringsMatchedByRegex: method invokes the passed ^block
    argument for each match, whereas enumerateStringsSeparatedByRegex:
    uses the supplied regular expression to perform a "split" operation on
    the string and invokes the supplied ^block argument with split
    results.

    The stringByReplacingOccurrencesOfRegex: and
    replaceOccurrencesOfRegex: methods perform a "search and replace"
    operation on the string, invoking the supplied ^block argument with
    details of each match, and replacing the characters matched with the
    contents of the NSString returned by the ^block.  This is very similar
    to the previous "search and replace" functionality that allowed you to
    create a replacement string using the contents of various capture
    groups using the "$n" syntax (i.e., @"First: $2, Last: $1", where $1
    and $2 are capture groups one and two from the regular expression).
    The difference is that Blocks-based replacement allows complete
    control over the replacement text instead of the limited "fixed
    function" capability previously available.

    The following is an example of what's possible using the new
    Blocks-based replacement functionality.  A common problem when dealing
    with HTML text is dealing with "&#NNN;" and "&#xHHH;" encoded
    characters.  Using Blocks-based search and replace, we can match these
    sequences using a regular expression and replace them with the actual
    Unicode character that they represent:

    NSString *string = @"A test: &#233; or &#xe9; (0xe9 == LATIN SMALL LETTER E WITH ACUTE)\n"
                       @"Even >0xffff are handled: &#119808; or &#x1D400; (0x1d400 == MATHEMATICAL BOLD CAPITAL A)";
    NSString *regex = @"&#(?:([0-9]+)|x([0-9a-fA-F]+));";

    NSString *replacedString = [string stringByReplacingOccurrencesOfRegex:regex
      usingBlock:^NSString *(NSInteger captureCount,
                             NSString * const capturedStrings[captureCount],
                             const NSRange capturedRanges[captureCount],
                             volatile BOOL * const stop) {
      // Capture 1 holds decimal digits; if it did not match, capture 2 holds hex digits.
      BOOL        hexValue        = (capturedRanges[1].location == NSNotFound) ? YES : NO;
      NSString   *valueString     = (hexValue == NO) ? capturedStrings[1] : capturedStrings[2];
      const char *valueUTF8String = [valueString UTF8String];
      NSUInteger  u16Length       = 0UL, u32_ch = 0UL;
      unichar     u16Buffer[3];

      u32_ch = strtoul(valueUTF8String, NULL, (hexValue == NO) ? 10 : 16);

      // BMP values are stored directly (lone surrogates become U+FFFD), values above
      // U+10FFFF become U+FFFD, and everything else is split into a surrogate pair.
      if (u32_ch <= 0xFFFFU)       { u16Buffer[u16Length++] = ((u32_ch >= 0xD800U) && (u32_ch <= 0xDFFFU)) ? 0xFFFDU : u32_ch; }
      else if (u32_ch > 0x10FFFFU) { u16Buffer[u16Length++] = 0xFFFDU; }
      else                         { u32_ch -= 0x0010000UL; u16Buffer[u16Length++] = ((u32_ch >> 10) + 0xD800U); u16Buffer[u16Length++] = ((u32_ch & 0x3FFUL) + 0xDC00U); }

      return([NSString stringWithCharacters:u16Buffer length:u16Length]);
    }];

    NSLog(@"string  :\n%@", string);
    NSLog(@"replaced:\n%@", replacedString);

    The output when run:

    2010-04-19 20:24:47.092 RegexKitLite[65827:a0f] string  :
    A test: &#233; or &#xe9; (0xe9 == LATIN SMALL LETTER E WITH ACUTE)
    Even >0xffff are handled: &#119808; or &#x1D400; (0x1d400 ==
    MATHEMATICAL BOLD CAPITAL A)
    2010-04-19 20:24:47.093 RegexKitLite[65827:a0f] replaced:
    A test: é or é (0xe9 == LATIN SMALL LETTER E WITH ACUTE)
    Even >0xffff are handled: 𝐀 or 𝐀 (0x1d400 == MATHEMATICAL BOLD
    CAPITAL A)
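
    The code point to UTF-16 step in the block above can also be pulled
    out into a small plain C function, which makes the surrogate pair
    arithmetic easier to check on its own.  This is only a sketch: the
    name utf16_encode and its signature are hypothetical and are not part
    of RegexKitLite.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes one Unicode code point as UTF-16 into out[] (room for 2 units).
 * Returns the number of 16-bit units written (1 or 2).  Lone surrogates and
 * values above U+10FFFF become U+FFFD REPLACEMENT CHARACTER, mirroring the
 * Blocks-based example. */
size_t utf16_encode(unsigned long cp, unsigned short out[2]) {
    assert(out != NULL);
    if (cp <= 0xFFFFUL) {
        /* Basic Multilingual Plane: one unit, unless it is a lone surrogate. */
        out[0] = (cp >= 0xD800UL && cp <= 0xDFFFUL) ? 0xFFFDU : (unsigned short)cp;
        return 1;
    }
    if (cp > 0x10FFFFUL) {
        /* Outside the Unicode range. */
        out[0] = 0xFFFDU;
        return 1;
    }
    /* Supplementary plane: split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000UL;
    out[0] = (unsigned short)((cp >> 10) + 0xD800UL);     /* high surrogate */
    out[1] = (unsigned short)((cp & 0x3FFUL) + 0xDC00UL); /* low surrogate  */
    return 2;
}
```

    For 0x1D400 (MATHEMATICAL BOLD CAPITAL A) this yields the pair 0xD835,
    0xDC00, while 0xE9 is stored as the single unit 0x00E9.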
  • On Tue, 20 Apr 2010 15:45:13 -0400, John Engelhart
    <john.engelhart...> said:

    > There are an awful lot of "Top 10" applications that use RegexKitLite
    > that don't acknowledge their use

    "An awful lot"? Ex hypothesi and by definition, there must be 10 such
    applications or fewer...! m.

    --
    matt neuburg, phd = <matt...>, <http://www.tidbits.com/matt/>
    A fool + a tool + an autorelease pool = cool!
    AppleScript: the Definitive Guide - Second Edition!
    http://www.tidbits.com/matt/default.html#applescriptthings
  • On Apr 21, 2010, at 4:33 PM, Matt Neuburg wrote:

    >> There are an awful lot of "Top 10" applications that use RegexKitLite
    >> that don't acknowledge their use
    >
    > "An awful lot"? Ex hypothesi and by definition, there must be 10 such
    > applications or fewer...! m.

    Except that there can be many Top Ten lists (Mac Apps, iPad Apps,
    iPhone Apps, Productivity Apps, etc.), and then different sites/people
    put different things on their lists.  You could pretty quickly get a
    list of unique programs much larger than 10 ;)

    -Chris Backas

  • On Wed, Apr 21, 2010 at 4:33 PM, Matt Neuburg <matt...> wrote:

    > On Tue, 20 Apr 2010 15:45:13 -0400, John Engelhart
    > <john.engelhart...> said:
    >
    >> There are an awful lot of "Top 10" applications that use RegexKitLite
    >> that don't acknowledge their use
    >
    > "An awful lot"? Ex hypothesi and by definition, there must be 10 such
    > applications or fewer...! m.

    I don't think many people would be surprised that any given "Top 10" list
    must contain, by definition, 10 or fewer such applications "that use
    RegexKitLite that don't acknowledge their use".

    However, it does not necessarily follow that there can only be 10 or fewer
    '"Top 10" applications that use RegexKitLite that don't acknowledge their
    use'.  Just because all beaucoup fish are fish does not mean that all fish
    are beaucoup fish.  Your conclusion would be true if, and only if, there was a
    single "Top 10" list.  It is assumed, without proving, that there is more
    than one "Top 10" list.

    There are an awful lot of "Top 10" applications that use RegexKitLite that
    don't acknowledge their use.  QED.

    > --
    > matt neuburg, phd = <matt...>, <http://www.tidbits.com/matt/>
    >

    "phd"?  Is that "University" with one or two "n"s?  :)