Regular Expressions?

  • Hi all,

    This might be a really silly question - but am I missing something obvious?
    Is there any support at all for regular expressions in the Cocoa libraries?

    I can't find anything and I've found some third-party frameworks - but
    surely something so necessary must be buried in the string classes
    somewhere? How would I do a simple substring search or replace in 10.4?

    Thanks,
    Cemil
  • On 6-Jun-08, at 4:31 AM, Cemil Browne wrote:

    > Hi all,
    >
    > This might be a really silly question - but am I missing something
    > obvious?
    > Is there any support at all for regular expressions in the Cocoa
    > libraries?
    >
    > I can't find anything and I've found some third-party frameworks - but
    > surely something so necessary must be buried in the string classes
    > somewhere? How would I do a simple substring search or replace in
    > 10.4?
    >
    > Thanks,
    > Cemil
    >

    Hello -

    There is no regular expression support in Cocoa.  You might find http://www.cocoadev.com/index.pl?RegularExpressions
      useful in helping to find a library to do it for you.

    Search and replace in Cocoa looks like this:

    NSString *someString = @"The quick brown fox";
    NSString *newString = [someString
    stringByReplacingOccurrencesOfString:@"quick" withString:@"slow"];

    You can find this in the NSString documentation.  It creates a new
    string with the substring replaced.

    -Bob Warwick
    <warwick...>
  • Hello -

    Whoops!  I can read.

    You use the replaceOccurrencesOfString:withString:options:range:
    method in NSMutableString.  It works on the same instance of the
    string instead of creating a new string.

    For example:

    NSMutableString *someString =
        [NSMutableString stringWithString:@"The quick brown fox"];
    [someString replaceOccurrencesOfString:@"quick"
                                withString:@"slow"
                                   options:NSCaseInsensitiveSearch
                                     range:NSMakeRange(0, [someString length])];

    -Bob Warwick
    <warwick...>

    On 6-Jun-08, at 4:54 AM, Cemil Browne wrote:

    > Bob,
    >
    > Thanks for the reply...
    >
    > However, stringByReplacingOccurrencesOfString only works in Leopard
    > (10.5), though, right?  What did everyone do last year for it?
    >
    > -Cemil
  • Depending on what you're doing, you could try using RubyCocoa.

    In theory this should give you access to all of Ruby's internal regexp
    support, combined with the GUI goodness of Cocoa.  However, this has
    limitations of its own, such as distribution audience, speed, etc.

    Dave

  • On Jun 6, 2008, at 1:01 AM, David Troy wrote:
    > Depending on what you're doing you could try using Ruby Cocoa.
    >
    > In theory this should give you access to all of Ruby's internal
    > regexp support, combined with the GUI goodness of Cocoa.  However,
    > this has limitations of its own, such as distribution audience,
    > speed, etc.

    As big of a fan as I am of both RubyCocoa and PyObjC, I would never
    recommend either of them for use by someone relatively new to Cocoa
    (of which it sounds like the OP might be).

    Even with the awesome quality of the bridges -- and RubyCocoa / PyObjC
    are not the only ones -- there is still too much of an impedance
    mismatch between the two running environments for it to be considered
    easy to use.    A far more productive approach is to reduce the number
    of variables by gaining confidence / competence in pure Objective-C
    based Cocoa, then adding whatever scripting language you are familiar
    with into the mix after.

    b.bum
  • If you are not married to using regular expressions, NSScanner can do
    much the same in a more verbose (generally easier to read) way. I only
    mention this because it is often overlooked.

  • Thanks to everyone who replied - I appreciate the help.

    The best solution I've found (and been told) is to use a third party Regex
    library - http://regexkit.sourceforge.net/  appears to be decent.

    NSScanner does not really appear to do what I'm looking for - but is useful
    to know about regardless.

    Thanks all,

    -Cemil

  • On Fri, Jun 6, 2008 at 8:31 AM, Cemil Browne <cemilb+<cocoadev...> wrote:

    > This might be a really silly question - but am I missing something obvious?
    > Is there any support at all for regular expressions in the Cocoa libraries?

    You can use NSPredicate for regexp matching, though no substitution is possible:

    NSPredicate *pred =
        [NSPredicate predicateWithFormat:@"self matches %@", yourRegexp];
    BOOL result = [pred evaluateWithObject:yourString];

    Hamish
  • On Friday, June 06, 2008, at 10:24AM, <cocoa-dev-request...> wrote:
    > As big of a fan as I am of both RubyCocoa and PyObjC, I would never
    > recommend either of them for use by someone relatively new to Cocoa
    > (of which it sounds like the OP might be).
    >
    > Even with the awesome quality of the bridges -- and RubyCocoa / PyObjC
    > are not the only ones -- there is still too much of an impedance
    > mismatch between the two running environments for it to be considered
    > easy to use.    A far more productive approach is to reduce the number
    > of variables by gaining confidence / competence in pure Objective-C
    > based Cocoa, then adding whatever scripting language you are familiar
    > with into the mix after.
    >
    > b.bum

    As someone who is actually doing precisely that (learning Cocoa by using RubyCocoa), I'd have to disagree.  The 'impedance mismatch' is really quite small (so far, I have only hit one nasty case), and it means that you don't have to fully learn Objective-C's syntax at the same time as Cocoa. Obviously you do have to learn enough ObjC syntax to be able to read the documentation, but that's a far easier task than learning to write ObjC code (and I say that as someone who writes pure C code for a day job).

    Anyway, I guess it comes down to personal preferences.  But I don't think the problems of the scripting bridges are significant enough to justify a blanket 'don't do this if you're a beginner'.

    Alli
  • Perhaps also consider RegexKitLite, which is written by the same
    author as RegexKit. The difference is that it links to the shared
    libicu that's already distributed with the OS, so there is no need to
    embed the specific version of the PCRE library that RegexKit bundles
    into your app (which saves ~1.6 MB in the bundle). Also, ICU is
    something Apple uses itself (and is unlikely to go away in future
    releases of OS X).

    On 6 Jun 2008, at 09:22, Cemil Browne wrote:

    > Thanks to everyone who replied - I appreciate the help.
    >
    > The best solution I've found (and been told) is to use a third party
    > Regex
    > library - http://regexkit.sourceforge.net/  appears to be decent.
    >
    > NSScanner does not really appear to do what I'm looking for - but is
    > useful
    > to know about regardless.
    >
    > Thanks all,
    >
    > -Cemil
    >

  • But RegexKitLite does not support substitution, does it?
    Regex pattern matching is one thing, regex string substitution another.



  • Hi,

    You've gotten a lot of decent answers so far.

    As a long time UNIX programmer, I'll suggest looking into the regexp
    library that already comes with OS X.

    Run man regcomp on the command line to find out how to use it.

    I've used it for years in my C applications on UNIX and UNIX-like
    operating systems. I've even made some simple classes for using the lib
    in Cocoa. Admittedly the interface is a bit schizoid, but I'll email it
    to you if you're interested in looking at it.

    I've looked at RegexKit and it looks good if you want Perl compatible
    regular expressions. The license isn't bad, so you can add it to your
    application without doing much, though giving a plug to the author(s) in
    your About panel would be nice. ;)

    I'll have to add making a nice interface for the system regex library to
    my list of things to do. Maybe I'll get to that this weekend right after
    doing base64, UUcode, BinHex, etc. ;)

    Cheers,
    Jason
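
    For reference, here is a minimal sketch of the regcomp/regexec calls
    that the man page describes, in plain C. The helper name regex_find is
    made up for illustration and is not part of any library mentioned in
    this thread. It assumes POSIX extended syntax (REG_EXTENDED) and
    reports offsets in bytes, which only line up with character positions
    for ASCII text.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `text` contains a match for the POSIX
 * extended regex `pattern`, 0 if it does not, and -1 if the pattern fails
 * to compile. On a match, *start and *end (if non-NULL) receive the byte
 * offsets of the first match. */
int regex_find(const char *pattern, const char *text,
               size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;                      /* bad pattern */

    rc = regexec(&re, text, 1, m, 0);
    regfree(&re);                       /* always release the compiled regex */

    if (rc != 0)
        return 0;                       /* REG_NOMATCH (or a runtime error) */

    if (start != NULL) *start = (size_t)m[0].rm_so;
    if (end != NULL)   *end   = (size_t)m[0].rm_eo;
    return 1;
}
```

    Calling regex_find("qu[a-z]+", "The quick brown fox", &s, &e) should
    report the span covering "quick". From Cocoa you would first need a C
    string, for example via -[NSString UTF8String].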
  • No, that would require calling rangeOfRegex followed by a call to
    replaceCharactersInRange (on a mutable string):

    NSRange range = [theString rangeOfRegex:@"regex" capture:0];

    if( ! NSEqualRanges(range, ((NSRange){NSNotFound, 0} )) )
      [theString replaceCharactersInRange:range withString:@"newstring"];

    Compare that to the RegexKit equivalent, which is:

    - (NSString *)stringByMatching:(id)aRegex
                           replace:(const RKUInteger)count
               withReferenceFormat:(NSString * const)referenceFormatString, ...;

    On 6 Jun 2008, at 11:01, Vincent E. wrote:

    > But RegexKitLite does not support substitution, does it?
    > Regex pattern matching is one thing, regex string substitution
    > another.
    >
  • Right, but that's a very trivial string replacement with no advanced
    modifications.

    I had something like this Perl script for changing case to "word caps"
    in mind:
    echo 'some test text' | perl -pe 's/\b(.*?)/\u\L$1/g'

    search pattern would be "\b(.*?)"
    replacement pattern would be "\u\L$1"

    I would need something with capabilities like these.

  • I've used OgreKit before and found it's worked pretty well.

    http://www8.ocn.ne.jp/%7esonoisa/OgreKit/index.html

    Dave

  • I agree that being able to use that syntax is highly desirable, and
    it is indeed missing from all the Cocoa libraries that I have looked
    at. One way would be a category addition to the NSString class, which
    would call perl -pe 's/\b(.*?)/\u\L$1/g' for you and return the result
    as an NSString...

    But unfortunately nobody has come up with that type of extension to
    NSString yet.
    Or are there any C libraries that will accept this s//// syntax?

    On 6 Jun 2008, at 14:39, Vincent E. wrote:

    > Right, but that's a very trivial string replacement with no advanced
    > modifications.
    >
    > I had thing like this perl script for changing case to "word caps"
    > in mind:
    > echo 'some test text' | perl -pe 's/\b(.*?)/\u\L$1/g'
    >
    > search pattern would be "\b(.*?)"
    > replacement pattern would be "\u\L$1"
    >
    > I would need something with capabilities like these.
    >
  • dream cat7 wrote:
    >
    > I agree that to be able to use that syntax is highly desirable, and
    > indeed missing from all the cocoa libraries that I have looked at. One
    > way would be a category addition to NSString class, which would call
    > the  perl -pe 's/\b(.*?)/\u\L$1/g' for you and return the result as an
    > NSString...
    >
    > But unfortunately nobody has come up that type of an extension to
    > NSString yet.
    > Or are there and c-libraries that will accept this s//// syntax ?

    Would something like this be acceptable?

    [someString replacePattern: @"\\b(.*?)" with: @"\\u\\L$1" flags: @"g"];

    It is pretty much how I have considered handling substitutions in the
    category that I'm designing.

    As for string subs in C libraries, I really am only familiar with using
    the POSIX standard regex library. You can do that sort of thing in it,
    but it is not as simple as writing a single line of code.

    I have some limited experience with PCRE, and IIRC the coding is more
    or less the same there.

    I don't know of any regex library that will accept s//// in a string.
    That's a bit of syntactic sugar that makes Perl so sweet. ;)

    Cheers,
    Jason
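
    To illustrate why substitution with the POSIX regex library takes more
    than a single line, here is a hedged sketch of a replace-all loop in
    C. The names regex_replace_all and append are hypothetical, and this
    is not the category described above. The replacement is a literal
    string: back-references such as \1 are deliberately not handled.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Appends n bytes of src to a growing, NUL-terminated heap buffer.
 * Returns 0 on success, -1 on allocation failure. */
static int append(char **buf, size_t *len, size_t *cap,
                  const char *src, size_t n)
{
    if (*len + n + 1 > *cap) {
        size_t ncap = (*len + n + 1) * 2;
        char *p = realloc(*buf, ncap);
        if (p == NULL)
            return -1;
        *buf = p;
        *cap = ncap;
    }
    memcpy(*buf + *len, src, n);
    *len += n;
    (*buf)[*len] = '\0';
    return 0;
}

/* Hypothetical helper: returns a malloc'd copy of `text` in which every
 * match of the POSIX extended regex `pattern` is replaced by the literal
 * string `repl`. Returns NULL on a bad pattern or allocation failure.
 * The caller frees the result. */
char *regex_replace_all(const char *pattern, const char *text,
                        const char *repl)
{
    regex_t re;
    regmatch_t m[1];
    char *out = NULL;
    size_t len = 0, cap = 0;
    const char *p = text;
    int failed, flags = 0;

    assert(pattern != NULL && text != NULL && repl != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;

    failed = append(&out, &len, &cap, "", 0);   /* start with "" */

    while (!failed && regexec(&re, p, 1, m, flags) == 0) {
        size_t so = (size_t)m[0].rm_so, eo = (size_t)m[0].rm_eo;

        /* Copy the text before the match, then the replacement. */
        failed = append(&out, &len, &cap, p, so)
              || append(&out, &len, &cap, repl, strlen(repl));

        if (eo == so) {             /* empty match: step over one byte */
            if (p[so] == '\0') { p += so; break; }
            failed = failed || append(&out, &len, &cap, p + so, 1);
            eo = so + 1;
        }
        p += eo;
        flags = REG_NOTBOL;         /* p is no longer at the start of a line */
    }
    if (!failed)
        failed = append(&out, &len, &cap, p, strlen(p));  /* trailing text */

    regfree(&re);
    if (failed) { free(out); return NULL; }
    return out;
}
```

    With this, regex_replace_all("qu[a-z]+", "The quick brown fox", "slow")
    should yield "The slow brown fox". Supporting back-references would
    mean asking regexec for more than one regmatch_t and expanding the
    replacement string by hand.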
  • Replying to myself here, which I know is generally a bad thing, but this
    thought just came to me.

    I have yet to find a regex library that handles UTF-16 well, if at all.
    I actually spent a couple of hours yesterday trying to mangle some
    UTF-16 files in Perl using regular expressions. I gave up and did it in
    Emacs, the only environment where I've seen REs handle UTF16 properly.

    So, that's now my mission, to come up with a RE library that handles
    UTF16 as gracefully as 7 bit ASCII.

    Cheers,
    Jason
  • On 6 Jun '08, at 3:23 AM, Jason Stephenson wrote:

    > As a long time UNIX programmer, I'll suggest looking into the regexp
    > library that already comes with OS X.
    > man regcomp on the command line to find out how to use.

    It doesn't look as though this library is Unicode-aware. The strings
    it takes are C strings (char*) with no indication of what encoding is
    used, and Unicode or UTF-8 aren't mentioned in the man page. From
    that, I'd guess that this library only works with single-byte
    encodings (like ISO-Latin-1 or CP-1252, not UTF-8 or the various
    non-Roman encodings) and that it will treat all non-ASCII characters
    as being not spaces and not letters.

    In short, I think it only works correctly with plain ASCII. IMHO
    that's much too limited for most purposes nowadays. Even if you don't
    touch user-visible text with it, it's still pretty common to find
    non-ASCII characters in HTML, XML, even source code.

    Of the regex libraries mentioned so far, I recommend RegexKitLite.
    It's based on ICU, which is Unicode-savvy, already built into the OS,
    and used by lots of Apple apps.

    —Jens
  • On Jun 6, 2008, at 5:23 AM, Jason Stephenson wrote:

    > Hi,
    >
    > You've gotten a lot of decent answers so far.
    >
    > As a long time UNIX programmer, I'll suggest looking into the regexp
    > library that already comes with OS X.
    >
    > man regcomp on the command line to find out how to use.

    Note that NSStrings are usually internally stored as UTF-16, and
    regcomp requires a "char *", so at the very least, you'll need to
    convert the NSString to UTF-8, which can be expensive (in terms of
    having to make a copy of a potentially very large string and
    walk through it before doing any regex work on it).

    Worse, once converted to UTF8, it's not documented that regcomp works
    correctly for any UTF-8 other than ASCII.

    Even worse, converting from an index in a UTF-8 string back to the
    corresponding index in the original NSString is also problematic - you
    basically have to walk through the UTF-8 string, counting code points
    (which count double for surrogate pairs).

    As a result, using regcomp works OK for shorter strings that are pure
    ASCII to start with, but longer strings or non-ASCII characters start
    to increase the problem...

    One other possible solution is to use the JavaScriptCore and make a
    JSStringRef (which works with unichars like NSString), and use
    JavaScript's regex support - that way the results will at least have
    consistent indices, work well with non-ASCII characters, etc...

    Glenn Andreas                      <gandreas...>
      <http://www.gandreas.com/> wicked fun!
    JSKit | the easy way to unite JavaScript and Objective C
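
    The index conversion described above (walking the UTF-8 bytes and
    counting, with surrogate pairs counting double) can be sketched in C
    as follows. The function name is hypothetical, and the sketch assumes
    well-formed UTF-8 and an offset that falls on a character boundary; it
    does no validation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: converts a byte offset into a UTF-8 string to the
 * equivalent index in UTF-16 code units (the unit NSString ranges are
 * measured in). Assumes `utf8` is well-formed and that `byte_offset`
 * falls on a character boundary. */
size_t utf16_index_for_utf8_offset(const char *utf8, size_t byte_offset)
{
    const unsigned char *s = (const unsigned char *)utf8;
    size_t i = 0, units = 0;

    assert(utf8 != NULL);
    while (i < byte_offset && s[i] != '\0') {
        if (s[i] < 0x80)      { i += 1; units += 1; }  /* ASCII */
        else if (s[i] < 0xE0) { i += 2; units += 1; }  /* 2-byte sequence */
        else if (s[i] < 0xF0) { i += 3; units += 1; }  /* 3-byte sequence */
        else                  { i += 4; units += 2; }  /* surrogate pair */
    }
    return units;
}
```

    For example, in the UTF-8 encoding of "naïve fox" the word "fox"
    starts at byte 7 but at UTF-16 index 6, because "ï" takes two bytes.
    Each lookup is linear in the offset, which is part of the cost being
    described.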
  • glenn andreas wrote:

    [wrote about how using regex is not a good idea, particularly with
    NSString and unicode. Pretty much the same things that Jens wrote earlier.]

    Yes, that's all very true. Regex is a poor choice if you're working on
    non-ASCII text. I'm generally not doing so, but just yesterday did have
    the unpleasant experience of regexing some UTF16 files. (See another
    email by me in this thread.)

    You could kludge it to work using some options that are available on Mac
    OS X and FreeBSD regex libraries. (Don't know if it is available
    elsewhere, but likely is.) Essentially, you tell regcomp to ignore nuls
    and then you have a lot of fun coding REs that match your UTF16 strings
    taking into account endianness and all. I've pondered how it would work
    and am confident that it would work, but also concede that it would be a
    very ugly hack and be prone to breakage.

    > One other possible solution is to use the JavaScriptCore and make a
    > JSStringRef (which works with unichars like NSString), and use
    > JavaScript's regex support - that way the results will at least have
    > consistent indices, work well with non-ASCII characters, etc...

    That is an excellent option if you're using JavaScriptCore already, or
    maybe even if you're not. There's another thing to look into. Anyone for
    a unicode text editor that is scriptable in JavaScript? (Hmm, maybe the
    world really doesn't need another text editor.) :P

    For now, I'm going to look into ICU. I seem to have a couple of copies
    of it on my computer.

    Cheers,
    Jason
  • On 6 Jun '08, at 8:02 AM, Jason Stephenson wrote:

    > I have yet to find a regex library that handles UTF-16 well, if at
    > all. I actually spent a couple of hours yesterday trying to mangle
    > some UTF-16 files in Perl using regular expressions. I gave up and
    > did it in Emacs, the only environment where I've seen REs handle
    > UTF16 properly.

    ICU is entirely UTF-16 based.

    —Jens
  • ... pcre takes utf8 strings
    ... utf-16 is supported by RegexKitLite & lib ICU

    ... NSString and CFString are implemented as utf-16

    On 6 Jun 2008, at 16:02, Jason Stephenson wrote:

    > Replying to myself here, which I know is generally a bad thing, but
    > this thought just came to me.
    >
    > I have yet to find a regex library that handles UTF-16 well, if at
    > all. I actually spent a couple of hours yesterday trying to mangle
    > some UTF-16 files in Perl using regular expressions. I gave up and
    > did it in Emacs, the only environment where I've seen REs handle
    > UTF16 properly.
    >
    > So, that's now my mission, to come up with a RE library that handles
    > UTF16 as gracefully as 7 bit ASCII.
    >
    > Cheers,
    > Jason
  • > dream cat7 wrote:
    >>
    >> I agree that to be able to use that syntax is highly desirable, and
    >> indeed missing from all the cocoa libraries that I have looked at. One
    >> way would be a category addition to NSString class, which would call
    >> the  perl -pe 's/\b(.*?)/\u\L$1/g' for you and return the result as an
    >> NSString...
    >>
    >> But unfortunately nobody has come up that type of an extension to
    >> NSString yet.
    >> Or are there and c-libraries that will accept this s//// syntax ?

    Well, Terry Jones wrote a routine called "strsed" in 1990 that in
    fact does this:

    http://www.jones.tc/personal/odds-and-ends.html

    It was loosely modeled on sed(1). It's a C routine that has this format:

    str = strsed(string, command, option);

    The string is something like "Hello world". The command is of the form:

    s/search/replace/ or g/search/replace/

    The "s" version finds the first match and replaces it. The "g"
    version finds and replaces every occurrence. It uses the regex(3)
    library.

    That's the good part. The bad part is that it uses the older, simpler
    regex capability. I'm in the process of modernizing this code to use
    the POSIX mode of regex(3), and hope to have it done next week. I
    will eventually put it on my not-yet-done web site and support it
    (maybe a small ObjC class for use with the icu library too).

    In the meantime, email me if you want version 1 (couple of days). It
    will be released with a very permissive license as the base code was
    totally open source.

    David
  • When I mentioned "perl -pe 's/\b(.*?)/\u\L$1/g'" I actually wasn't
    asking for any ObjC method with a look-alike syntax.
    I actually wouldn't give a damn about "how" ("s///g") to pass a regex
    pattern to a method. ;)

    I was rather asking whether RegexKit (or even RegexKitLite?) would
    generally be able to perform regex-driven string replacements
    where the replacement string contains things like match
    back-references (\1, \2, \<named>, …) or string modifiers like "\L, \U".

    Now to answer my own question:

    RegExKit has this function which (according to the documentation)
    seems to do just what I was looking for:
    [subjectString stringByMatching:regexString
    withReferenceString:templateString];

    And for the latter (\L, \U, etc) I unfortunately had to find this in
    the PCRE documentation:
    5. The following Perl escape sequences are not supported: \l, \u,
       \L, \U, and \N. In fact these are implemented by Perl's general
       string-handling and are not part of its pattern matching engine.
       If any of these are encountered by PCRE, an error is generated.
    http://www.pcre.org/pcre.txt

    Thanks a lot anyway.
    Vincent

    On Jun 6, 2008, at 4:45 PM, Jason Stephenson wrote:
    > Would something like this be acceptable?
    >
    > [someString replacePattern: @"\\b(.*?)" with: @"\\u\\L$1" flags:
    > @"g"];

  • On 6 Jun 2008, at 08:03, Jens Alfke wrote:
    >
    > On 6 Jun '08, at 3:23 AM, Jason Stephenson wrote:
    >
    >> As a long time UNIX programmer, I'll suggest looking into the
    >> regexp library that already comes with OS X.
    >> man regcomp on the command line to find out how to use.
    >
    > It doesn't look as though this library is Unicode-aware. The strings
    > it takes are C string (char*) with no indication of what encoding is
    > used, and Unicode or UTF-8 aren't mentioned in the man page. From
    > that, I'd guess that this library only works with single-byte
    > encodings (like ISO-Latin-1 or CP-1252, not UTF-8 or the various non-
    > Roman encodings) and that it will treat all non-ascii characters as
    > being not spaces and not letters.
    >
    > In short, I think it only works correctly with plain ascii. IMHO
    > that's much too limited for most purposes nowadays. Even if you
    > don't touch user-visible text with it, it's still pretty common to
    > find non-ascii characters in HTML, XML, even source code.
    >
    > Of the regex libraries mentioned so far, I recommend RegexKitLite.
    > It's based on ICU, which is Unicode-savvy, already built into the
    > OS, and used by lots of Apple apps.

    You are correct, but in my casual usage, feeding UTF-8 to the POSIX
    regex routines works just fine if you take into account that the
    defined character classes are ASCII-aware only, and are aware that the
    results you get back are byte offsets, not character offsets - i.e.
    don't convert them to NSRanges and expect them to be correct against
    the NSString you got the UTF-8 from (similar caveats apply to match
    counts etc. - i.e. ".{3}" will happily match two characters if they
    take up three bytes).
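
    To make the byte-offset caveat concrete, here is a minimal C sketch
    (not code from this thread). The helper names first_match_bytes and
    utf8_char_count are made up for illustration, and the sketch assumes
    the process is still in the default "C" locale, where the POSIX
    routines treat every byte as one character. Some regex
    implementations behave differently once a UTF-8 locale has been
    selected with setlocale().

```c
#include <regex.h>
#include <stddef.h>

/* Runs an extended POSIX regex over a NUL-terminated byte string and
 * reports the first match as BYTE offsets into the subject.
 * Returns 1 on a match, 0 on no match, -1 if the pattern does not compile. */
int first_match_bytes(const char *pattern, const char *subject,
                      size_t *start, size_t *end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    rc = regexec(&re, subject, 1, m, 0);
    regfree(&re);
    if (rc != 0)
        return 0;
    *start = (size_t)m[0].rm_so;
    *end = (size_t)m[0].rm_eo;
    return 1;
}

/* Counts UTF-8 encoded characters by skipping continuation bytes
 * (those of the form 10xxxxxx). Assumes the input is well-formed UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

    With the subject "a\xc3\xa9" ("a" followed by "é" in UTF-8), the
    pattern ".{3}" reports a match spanning bytes 0 to 3, although
    utf8_char_count says the string holds only two characters. That is
    why the offsets cannot be turned into an NSRange directly.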

    I wouldn't want to present the regexes to the user, of course, but for
    pre-defined regexes in code, it's okay (not great with those caveats
    obviously, but alright).

    My main complaint about it is that it's /extremely slow/ compared to
    most modern regex libraries, but for casual usage, you at least don't
    have to link any extra libraries to use it.

    I do think that good regex additions to NSString, or an NSRegex class,
    are highly overdue in Cocoa.

    Jamie.
  • On 6 Jun '08, at 8:13 AM, glenn andreas wrote:

    > One other possible solution is to use the JavaScriptCore and make a
    > JSStringRef (which works with unichars like NSString), and use
    > JavaScript's regex support - that way the results will at least have
    > consistent indices, work well with non-ASCII characters, etc...

    JavaScriptCore is just using PCRE*. That basically supports Unicode,
    but I have had problems in the past with non-Roman text in JS regular
    expressions (particularly with word breaks in Japanese text). I think
    ICU is a better bet.

    —Jens

    * and until a month or two ago it was a very ancient version of PCRE,
    with prominent security holes; that's how that guy won a MacBook Pro
    in the "Pwn2Own" contest.
  • On Fri, Jun 6, 2008 at 10:13 AM, glenn andreas <gandreas...> wrote:
    > One other possible solution is to use the JavaScriptCore and make a
    > JSStringRef (which works with unichars like NSString), and use JavaScript's
    > regex support - that way the results will at least have consistent indices,
    > work well with non-ASCII characters, etc...

    As long as we're discussing really far out solutions...

    There's the Java java.util.regex package. One could write some JNI and
    wrap the Java package (only two classes) although you'd have to start
    up the VM.

    Personally, I like some of the extensions that Java adds to the patterns.
  • On Jun 6, 2008, at 2:10 AM, Allison Newman wrote:

    > you don't have to fully learn Objective C's syntax at the same time
    > as Cocoa.

    Ok, all kinds of alarm bells just went off.  Obj-C is a very small
    delta from C, and if you avoid learning it, you will regret it.

    -jcr
  • What I found so useful about Cocoa-Java was that it was the perfect
    tool for easily writing Cocoa Apps that made heavy use of technologies
    that Apple was too short-sighted to add, largely because Java came out-
    of-the-box with so many useful classes for basic stuff like regular
    expressions. And its integration with the rest of the environment made
    it far less painful than the other language bindings. Back in the
    Jaguar-era when I had to write applications that made heavy use of XML
    and regular expressions, Cocoa-Java saved the day--no 3rd-party
    nonsense required.

    Those were the days...

    -- Ilan

    On Jun 6, 2008, at 5:47 PM, Stephen J. Butler wrote:

    > On Fri, Jun 6, 2008 at 10:13 AM, glenn andreas <gandreas...>
    > wrote:
    >> One other possible solution is to use the JavaScriptCore and make a
    >> JSStringRef (which works with unichars like NSString), and use
    >> JavaScript's
    >> regex support - that way the results will at least have consistent
    >> indices,
    >> work well with non-ASCII characters, etc...
    >
    > As long as we're discussing really far out solutions...
    >
    > There's the Java java.util.regex package. One could write some JNI and
    > wrap the Java package (only two classes) although you'd have to start
    > up the VM.
    >
    > Personally, I like some of the extensions that Java adds to the
    > patterns.
  • Ilan Volow wrote:
    > Back in the
    > Jaguar-era when I had to write applications that made heavy use of XML
    > and regular expressions, Cocoa-Java saved the day--no 3rd-party nonsense
    > required.

    This is not a knock on Ilan. His mail just happens to embody an attitude
    that I see quite frequently on this list, and I just feel that I have to
    share my puzzlement at this negative attitude toward 3rd party frameworks.

    It seems that many on this list feel that Apple should provide
    everything that the programmer needs to work on Mac OS X and that there
    should not be 3rd party frameworks for much of anything.

    This attitude really, truly puzzles me because on every other platform
    where I've programmed this attitude never came up in the discussion
    forums. It was always just assumed that you would need to use 3rd party
    frameworks to get any real work done, unless you intended to roll
    everything yourself.

    If you look at programming for Linux or any of the BSDs, you will
    definitely need to install frameworks from 3rd parties to do any GUI
    programming at all, or really any programming. After all, gcc is not
    produced by any of the major distributors or developers of Linux or BSD.
    Heck, even on the Mac, most of the programming frameworks are based on
    3rd party frameworks underneath.

    The same is true for Perl where many applications have a list of 3rd
    party module dependencies that make Amy Winehouse look clean and sober. ;)

    The only other environment where I've programmed that this same attitude
    may rear its head could be Java land, but even there that attitude does
    not seem to rear its head quite so often as it seems to on this list.

    As someone who has worked on a number of 3rd party [open source and
    otherwise] frameworks, I wonder where this attitude comes from in the
    case of Cocoa/Mac OS X. I have some ideas, but I hesitate to share them.

    Puzzled,
    Jason
  • In math, a result is 'elegant' if it just _does_ something, simply and
    quickly, rather than relying on a mass of machinery done elsewhere,
    that you either have to assume works or spend time understanding.  A
    large dependency can make it harder to say what, exactly, are the key
    lynchpins that make the result happen.

    I think code is similar.  Fewer dependencies is simpler, more elegant.
    Dropping in a chunk of code from elsewhere is not a huge deal.  A
    class, a bit more so, a framework, more so, a framework with new
    overarching patterns or a new language, more so, etc.

    This goes further than 1st party vs 3rd party.  You'll see people want
    to write in 'pure cocoa' or 'pure cocoa bindings'.

    The tradeoff is that when Cocoa (or whatever) _doesn't_ do something
    particularly well, using something else can easily make your code
    enough simpler that it offsets the complexity of using the new thing.

    But I think it's perfectly reasonable to use the new thing and then
    grouse about how Apple should make it unnecessary. :-)

    -Ken

    On Sat, Jun 7, 2008 at 7:43 AM, Jason Stephenson <jason...> wrote:
    > Ilan Volow wrote:
    >>
    >> Back in the
    >> Jaguar-era when I had to write applications that made heavy use of XML and
    >> regular expressions, Cocoa-Java saved the day--no 3rd-party nonsense
    >> required.
    >
    > This is not a knock on Ilan. His mail just happens to embody an attitude
    > that I see quite frequently on this list, and I just feel that I have to
    > share my puzzlement at this negative attitude toward 3rd party frameworks.
    >
    > It seems that many on this list feel that Apple should provide everything
    > that the programmer needs to work on Mac OS X and that there should not be
    > 3rd party frameworks for much of anything.
    >
    > This attitude really, truly puzzles me because on every other platform where
    > I've programmed this attitude never came up in the discussion forums. It was
    > always just assumed that you would need to use 3rd party frameworks to get
    > any real work done, unless you intended to roll everything yourself.
    >
    > If you look at programming for Linux or any of the BSDs, you will definitely
    > need to install frameworks from 3rd parties to do any GUI programming at
    > all, or really any programming. After all, gcc is not produced by any of the
    > major distributors or developers of Linux or BSD. Heck, even on the Mac,
    > most of the programming frameworks are based on 3rd party frameworks
    > underneath.
    >
    > The same is true for Perl where many applications have a list of 3rd party
    > module dependencies that make Amy Winehouse look clean and sober. ;)
    >
    > The only other environment where I've programmed that this same attitude may
    > rear its head could be Java land, but even there that attitude does not seem
    > to rear its head quite so often as it seems to on this list.
    >
    > As someone who has worked on a number of 3rd party [open source and
    > otherwise] frameworks, I wonder where this attitude comes from in the case
    > of Cocoa/Mac OS X. I have some ideas, but I hesitate to share them.
    >
    > Puzzled,
    > Jason
    >
  • On Jun 6, 2008, at 1:27 PM, Vincent E. wrote:

    > When I mentioned "perl -pe 's/\b(.*?)/\u\L$1/g'" I actually wasn't
    > asking for any ObjC method with a look-alike syntax.
    > I actually wouldn't give a damn about "how" ("s///g") to pass a
    > regex pattern to a method. ;)
    >
    > I was rather asking whether RegExKit (or even RegExKitLite?) would
    > generally be able to perform regex-driven string replacements
    > where the replacement string contains things like match
    > back-references (\1, \2, \<named>, …) or case modifiers like \L
    > and \U.
    >
    > Now to answer my own question:
    >
    > RegExKit has this function which (according to the documentation)
    > seems to do just what I was looking for:
    > [subjectString stringByMatching:regexString
    > withReferenceString:templateString];
    >
    > And for the latter (\L, \U, etc) I unfortunately had to find this in
    > the PCRE documentation:
    > 5.  The following Perl escape sequences are not supported: \l, \u, \L,
    >     \U, and \N. In fact these are implemented by Perl's general
    >     string-handling and are not part of its pattern matching engine.
    >     If any of these are encountered by PCRE, an error is generated.
    > http://www.pcre.org/pcre.txt
    >
    > Thanks a lot any way.
    > Vincent

    Actually, RegexKit handles perl like \u \l \U \L \E case conversions.

    http://regexkit.sourceforge.net/Documentation/NSString.html#ExpansionofCaptureSubpatternMatchReferencesinStrings


    NSString *example = [NSString stringWithUTF8String:"Stra\xc3\x9f" "e"]; // Straße
    NSString *upper = [example stringByMatching:@"(.*)" replace:RKReplaceAll
                               withReferenceString:@"\\U$1\\E"]; // STRASSE
    NSString *lower = [upper stringByMatching:@"(.*)" replace:RKReplaceAll
                             withReferenceString:@"\\L$1\\E"]; // strasse != Straße

    As the example shows, case conversion is Unicode aware as well (ß ->
    SS -> ss).  Your example would become:

    [@"Some string to do work on" stringByMatching:@"\\b(.*?)"
    replace:RKReplaceAll withReferenceString:@"\\u\\L$1"];

    The replace:RKReplaceAll is the functional equivalent of the /g
    modifier.
  • On Sat, Jun 7, 2008 at 10:43 AM, Jason Stephenson <jason...> wrote:
    > As someone who has worked on a number of 3rd party [open source and
    > otherwise] frameworks, I wonder where this attitude comes from in the case
    > of Cocoa/Mac OS X. I have some ideas, but I hesitate to share them.

    Four things:
    1) There are certain basics like regex support that people are upset
    at Apple for not implementing because it seems like such an important
    part of the concept of strings.
    2) Licensing issues can arise for third-party frameworks.
    3) Objective-C has no namespaces, and categories are fragile.
    4) Linking against a third-party framework requires distributing the
    framework inside the app bundle.  Look at the proliferation of
    Sparkle.framework to see why this is a Bad Thing(TM).

    --Kyle Sluder
  • It is possible to link your application through C to an
    interpreter like Python or Perl, and rely on the built-in
    regular expression libraries to do your work.  If you
    really wanted to, you could fire off a call to /usr/bin/egrep.

    These are all part of the default Mac OS X platform, they
    require no dependency on a bundled framework, and have no
    license issues.

    In all honesty, you wouldn't want Apple to "implement" this
    itself, because they'd have to start from scratch and there
    would be bugs.  I listed 3 implementations that are very
    mature and powerful.

    The main downside in each case is that you're converting a
    small amount of code and strings to the target form (command
    line, Python or Perl code).

    Kevin G.

    >> As someone who has worked on a number of 3rd party [open source and
    >> otherwise] frameworks, I wonder where this attitude comes from in
    >> the case
    >> of Cocoa/Mac OS X. I have some ideas, but I hesitate to share them.
    >
    > Four things:
    > 1) There are certain basics like regex support that people are upset
    > at Apple for not implementing because it seems like such an important
    > part of the concept of strings.
    > 2) Licensing issues can arise for third-party frameworks.
    > 3) Objective-C has no namespaces, and categories are fragile.
    > 4) Linking against a third-party framework requires distributing the
    > framework inside the app bundle.  Look at the proliferation of
    > Sparkle.framework to see why this is a Bad Thing(TM).
    >
    > --Kyle Sluder
  • On Jun 7, 2008, at 12:37 PM, Kevin Grant wrote:

    > It is possible to link your application through C to an
    > interpreter like Python or Perl, and rely on the built-in
    > regular expression libraries to do your work.  If you
    > really wanted to, you could fire off a call to /usr/bin/egrep.
    >

    That last one would possibly be the worst of all possible ways - the
    cost involves spawning another process (potentially for each regex),
    sending a potentially large string to it, and parsing the output back
    from it to get any sort of results. It's made all the more complicated
    by the fact that all sorts of weirdness is involved when sending
    non-ASCII to another process (since environment variables get involved
    to determine encoding schemes - and you have no way of knowing what the
    user set up in the handful of default encoding environment
    parameters that exist).

    NSTask works great for some things - shlepping large amounts of string
    data back and forth for a simple utility isn't one of them.

    > These are all part of the default Mac OS X platform, they
    > require no dependency on a bundled framework, and have no
    > license issues.

    Can't speak for Perl, but the Python framework changes from OS version
    to OS version, which can potentially cause problems with linking to
    it, so suddenly this introduces an OS dependency.  There is no
    "forward compatibility guarantee" that is implicit in various parts of
    AppKit (granted, there are occasionally problems there, but those
    tend to be the exception).

    >
    >
    > In all honesty, you wouldn't want Apple to "implement" this
    > itself, because they'd have to start from scratch and there
    > would be bugs.  I listed 3 implementations that are very
    > mature and powerful.

    Except that there already is a very good regex engine as part of the
    OS (ICU) which is used by some parts of the system already - it's just
    not fully exposed in Cocoa (so it's not like they'd be implementing it
    from scratch at all).  Asking that they take that last step seems like
    a very reasonable request (and it has been discussed for many-an-OS-
    release).

    Glenn Andreas                      <gandreas...>
      <http://www.gandreas.com/> wicked fun!
    m.o.t.e.s. | minute object twisted environment simulation
  • On Sat, Jun 7, 2008 at 10:43 AM, Jason Stephenson <jason...> wrote:
    > It seems that many on this list feel that Apple should provide everything
    > that the programmer needs to work on Mac OS X and that there should not be
    > 3rd party frameworks for much of anything.
    >
    > This attitude really, truly puzzles me because on every other platform where
    > I've programmed this attitude never came up in the discussion forums. It was
    > always just assumed that you would need to use 3rd party frameworks to get
    > any real work done, unless you intended to roll everything yourself.

    I think it's because Cocoa provides so much, but falls short in some
    strange places. You can easily get used to having Cocoa do all the
    work for you. XML parsing? Check. HTTP handling? Check. Media
    decoding? Check. Speech synthesis? Check. Regular expressions? Doh!

    Of course Mac OS X does come with a regex library, it just doesn't
    have an ObjC interface. There's more to what's available than Cocoa,
    and one of the great things about ObjC is how easy it is to talk to
    these pure C libraries and get them to do work for you as well.

    Mike
  • <snip/>

    Agree with your sentiments. Not everything needs to be shipped by
    default.

    > The only other environment where I've programmed that this same
    > attitude may rear its head could be Java land, but even there that
    > attitude does not seem to rear its head quite so often as it seems
    > to on this list.

    Naaaa... Java did start shipping more and more stuff included over the
    years. It's also the huge number of 3rd party libraries that makes it
    so appealing to so many people. But for some reason I still found it
    easier to find my way through the options available.

    Actually Java even provides good examples where 3rd party libraries are
    better than the bundled ones. Thinking of Crimson and the logging
    API... I'm sure there are even more.

    Anyway ....this is getting OT

    cheers
    --
    Torsten
  • On 6/7/08, Michael Ash <michael.ash...> wrote:

    > Of course Mac OS X does come with a regex library, it just doesn't
    > have an ObjC interface. There's more to what's available than Cocoa,
    > and one of the great things about ObjC is how easy it is to talk to
    > these pure C libraries and get them to do work for you as well.

    Many folks see regular expressions as a core part of today's
    frameworks (.NET, Java, Ruby, Python). ObjC, admittedly, looks a bit
    anemic in this area. There is no blessed solution and so the developer
    is required to research each of the many libraries to consider their
    pros and cons. Apple provides some libraries, but doesn't support
    linking against them (like ICU, which is used by Apple in Xcode, but
    we've been told not to link against the system library).

    While it is always possible to drop down to the pure C libraries, I
    see that as an unnecessary step, akin to asking Cocoa developers to
    use C libraries to do string manipulation.

    For everyone that would like to see Apple add support for regex into
    the Cocoa frameworks, I highly recommend filing a bug via
    http://bugreporter.apple.com to let Apple know that they need to
    assign some engineering effort to this issue.

    --
    Mark Munz
    unmarked software
    http://www.unmarked.com/
  • On Sat, Jun 7, 2008 at 7:19 PM, Mark Munz <unmarked...> wrote:
    > On 6/7/08, Michael Ash <michael.ash...> wrote:
    >
    >> Of course Mac OS X does come with a regex library, it just doesn't
    >> have an ObjC interface. There's more to what's available than Cocoa,
    >> and one of the great things about ObjC is how easy it is to talk to
    >> these pure C libraries and get them to do work for you as well.
    >
    > Many folks see regular expressions as a core part of today's
    > frameworks (.NET, Java, Ruby, Python). ObjC, admittedly, looks a bit
    > anemic in this area. There is no blessed solution and so the developer
    > is required to research each of the many libraries to consider their
    > pros and cons. Apple provides some libraries, but doesn't support
    > linking against them (like ICU, which is used by Apple in Xcode but
    > we've been told don't link against the system library).

    I never cared about the lack of regex support personally, although I
    understand that people do use them. As far as a blessed solution goes,
    "man regex" gives you a library that's in libSystem and is part of
    POSIX, so it's as supported as you can get.

    > While it is always possible to drop down to the pure C libraries, I
    > see that is an unnecessary step and akin to asking Cocoa developers to
    > use C libraries to do string manipulation.

    I do this with a fair amount of regularity. NSString is unsuitable for
    working with data whose encoding is unknown or doubtful, and NSData
    doesn't have any string-like functionality, so the standard C str
    functions can be very useful here.

    Mike
  • On 7 Jun '08, at 10:24 AM, Kyle Sluder wrote:

    > 1) There are certain basics like regex support that people are upset
    > at Apple for not implementing because it seems like such an important
    > part of the concept of strings.

    Agreed, and I made this argument many times while there. Part of the
    problem is the impedance mismatch issue — Apple would want to use ICU
    because of its good Unicode support, but the way the ICU APIs are
    written generally requires copying the string to a temporary buffer
    (especially if you consider thread-safety.) There has been talk of
    extending ICU regexps to support plug-in string storage, which would
    be more efficient, and as far as I can tell, the whole regexps-for-
    Cocoa feature gets hung up on waiting for that, year after year.

    Filing bugs on it is going to annoy people on the Cocoa team at Apple
    because they'll have to flag them all as dups and send them back; but
    it's the only lever that's been given to 3rd party developers to
    influence this, so I can't fault anyone for yanking it.

    > 4) Linking against a third-party framework requires distributing the
    > framework inside the app bundle.  Look at the proliferation of
    > Sparkle.framework to see why this is a Bad Thing(TM).

    Not if it's in the OS, which almost all of the regexp libraries are. (I
    know, you mean the Obj-C adapter framework, but in the case of
    RegexKitLite that's just one class / source file, which I don't see
    as a problem. Just statically link it into your app.)

    —Jens
  • On 8 Jun '08, at 3:39 AM, Michael Ash wrote:

    > I never cared about the lack of regex support personally, although I
    > understand that people do use them. As far as a blessed solution goes,
    > "man regex" gives you a library that's in libSystem and is part of
    > POSIX, so it's as supported as you can get.

    And (as discussed a few weeks ago) it's not Unicode-savvy, which could
    bite the unwary developer in the ass, especially when attempting to
    localize their app into non-Roman languages like Japanese.

    > I do this with a fair amount of regularity. NSString is unsuitable for
    > working with data whose encoding is unknown or doubtful, and NSData
    > doesn't have any string-like functionality, so the standard C str
    > functions can be very useful here.

    Ouch. The problem with those is that, every time you call one, you've
    added a potential buffer overrun bug to your app. And if the data in
    the string came from an untrusted source like the network, that
    escalates to a potential security vulnerability.

    Also, speaking of doubtful encodings, the regular C string functions
    will fail quite badly on 16-bit character encodings, where it's more
    than likely that every other byte is a zero.
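
    A tiny C sketch of that failure mode (an illustration, not code from
    this thread; the names utf16le_hi, naive_length and utf16_units are
    made up). It hard-codes "Hi" as little-endian UTF-16 bytes so the
    result does not depend on the host byte order.

```c
#include <stddef.h>
#include <string.h>

/* "Hi" encoded as UTF-16LE, followed by a 16-bit zero terminator. */
static const char utf16le_hi[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };

/* What a byte-oriented C string function sees: it stops at the first
 * zero byte, which is the high half of the very first character. */
size_t naive_length(void)
{
    return strlen(utf16le_hi);
}

/* Counts 16-bit code units by looking for a 16-bit zero instead.
 * Assumes the buffer really ends with two zero bytes on an even offset. */
size_t utf16_units(const char *bytes)
{
    size_t i = 0;
    while (bytes[i] != 0 || bytes[i + 1] != 0)
        i += 2;
    return i / 2;
}
```

    Here naive_length() returns 1 while utf16_units(utf16le_hi) returns
    2, so strlen, strcpy, strstr and friends all silently truncate such
    data after the first byte.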

    My general tactic when dealing with unknown data whose encoding can't
    be determined is to just fall back on CP-1252 [though Aki Inoue
    suggested MacRoman], both of which are supersets of ascii that map
    every byte to a character. That way you'll always get a non-nil
    NSString, and any ascii text in the original will come out unscathed.
    That's a better result than you'll get with C string APIs.
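
    The "every byte maps to a character" fallback can be sketched in
    plain C as well. Note the swap: this illustration uses ISO-8859-1
    (Latin-1) rather than CP-1252 or MacRoman, because Latin-1 maps byte
    value N straight to code point U+00N and so needs no lookup table.
    The function name latin1_to_utf8 is hypothetical, and in Cocoa you
    would normally just ask NSString to decode the data with the
    fallback encoding instead.

```c
#include <stddef.h>

/* Re-encodes len bytes of ISO-8859-1 text as NUL-terminated UTF-8.
 * Every input byte is a valid character, so decoding cannot fail;
 * the only error is an output buffer that is too small.
 * Returns the number of bytes written (excluding the NUL),
 * or (size_t)-1 if outcap is insufficient. */
size_t latin1_to_utf8(const unsigned char *in, size_t len,
                      char *out, size_t outcap)
{
    size_t i, o = 0;

    if (outcap == 0)
        return (size_t)-1;
    for (i = 0; i < len; i++) {
        unsigned char b = in[i];
        size_t need = (b < 0x80) ? 1 : 2;

        if (outcap - o <= need)          /* keep room for the final NUL */
            return (size_t)-1;
        if (b < 0x80) {
            out[o++] = (char)b;          /* ASCII passes through unchanged */
        } else {
            out[o++] = (char)(0xC0 | (b >> 6));    /* lead byte */
            out[o++] = (char)(0x80 | (b & 0x3F));  /* continuation byte */
        }
    }
    out[o] = '\0';
    return o;
}
```

    An output buffer of 2 * len + 1 bytes is always large enough. Any
    ASCII in the input comes out byte-for-byte identical, which is the
    property described above.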

    —Jens
  • On Jun 9, 2008, at 8:12 PM, Jens Alfke wrote:

    > On 7 Jun '08, at 10:24 AM, Kyle Sluder wrote:
    >
    >> 1) There are certain basics like regex support that people are upset
    >> at Apple for not implementing because it seems like such an important
    >> part of the concept of strings.
    >
    > Agreed, and I made this argument many times while there. Part of the
    > problem is the impedance mismatch issue — Apple would want to use
    > ICU because of its good Unicode support, but the way the ICU APIs
    > are written generally requires copying the string to a temporary
    > buffer (especially if you consider thread-safety.) There has been
    > talk of extending ICU regexps to support plug-in string storage,
    > which would be more efficient, and as far as I can tell, the whole
    > regexps-for-Cocoa feature gets hung up on waiting for that, year
    > after year.

    Is this the ICU problem you're talking about?

    http://lists.apple.com/archives/Cocoa-dev/2007/Sep/msg00416.html

    I thought I read on the Xcode users list that Xcode is using ICU for
    regex find-and-replace, so it's too bad the rest of us can't use it.

    Unfortunately, I think filing bug reports on this is a waste of time
    at this point.  I'm still using AGRegex, which is based on a pretty
    ancient PCRE, but it's predated by (at least) MOKit and
    OFRegularExpression:

    http://www.omnigroup.com/mailman/archive/macosx-dev/1998-December/006560.html


    So this has been a recurring theme for a few years now...

    --
    Adam
  • On 6/9/08, Adam R. Maxwell <amaxwell...> wrote:
    >
    > Unfortunately, I think filing bug reports on this is a waste of time at
    > this point.  I'm still using AGRegex, which is based on a pretty ancient
    > PCRE, but it's predated by (at least) MOKit and OFRegularExpression:

    Filing bugs against this *IS NOT* a waste of time. Please don't
    discourage people from letting Apple know that this is an important
    issue for the developers. If a large number of developers file bugs
    against this, it lets Apple know that this is an area that deserves
    more resources and attention.

    Just wishing for the problem to go away or blaming external criteria
    will almost guarantee that nothing gets done. Filing bugs is how you,
    the developer, communicate your needs to Apple.

    Apple engineers have ALWAYS encouraged me to log bugs against issues I
    mentioned to them, even if they are duplicates.

    --
    Mark Munz
    unmarked software
    http://www.unmarked.com/
  • On Jun 9, 2008, at 9:11 PM, Adam R. Maxwell wrote:
    >
    >
    > I thought I read on the Xcode users list that Xcode is using ICU for
    > regex find-and-replace, so it's too bad the rest of us can't use it.

    I recall the same. And further, I am of the understanding that
    NSPredicate uses ICU for its pattern matching -- can anyone confirm?

    ~~~

    For another regex solution, take a look at Objective PCRE:

    http://sourceforge.net/projects/objpcre

    For my regex needs, I link in libpcre and use objpcre as the "glue."

    objpcre accepts NSString objects for both the regular expression and
    the string to be evaluated. Whether (or how) this UTF-16 string is
    converted to UTF-8 (which, I believe, is native encoding for pcre),
    happens behind the scenes, and so is "invisible" to me.

    FWIW. HTH.
  • On Mon, Jun 9, 2008 at 8:17 PM, Jens Alfke <jens...> wrote:
    >
    > On 8 Jun '08, at 3:39 AM, Michael Ash wrote:
    >
    >> I do this with a fair amount of regularity. NSString is unsuitable for
    >> working with data whose encoding is unknown or doubtful, and NSData
    >> doesn't have any string-like functionality, so the standard C str
    >> functions can be very useful here.
    >
    > Ouch. The problem with those is that, every time you call one, you've added
    > a potential buffer overrun bug to your app. And if the data in the string
    > came from an untrusted source like the network, that escalates to a
    > potential security vulnerability.

    Sorry, what? It's perfectly possible to write safe code that calls C
    str functions. My code is no more vulnerable than the next man's. You
    can call things like strnstr, pass the length of the NSData you're
    working on, and there is exactly zero risk of anything.
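
    As a rough illustration of a length-bounded search (a sketch, not
    Mike's actual code), the following C function looks for a byte
    sequence inside a buffer whose size is passed explicitly, the way
    you would pass [data bytes] and [data length]. It is hand-written
    rather than built on strnstr, since strnstr is a BSD extension and
    also stops at the first NUL byte. The name find_bytes is made up.

```c
#include <stddef.h>
#include <string.h>

/* Finds the first occurrence of needle (nlen bytes) inside haystack
 * (hlen bytes). Neither buffer needs a NUL terminator and embedded
 * NULs are treated as ordinary bytes. Never reads past hlen.
 * Returns a pointer into haystack, or NULL if there is no match. */
const void *find_bytes(const void *haystack, size_t hlen,
                       const void *needle, size_t nlen)
{
    const unsigned char *h = haystack;
    size_t i;

    if (nlen == 0)
        return haystack;
    if (nlen > hlen)
        return NULL;
    for (i = 0; i + nlen <= hlen; i++) {
        if (memcmp(h + i, needle, nlen) == 0)
            return h + i;
    }
    return NULL;
}
```

    The safety comes entirely from the caller passing the true buffer
    length. Pass a length that is too large and the function overruns
    just like any other.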

    > Also, speaking of doubtful encodings, the regular C string functions will
    > fail quite badly on 16-bit character encodings, where it's more than likely
    > that every other byte is a zero.

    While true, this is also irrelevant when you know that your data is
    not, in fact, 16-bit. I use this technique when the data is known to
    be ASCII-like but exactly what kind of ASCII-like encoding is unknown.
    It would be nonsensical to use it for UTF-16 data, and thus I don't.

    > My general tactic when dealing with unknown data whose encoding can't be
    > determined is to just fall back on CP-1252 [though Aki Inoue suggested
    > MacRoman], both of which are supersets of ascii that map every byte to a
    > character. That way you'll always get a non-nil NSString, and any ascii text
    > in the original will come out unscathed. That's a better result than you'll
    > get with C string APIs.

    No, it's not. A common technique is to use C string APIs to find line
    endings, then try the full line as UTF-8. If it fails, then you can
    fall back on a more forgiving encoding. This will give correct results
    for UTF-8, which in many cases is the expected encoding, which is very
    nice to have. Turning well-formed UTF-8 text into long strings of
    nonsense characters is generally undesirable. It also has an extremely
    low probability of false positives, as it's difficult to construct a
    sensible string in a different encoding which is also valid UTF-8. The
    fallback guarantees that you can at least try something if you get
    data that you don't expect.
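
    A minimal C sketch of the technique, under stated assumptions: lines
    are separated by '\n', and the names is_valid_utf8 and
    count_utf8_lines are invented for this illustration. It only
    classifies each line; decoding the lines that fail with a more
    forgiving encoding is left to the caller.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the len bytes at s are well-formed UTF-8, else 0.
 * Rejects overlong forms, surrogates and values above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0, k, need;
    unsigned long cp, min;

    while (i < len) {
        unsigned char c = s[i];

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;              /* stray continuation or bad lead byte */

        if (len - i <= need)
            return 0;               /* sequence is cut off */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;           /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}

/* Splits buf into '\n'-separated lines with memchr and counts how many
 * are valid UTF-8. Stores the total line count in *total if non-NULL. */
size_t count_utf8_lines(const unsigned char *buf, size_t len, size_t *total)
{
    size_t ok = 0, lines = 0, pos = 0;

    while (pos < len) {
        const unsigned char *nl = memchr(buf + pos, '\n', len - pos);
        size_t n = nl ? (size_t)(nl - (buf + pos)) : len - pos;

        if (is_valid_utf8(buf + pos, n))
            ok++;
        lines++;
        pos += n + (nl ? 1 : 0);
    }
    if (total)
        *total = lines;
    return ok;
}
```

    Splitting on the 0x0A byte first is safe for ASCII-compatible
    encodings because UTF-8 never uses bytes below 0x80 inside a
    multibyte sequence, so one bad line does not poison the rest of the
    data.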

    You may not be familiar with this technique but that doesn't mean it's
    bad. It's good and useful in many situations.

    Mike
  • On 10 Jun 2008, at 05:12, Mark Munz wrote:

    > Just wishing for the problem to go away or blaming external criteria
    > will almost guarantee that nothing gets done. Filing bugs is how you,
    > the developer, communicate your needs to Apple.

    Since ICU is open source, the other productive thing to do would be to
    give the ICU folks a hand at writing whatever bits of gunk are
    required by Apple.

    Cheers,

    Chris
  • On 9 Jun '08, at 10:38 PM, Michael Ash wrote:

    > It's perfectly possible to write safe code that calls C
    > str functions. My code is no more vulnerable than the next man's. You
    > can call things like strnstr, pass the length of the NSData you're
    > working on, and there is exactly zero risk of anything.

    Sure, and it's perfectly possible to shave with a blade without
    cutting yourself; that doesn't mean it doesn't happen, though :/ What
    you're saying is "if you do everything right, there's zero risk of it
    being wrong", which is a tautology. The point is that people can and
    do make mistakes when working with C string APIs (even the "n" ones).

    > No, it's not. A common technique is to use C string APIs to find line
    > endings, then try the full line as UTF-8. If it fails, then you can
    > fall back on a more forgiving encoding.

    Yes, I do try UTF-8 first. Sorry, I was being brief in the previous
    message, describing only the _fallback_ if UTF-8 parsing fails.

    I'm not sure why you would want to use C APIs to look for line endings
    first, though?

    —Jens
  • On Tue, Jun 10, 2008 at 8:20 AM, Jens Alfke <jens...> wrote:
    >
    > On 9 Jun '08, at 10:38 PM, Michael Ash wrote:
    >
    >> It's perfectly possible to write safe code that calls C
    >> str functions. My code is no more vulnerable than the next man's. You
    >> can call things like strnstr, pass the length of the NSData you're
    >> working on, and there is exactly zero risk of anything.
    >
    > Sure, and it's perfectly possible to shave with a blade without cutting
    > yourself; that doesn't mean it doesn't happen, though :/ What you're saying
    > is "if you do everything right, there's zero risk of it being wrong", which
    > is a tautology. The point is that people can and do make mistakes when
    > working with C string APIs (even the "n" ones).

    This is true but meaningless. People can and do make mistakes with
    *everything*. The C string APIs don't have a particularly special
    place as far as security vulnerabilities go.

    >> No, it's not. A common technique is to use C string APIs to find line
    >> endings, then try the full line as UTF-8. If it fails, then you can
    >> fall back on a more forgiving encoding.
    >
    > Yes, I do try UTF-8 first. Sorry, I was being brief in the previous message,
    > describing only the _fallback_ if UTF-8 parsing fails.
    >
    > I'm not sure why you would want to use C APIs to look for line endings
    > first, though?

    When working with streaming data, you need to find a delimiter to
    safely cut the stream before trying UTF-8, because if your chunk of
    data ends in the middle of a UTF-8 code word (or whatever it's
    called), then the result will be invalid UTF-8 even if the stream
    as a whole is valid UTF-8. You could write a UTF-8 parser to find good
    cut points, but it's much easier when working with a line-oriented
    protocol to just look for CRLF.
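
    Looking for CRLF is a safe way to cut because every byte of a multi-byte UTF-8 sequence has its high bit set, so 0x0D and 0x0A can never occur inside one. A minimal sketch of that cut in C, assuming a buffer of received bytes with a known length; the function name `first_line_length` is hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Returns the length of the first complete line in buf, including its
 * terminating CRLF, or 0 if the first len bytes hold no complete line yet.
 * Hypothetical helper for illustration only. */
static size_t first_line_length(const char *buf, size_t len)
{
    const char *p = buf;
    size_t remaining = len;

    while (remaining >= 2) {
        /* Skip the final byte: a CR there has no LF after it yet. */
        const char *cr = memchr(p, '\r', remaining - 1);
        if (cr == NULL)
            return 0;
        if (cr[1] == '\n')
            return (size_t)(cr - buf) + 2;

        /* Bare CR: continue the search just past it. */
        remaining -= (size_t)(cr - p) + 1;
        p = cr + 1;
    }
    return 0;
}
```

    A caller would hand the first `first_line_length(buf, len)` bytes to the decoder and keep the rest in the buffer until more data arrives, so a multi-byte character split across two reads is never decoded in halves.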

    Mike
  • On 10 Jun 2008, at 15:16, Chris Ridd wrote:

    > On 10 Jun 2008, at 05:12, Mark Munz wrote:
    >
    >> Just wishing for the problem to go away or blaming external criteria
    >> will almost guarantee that nothing gets done. Filing bugs is how you,
    >> the developer, communicate your needs to Apple.
    >
    > Since ICU is open source, the other productive thing to do would be
    > to give the ICU folks a hand at writing whatever bits of gunk are
    > required by Apple.

    "The ICU folks" actually includes Apple, since they have at least one
    person working on ICU (IIRC ICU was one of the results of Apple's
    collaboration with IBM on Taligent).  As I understand it, the main
    reason it isn't fully exposed is that ICU is primarily a C++ library
    and there have been, and continue to be, binary compatibility issues
    with C++ APIs.

    I think Apple is likely to add CFRegularExpression and a bridged
    NSRegularExpression at some point, and very probably a load of CF/
    NSString and NSScanner APIs to go with them, but it isn't a trivial
    amount of work and I imagine they will want to think it through
    carefully before deciding on the API, not to mention the supported
    regexp syntax(es).  By contrast, most of the regexp frameworks
    currently available are thin wrappers around an existing regular
    expression engine and have been knocked up quickly to fill a need.
    More thought needs to go into a system API and since CFStrings can
    exist in a number of different internal storage formats it may even
    make sense for Apple to do its own regexp implementation to achieve
    the best performance.

    Anyway, all of that is up to them.  As others have said, filing bug
    reports is the way to push for this work if you want it done.  Whether
    or not working on ICU will make things happen faster is something that
    only people inside Apple could tell us.  My guess is that it won't
    make any difference, but it's just a guess.

    Kind regards,

    Alastair.

    --
    http://alastairs-place.net