Using Flex/Lex in a Cocoa project

  • Right now, I'm toying with using Flex/Lex in a Cocoa project.
    Unfortunately, I don't see a reliable or easy way to handle NSStrings
    correctly all the time with Flex.
    Does anybody have any suggestions for such text handling and reliable
    unicode aware regexes?
    I'm seriously not interested in implementing such details in C with
    Flex.
    Flex is fast and cool for that, but if it's going to be stupidly
    difficult to use reliably with other languages on a mac, it's not a
    good idea for me.
  • Just a thought, I haven't tried it, but the Core Foundation code is
    pure C, so you could send the NSString as a CFString, which is "toll-
    free bridged", which should mean you don't even have to make a cast
    (though, again, I haven't worked with this), but basically, you should
    be able to create a pure C function that invokes the lexer, and write
    the flex code to use the Core Foundation libraries, accepting a
    CFString, and I think that's all you'd need.

    Dustin
    KC9MEL

  • On 16 Aug 2008, at 1:56 pm, Dustin Robert Kick wrote:

    > Just a thought, I haven't tried it, but the Core Foundation code is
    > pure C, so you could send the NSString as a CFString, which is "toll-
    > free bridged", which should mean you don't even have to make a cast
    > (though, again, I haven't worked with this), but basically, you
    > should be able to create a pure C function that invokes the lexer,
    > and write the flex code to use the Core Foundation libraries,
    > accepting a CFString, and I think that's all you'd need.

    It's a bit unclear what you're asking. Is it about using unicode with
    Flex, or using Flex with Cocoa?

    Since Obj-C is C plus a few bells and whistles, there's no difficulty
    in calling Flex from Cocoa - the comment about using CFString instead
    of NSString doesn't seem particularly relevant - since it's just a
    pointer you can pass it into C code either way. Where the lexer needs
    to access the NSString, you could do that via an external wrapper
    function compiled as Obj-C, or just compile Flex's generated output as
    a .m file instead of .c and call NSString's Obj-C methods directly.
    That should work, not that I've tried it...

    I have used BISON with Cocoa in a similar way and didn't run into any
    particular difficulties.

    hth,

    Graham
  • On Fri, Aug 15, 2008 at 10:53 PM, John Joyce
    <dangerwillrobinsondanger...> wrote:
    > Right now, I'm toying with using Flex/Lex in a Cocoa project.
    > Unfortunately, I don't see a reliable or easy way to handle NSStrings
    > correctly all the time with Flex.
    > Does anybody have any suggestions for such text handling and reliable
    > unicode aware regexes?
    > I'm seriously not interested in implementing such details in C with Flex.
    > Flex is fast and cool for that, but if it's going to be stupidly difficult
    > to use reliably with other languages on a mac, it's not a good idea for me.

    Depending on exactly what you need, unicode awareness can be fairly
    straightforward.

    Commonly, unicode in regexes is only needed to pass through
    undifferentiated blobs of text, with ASCII delimiters. For example,
    imagine parsing a CSV file which potentially has unicode text inside
    the quotes. For this case, you can convert the file to UTF-8, and then
    constructs like "." will accept them. Every non-ASCII character in
    UTF-8 is encoded as a sequence of bytes in the range 128-255, so if
    you just pass those bytes through then you'll be fine. But be aware of
    some potential problem areas:

    - Each non-ASCII character will be more than one byte, and flex will
    think of it as more than one character. Write your regexes
    accordingly. In particular, avoid length limits on runs of arbitrary
    characters, and avoid using non-ASCII characters directly in your
    regex.

    - It's very difficult to split UTF-8 strings correctly. If you
    encounter a run of non-ASCII characters, ensure that you follow that
    run through the end, until you get back to ASCII. Don't have a regex
    that stops in the middle of it and then expects your code to be able
    to do something useful with it.

    - If you need to do something with non-ASCII characters besides read
    them in one side and write them out the other, for example doing
    something special with all accented characters, then Flex is probably
    not the right answer.

    Besides this it ought to be pretty straightforward. Since Flex just
    passes your code straight through to the compiler, you can write
    Objective-C in the actions (as long as you compile the result as
    Objective-C, of course!), convert the text from UTF-8 back to an
    NSString, and take things from there.

    Mike
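
To make the byte-versus-character caveat above concrete, here is a minimal C sketch. It is not from the thread, and the names `utf8_code_points` and `split_on_comma` are made up for illustration. It assumes the input is valid UTF-8 and ignores CSV quoting rules. It relies on the fact that every byte of a multi-byte UTF-8 sequence is 128 or above, so splitting on an ASCII comma can never land inside a character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a valid UTF-8 string by skipping
   continuation bytes (bit pattern 10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Split line on ASCII commas. Stores up to max_fields field start
   pointers and byte lengths; returns the number of fields stored.
   Non-ASCII bytes are never inspected, only passed through. */
size_t split_on_comma(const char *line, const char *starts[],
                      size_t lens[], size_t max_fields)
{
    size_t count = 0;
    const char *field = line;
    assert(line != NULL && starts != NULL && lens != NULL);
    while (count < max_fields) {
        const char *comma = strchr(field, ',');
        starts[count] = field;
        lens[count] = comma ? (size_t)(comma - field) : strlen(field);
        count++;
        if (comma == NULL)
            break;
        field = comma + 1;
    }
    return count;
}
```

A field holding a single accented letter comes back with a byte length of 2, which is the same "more than one character" view that a byte-oriented lexer has of it.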
  • to avoid the splitting problem

    (c < 128) ? "%c" : "\\u%04x", c);

    --
    -mmw
  • On Aug 18, 2008, at 3:40 PM, mm w wrote:

    > to avoid the splitting problem
    >
    > (c < 128) ? "%c" : "\\u%04x", c);

    I'm not sure what this solves.

    Per Michael's e-mail below, this is indeed a difficult problem.  UTF-8
    is just a particular scheme to store Unicode strings.  Operating on
    individual bytes in such streams will most likely not make any sense.

    What I would do is pick some normalized form and operate on that
    data.  For a recent feature at my day job, we normalized all input CSV
    files to UTF-16BE.  We were able to handle all of our customer data so
    far.  The final solution still isn't 100% Unicode-savvy (e.g. it does
    crap-out with surrogate pairs), but we have unit tests to expose/
    document such limitations. And, customer data doesn't yet have such
    things.

    > On Sat, Aug 16, 2008 at 7:43 AM, Michael Ash <michael.ash...>
    > wrote:
    >> - It's very difficult to split UTF-8 strings correctly. If you
    >> encounter a run of non-ASCII characters, ensure that you follow that
    >> run through the end, until you get back to ASCII. Don't have a regex
    >> that stops in the middle of it and then expects your code to be able
    >> to do something useful with it.
    >>

    ___________________________________________________________
    Ricky A. Sharp        mailto:<rsharp...>
    Instant Interactive(tm)  http://www.instantinteractive.com
  • if you knew flex you could understand

    --
    -mmw
  • On Aug 18, 2008, at 7:01 PM, <cocoa-dev-request...> wrote:

    >
    > to avoid the splitting problem
    >
    > (c < 128) ? "%c" : "\\u%04x", c);
    Not quite sure what this is doing. I see it's checking for the ASCII
    range with ( c < 128 ). The conditional is obvious, but what is the
    rest doing exactly? It seems to return a char if it is ASCII, and some
    sort of escaped version if it is beyond the ASCII range...?
    Particularly there, I'm not sure what that results in.

    Basically, I only need to do anything based on characters that are in
    ASCII now, but I'd like to allow other ranges in the text files
    without worry. Browsing around, I've seen where basically, everyone is
    hoping somebody else will modernize lex and yacc with a clever
    algorithm that reduces the overhead. Personally, I think that in many
    cases, this should be no problem with the speed and capability of
    contemporary computers, but it could still be a drag.

    I have been looking at simply building an NSString of the same length
    with ranges of non-ASCII subbed out with some other character, just to
    do the lexing, then apply the results of the lex to the original
    NSString.
    For speed, it may even make sense to simply have two NSStrings going,
    one that is the real thing, the other that auto-substitutes anything
    non-ASCII. For this, my question to all is, what ASCII character would
    be good for the substitution without messing up the regexes? I'm
    considering some of the unused control characters in the lower ranges,
    but I'm a little scared to see what will happen...
    Any suggestions on this idea?

    Cheers,
    JJ
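
The substitution idea above can be sketched at the byte level in C. This is a hypothetical variant, not the poster's code: it works on a UTF-8 buffer rather than an NSString, and `make_ascii_shadow` is an invented name. The placeholder is left as a parameter because the thread does not settle on one; it only has to be a character the lexer rules treat as ordinary text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of utf8 in which every byte >= 0x80 is
   replaced by placeholder. The copy has the same byte length, so an
   offset found in the shadow is also a valid offset into the original.
   The caller frees the result; returns NULL if allocation fails. */
char *make_ascii_shadow(const char *utf8, char placeholder)
{
    assert(utf8 != NULL);
    assert((unsigned char)placeholder < 0x80);  /* placeholder must be ASCII */

    size_t len = strlen(utf8);
    char *shadow = malloc(len + 1);
    if (shadow == NULL)
        return NULL;

    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)utf8[i];
        shadow[i] = (b < 0x80) ? (char)b : placeholder;
    }
    shadow[len] = '\0';
    return shadow;
}
```

The lexer would then run over the shadow, and any token offsets it reports can be applied to the original buffer unchanged. Note that one non-ASCII character becomes two or more placeholder bytes, so rules that count characters would still be off.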
  • On Aug 18, 2008, at 8:01 PM, John Joyce wrote:

    >
    > On Aug 18, 2008, at 7:01 PM, <cocoa-dev-request...> wrote:
    >
    >>
    >> to avoid the splitting problem
    >>
    >> (c < 128) ? "%c" : "\\u%04x", c);
    > Not quite sure what this is doing.
    > I see it's checking for ASCII range
    > if ( c < 128 )
    > The conditional is obvious,
    > but what's the other doing exactly?
    > returning a char if it is ASCII, it seems,
    > and then some sort of escaped version if it is beyond ASCII range...?
    > Particularly there, I'm not sure what that results in.

    That was my question too.  If operating on a UTF-8 stream, this is
    going to do all kinds of weird stuff.

    For example, for the input string LATIN SMALL LETTER E WITH ACUTE
    (U+00E9), you'll have a UTF-8 byte sequence of 0xC3 0xA9.

    But the above will turn that UTF-8 sequence of bytes into this string:

    "\u00C3\u00A9"

    This string now represents:

    LATIN CAPITAL LETTER A WITH TILDE (U+00C3) followed by COPYRIGHT SIGN
    (U+00A9)

    ___________________________________________________________
    Ricky A. Sharp        mailto:<rsharp...>
    Instant Interactive(tm)  http://www.instantinteractive.com
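
The effect described above can be reproduced with a short C sketch. It assumes the posted fragment was the argument list of a printf-style call applied to one byte at a time, which the thread never confirms; `escape_bytes` is an invented helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Escape a string one byte at a time: ASCII bytes are copied, every
   other byte is written as \uXXXX using the byte value itself.
   Returns the number of characters written (excluding the NUL). */
size_t escape_bytes(const char *in, char *out, size_t out_size)
{
    size_t used = 0;
    assert(in != NULL && out != NULL && out_size > 0);
    out[0] = '\0';
    for (; *in != '\0'; in++) {
        unsigned char c = (unsigned char)*in;
        int n;
        if (c < 128)
            n = snprintf(out + used, out_size - used, "%c", c);
        else
            n = snprintf(out + used, out_size - used, "\\u%04x", (unsigned)c);
        if (n < 0 || (size_t)n >= out_size - used) {
            out[used] = '\0';  /* drop a partially written escape */
            break;
        }
        used += (size_t)n;
    }
    return used;
}
```

Feeding it the two UTF-8 bytes of U+00E9 produces two separate escapes, one per byte (in lowercase hex, because of `%04x`), rather than a single escape for the original character.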
  • On Aug 18, 2008, at 8:12 PM, Ricky Sharp wrote:

    >
    > That was my question too.  If operating on a UTF-8 stream, this is
    > going to do all kinds of weird stuff.
    >
    > For example, for the input string LATIN SMALL LETTER E WITH ACUTE
    > (U+00E9), you'll have a UTF-8 byte sequence of 0xC3 0xA9.
    >
    > But the above will turn that UTF-8 sequence of bytes into this string:
    >
    > "\u00C3\u00A9"
    >
    > This string now represents:
    >
    > LATIN CAPITAL LETTER A WITH TILDE (U+00C3) followed by COPYRIGHT
    > SIGN (U+00A9)
    >
    > ___________________________________________________________
    > Ricky A. Sharp        mailto:<rsharp...>

    I wonder if it wouldn't make sense to just start trying to build some
    new form of flex in Objective-C... so that it uses NSString and
    NSMutableString?
    I'm looking at the Flex source code now... in true GNU fashion, it is
    well documented, but somewhat terse C...
  • On 19 Aug 2008, at 11:53 am, John Joyce wrote:

    > I wonder if it wouldn't make sense to just start trying to build
    > some new form of flex in Objective-C... so that it uses NSString and
    > NSMutableString?
    > I'm looking at the Flex source code now... in true GNU fashion, it
    > is well documented, but somewhat terse C...

    And you know about NSScanner, right?

    Graham
  • On Mon, Aug 18, 2008 at 4:55 PM, Ricky Sharp <rsharp...> wrote:
    > What I would do is pick some normalized form and operate on that data.  For
    > a recent feature at my day job, we normalized all input CSV files to
    > UTF-16BE.  We were able to handle all of our customer data so far.  The
    > final solution still isn't 100% Unicode-savvy (e.g. it does crap-out with
    > surrogate pairs), but we have unit tests to expose/document such
    > limitations. And, customer data doesn't yet have such things.

    Note that depending on what kind of results you want, even if all of
    your data is within the BMP, this *still* won't save you.

    As a really basic example, consider a simple, obvious character like
    é. (That's an e with an acute accent on it if you're having unicode
    trouble in your e-mail client.) That can be represented as two
    separate unicode code points, a plain old ASCII e followed by a
    combining accent mark. If you should happen to split the string on the
    accent mark, such that the e goes into the first half and the
    combining accent mark goes into the second half, you get a really
    unintuitive result. What appears to the user to be a single character
    gets suddenly blown in two. Worse, if you happen to insert a string in
    the middle, you could end up applying that acute accent to some
    *other* letter instead.

    And if you think this is bad, you should see how Unicode deals with Korean.

    If you're using NSString, you can find good places to split using the
    -rangeOfComposedCharacterSequenceAtIndex: method. I believe that it
    will also deal with surrogate pairs, not only "normal" composed
    character sequences.

    Ultimately if you're doing any manipulation of Unicode, some large
    amount of knowledge about Unicode needs to be in the system somewhere.
    If your code is running on a Mac then you can use the knowledge that
    NSString has about Unicode to help out, sometimes. But alas, due to
    how Unicode is designed, there's simply no way to safely manipulate
    strings beyond very basic operations like concatenation unless you
    either make the code know a lot about Unicode or place overly strong
    constraints on the system, such as only splitting on line breaks or
    carriage returns (or commas).

    Yeah, the situation kind of sucks, but it's what we're stuck with.
    Thankfully Foundation and CoreFoundation do a lot to hide the messy,
    ugly details from us.

    Mike
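
For code that has to stay in plain C on UTF-8 bytes, the splitting hazard above can be illustrated with a deliberately limited sketch. It is an assumption-laden illustration, not a replacement for -rangeOfComposedCharacterSequenceAtIndex:. It only knows about UTF-8 continuation bytes and the Combining Diacritical Marks block (U+0300 to U+036F), and `safe_split_offset` is an invented name.

```c
#include <assert.h>
#include <stddef.h>

/* True for UTF-8 continuation bytes (bit pattern 10xxxxxx). */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* True if p starts a code point in U+0300..U+036F, which UTF-8
   encodes as CC 80..CC BF and CD 80..CD AF. */
static int starts_combining_mark(const unsigned char *p, size_t remaining)
{
    if (remaining < 2)
        return 0;
    if (p[0] == 0xCC)
        return p[1] >= 0x80 && p[1] <= 0xBF;
    if (p[0] == 0xCD)
        return p[1] >= 0x80 && p[1] <= 0xAF;
    return 0;
}

/* Move a desired split offset backwards until it is neither inside a
   multi-byte sequence nor directly in front of a combining mark. */
size_t safe_split_offset(const char *s, size_t len, size_t want)
{
    const unsigned char *u = (const unsigned char *)s;
    assert(s != NULL && want <= len);
    while (want > 0 && want < len &&
           (is_continuation(u[want]) ||
            starts_combining_mark(u + want, len - want))) {
        want--;
    }
    return want;
}
```

With the decomposed form of é (an ASCII e followed by U+0301), a split requested between the e and the accent is moved back in front of the e, so the accent cannot end up attached to a different letter. Other scripts and other combining ranges are not handled here.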
  • On Aug 18, 2008, at 9:23 PM, Graham Cox wrote:

    > And you know about NSScanner, right?
    >
    > Graham

    Yes. I'm looking at it.
    I'm also looking at how Smultron uses RegexKit.
  • On Aug 18, 2008, at 10:57 PM, Michael Ash wrote:

    > Note that depending on what kind of results you want, even if all of
    > your data is within the BMP, this *still* won't save you.
    >
    > As a really basic example, consider a simple, obvious character like
    > é. (That's an e with an acute accent on it if you're having unicode
    > trouble in your e-mail client.) That can be represented as two
    > separate unicode code points, a plain old ASCII e followed by a
    > combining accent mark. If you should happen to split the string on the
    > accent mark, such that the e goes into the first half and the
    > combining accent mark goes into the second half, you get a really
    > unintuitive result. What appears to the user to be a single character
    > gets suddenly blown in two. Worse, if you happen to insert a string in
    > the middle, you could end up applying that acute accent to some
    > *other* letter instead.

    Sorry, failed to mention that our UTF-16BE data was also normalized to
    pre-composed Unicode.  So this case was handled.

    You mentioned Korean (which I have yet to play around with), but for
    another grand ol' time, try Arabic.  You get into something called
    "positional variants".  But alas, that's outside the scope of this list.

    I think the moral of the story here is that when working with Unicode
    data, it's best to normalize such data and then ensure APIs operating
    on the data are Unicode savvy.

    Thankfully, as you've pointed out, the NSString etc. APIs shield folks
    from much of the gory details.

    ___________________________________________________________
    Ricky A. Sharp        mailto:<rsharp...>
    Instant Interactive(tm)  http://www.instantinteractive.com