Re: Cocoa-dev Digest, Vol 5, Issue 1470

  • > On Fri, Aug 15, 2008 at 10:53 PM, John Joyce
    > <dangerwillrobinsondanger...> wrote:
    >> Right now, I'm toying with using Flex/Lex in a Cocoa project.
    >> Unfortunately, I don't see a reliable or easy way to handle NSStrings
    >> correctly all the time with Flex.
    >> Does anybody have any suggestions for such text handling and reliable
    >> unicode aware regexes?
    >> I'm seriously not interested in implementing such details in C with
    >> Flex.
    >> Flex is fast and cool for that, but if it's going to be stupidly
    >> difficult to use reliably with other languages on a mac, it's not
    >> a good idea for me.
    >
    > Depending on exactly what you need, unicode awareness can be fairly
    > straightforward.
    >
    > Commonly, unicode in regexes is only needed to pass through
    > undifferentiated blobs of text, with ASCII delimiters. For example,
    > imagine parsing a CSV file which potentially has unicode text inside
    > the quotes. For this case, you can convert the file to UTF-8, and then
    > constructs like . will accept them. All non-ASCII characters in UTF-8
    > are represented as bytes 128-255, so if you just pass those through
    > then you'll be fine. But be aware of some potential problem areas:
    >
    > - Each non-ASCII character will be more than one byte, and flex will
    > think of it as more than one character. Write your regexes
    > accordingly. In particular, avoid length limits on runs of arbitrary
    > characters, and avoid using non-ASCII characters directly in your
    > regex.
    >
    > - It's very difficult to split UTF-8 strings correctly. If you
    > encounter a run of non-ASCII characters, ensure that you follow that
    > run through the end, until you get back to ASCII. Don't have a regex
    > that stops in the middle of it and then expects your code to be able
    > to do something useful with it.
    >
    > - If you need to do something with non-ASCII characters besides read
    > them in one side and write them out the other, for example doing
    > something special with all accented characters, then Flex is probably
    > not the right answer.
    >
    > Besides this it ought to be pretty straightforward. Since Flex just
    > passes your code straight through to the compiler, you can write
    > Objective-C in the actions (as long as you compile the result as
    > Objective-C, of course!), convert the text from UTF-8 back to an
    > NSString, and take things from there.
    >
    > Mike
    >
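
To make the byte-versus-character point in the quoted list concrete, here is a minimal C sketch of what a byte-oriented scanner such as Flex sees when handed UTF-8. The function names are hypothetical, not part of Flex or of any library, and the code assumes the input is valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a UTF-8 string by skipping
   continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: starting at offset i (which must not exceed
   strlen(s)), follow a run of non-ASCII bytes (128-255) to its end and
   return the offset of the first ASCII byte, or of the terminator. */
size_t non_ascii_run_end(const char *s, size_t i)
{
    assert(s != NULL);
    while (s[i] != '\0' && (unsigned char)s[i] >= 0x80)
        i++;
    return i;
}
```

For the six-byte string `"a\xC3\xA9\xC3\xA8" "b"` (a, e-acute, e-grave, b), `strlen` reports 6 while `utf8_code_points` reports 4, which is why length limits written in a Flex regex count bytes rather than characters. Calling `non_ascii_run_end` at offset 1 returns 5, the position of the `b`, so the whole non-ASCII run is consumed before anything tries to split it.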

    Thanks to all.
    Mike, your answer especially made it clear that Flex is not going
    to do what I would like it to do, at least not without a lot of
    work that might be silly.
    Certainly, I could extract ranges of strings that are within the
    ranges of ASCII and apply rules to them and then lump other stuff into
    a separate group, but I'd like to have more control.

    If I were willing to just do ASCII, it would be a wonderful thing,
    since Flex is so fast.
    I understand well enough how difficult it can be to establish rules
    for unicode strings when there are a lot of semantics possible
    depending on the language.
    In my case, I am mainly interested in working with Japanese text.
    The main difficulty there is the general lack of white space to
    rely on; beyond that, it is mostly a matter of a large character set.
    I want to be able to use Japanese strings as tokens throughout
    something.
    I seriously wonder how sophisticated syntax highlighting in Xcode
    really is... it does handle things quite well!

    As cool and wonderful as Flex is, it just isn't going to be reliable
    for what I want to do.
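
For anyone who lands on this thread later: a fixed Japanese token can at least be located in UTF-8 text with plain byte comparison, because UTF-8 lead bytes and continuation bytes never overlap, so a match on a valid token cannot start in the middle of another character. The C sketch below is only an illustration under that assumption. `count_token` is a made-up name, the example token is the hiragana word "moshi" written as hex escapes (U+3082 U+3057), and it does nothing about word segmentation, which remains the hard part for text without white space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count non-overlapping occurrences of a UTF-8
   token in UTF-8 text using plain byte comparison (strstr).
   Assumes both strings are valid, NUL-terminated UTF-8. */
size_t count_token(const char *text, const char *token)
{
    assert(text != NULL && token != NULL);
    size_t token_len = strlen(token);
    size_t count = 0;

    if (token_len == 0)
        return 0;

    for (const char *p = strstr(text, token);
         p != NULL;
         p = strstr(p + token_len, token)) {
        count++;
    }
    return count;
}
```

With the token `"\xE3\x82\x82\xE3\x81\x97"`, a text containing it twice between ASCII delimiters yields a count of 2, and a text containing only the first character of the token yields 0.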