Two Questions About Parsing a String

  • Hi,

    I'm looking for some help parsing a string from a file.

    Firstly, getting the string is causing some issues. I read the String
    Programming Guide for Cocoa, and got this:

    NSString *path = ...;
    NSData *data = [NSData dataWithContentsOfFile:path];

    // assuming data is in UTF8
    NSString *string = [NSString stringWithUTF8String:[data bytes]];

    along with a warning that you must not use:

      stringWithContentsOfFile:

    So I tried to do as I was told, but with the first example, getting a
    string from the data succeeds only some of the time. When I start the
    application, my first attempt to load the file may or may not result
    in a string being created; if not, repeatedly trying to load the file
    eventually works and I get the string. I'm not sure what the encoding
    is (I'm trying to create a .obj reader), but setting it to UTF-8 in
    Xcode doesn't help.
    So I took a peek at the dark side and tried the soon-to-be-deprecated
    stringWithContentsOfFile:, which works fine all the time.
    Any thoughts on why the first example might be failing randomly?

    My second question is NSScanner related. Going through the .obj file,
    I have managed to get a system going where it scans for key characters
    (e.g. "v" for vertex, "#" for commented text etc). Roughly:

    while ([theScanner isAtEnd] == NO)
    {
      // Skip ahead to the next key character, then read it.
      [theScanner scanUpToCharactersFromSet:vCharacters intoString:NULL];
      [theScanner scanCharactersFromSet:vCharacters intoString:&testString];

      if ([testString isEqualToString:@"#"]) {
        // Comment: consume the rest of the line.
        [theScanner scanUpToString:@"\n" intoString:&dumpString];
        NSLog(@"dumpString is %@", dumpString);
      }
      else if ([testString isEqualToString:@"o"]) {
        // Object name: the rest of the line.
        [theScanner scanUpToString:@"\n" intoString:&theObjectName];
        NSLog(@"name: %@", theObjectName);
      }
      else if ([testString isEqualToString:@"v"]) {
        // Vertex: three floats.
        [theScanner scanFloat:&xVert];
        [theScanner scanFloat:&yVert];
        [theScanner scanFloat:&zVert];
        NSLog(@"Vertex %i is: x = %f, y = %f, z = %f", ++i, xVert, yVert, zVert);
      }
    }

    However, the files also include identifiers such as "usemtl", which
    could appear at any time. So any ideas how you go about searching for
    a set of characters, and a set of strings simultaneously? i.e. how do
    I search for the characters without momentarily ignoring the strings
    or vice versa? This seems to be quite straightforward with fscanf, but
    it seems a bit odd going to C, when I'm trying to do this in Objective-
    C.

    Any help with either question would be much appreciated.

    Thank you,

    Ian.
  • On 22-Jul-08, at 11:09 AM, Ian Jackson wrote:

    > I'm looking for some help parsing a string from a file.
    >
    > Firstly, getting the string is causing some issues. I read the
    > String Programming Guide for Cocoa, and got this:
    >
    > NSString *path = ...;
    > NSData *data = [NSData dataWithContentsOfFile:path];
    >
    > // assuming data is in UTF8
    > NSString *string = [NSString stringWithUTF8String:[data bytes]];
    >
    > along with a warning that you must not use:
    >
    > stringWithContentsOfFile:
    >
    > So I tried to do as I was told, but with the first example, getting
    > a string from the data succeeds only some of the time. When I start
    > the application, my first attempt to load the file may or may not
    > result in a string being created; if not, repeatedly trying to load
    > the file eventually works and I get the string. I'm not sure what
    > the encoding is (I'm trying to create a .obj reader), but setting it
    > to UTF-8 in Xcode doesn't help.
    > So I took a peek at the dark side and tried the
    > soon-to-be-deprecated stringWithContentsOfFile:, which works fine
    > all the time.
    > Any thoughts on why the first example might be failing randomly?
    >
    >

    Have you tried the non-deprecated
    stringWithContentsOfFile:usedEncoding:error: or
    stringWithContentsOfFile:encoding:error:? The former attempts to
    determine the encoding used for the file and returns it by
    reference. Both also report errors, so you can determine why
    your files may not be read successfully.

    Cheers, Patrick
  • On 22 Jul '08, at 2:09 AM, Ian Jackson wrote:

    > NSString *path = ...;
    > NSData *data = [NSData dataWithContentsOfFile:path];
    > // assuming data is in UTF8
    > NSString *string = [NSString stringWithUTF8String:[data bytes]];

    The reason this doesn't work is that +stringWithUTF8String: expects a
    NUL-terminated C string, but [data bytes] just returns the raw
    contents of the data block, with no terminator. So the string factory
    method will keep reading past the end of the data until it finds a
    zero byte in whatever happens to be in memory there. That means it'll
    read garbage past the end of the string, and if that garbage doesn't
    look like valid UTF-8, it'll fail. Since the memory after the buffer
    differs from run to run, the failure looks random.

    The correct call to make would be
    [[NSString alloc] initWithData: data encoding: NSUTF8StringEncoding]
    although as Patrick already replied, the best way to read a string
    from a file is +stringWithContentsOfFile:usedEncoding:error:, which
    will attempt to determine the encoding.
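
    The terminator issue can be illustrated in plain C (which also
    compiles as Objective-C). This is only a sketch: copy_as_c_string is
    a hypothetical helper, not part of Cocoa, and it assumes the bytes
    themselves contain no embedded zero byte. It shows what any
    C-string API needs before it can safely read a raw buffer of known
    length.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `len` raw bytes into a fresh buffer and
   append the terminating NUL that C-string APIs rely on.
   Returns NULL if allocation fails; the caller frees the result. */
char *copy_as_c_string(const void *bytes, size_t len)
{
    char *s = malloc(len + 1);      /* one extra byte for the NUL */
    if (s == NULL)
        return NULL;
    memcpy(s, bytes, len);
    s[len] = '\0';                  /* reads now stop at the end of the data */
    return s;
}
```

    In Cocoa itself the copy is unnecessary, because the initWithData:
    call above takes the length from the NSData object.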

    > However, the files also include identifiers such as "usemtl", which
    > could appear at any time. So any ideas how you go about searching
    > for a set of characters, and a set of strings simultaneously? i.e.
    > how do I search for the characters without momentarily ignoring the
    > strings or vice versa? This seems to be quite straightforward with
    > fscanf, but it seems a bit odd going to C, when I'm trying to do
    > this in Objective-C.

    This is beyond what NSScanner can do. You have a number of options,
    like scanning the string character by character using a 'for' loop,
    using a parser generator like ANTLR, or simply calling fscanf.
    (There's nothing wrong with using C APIs, when appropriate.)
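
    As a rough sketch of the C route, the function below handles one
    line at a time with sscanf (rather than fscanf on the file directly)
    and dispatches on the first whitespace-delimited token, so
    single-character keys like "v" and longer keywords like "usemtl" go
    through the same path. The names parse_obj_line, ObjRecord and
    ObjKind are made up for illustration, and only the keywords
    mentioned in this thread are handled.

```c
#include <stdio.h>
#include <string.h>

typedef enum { OBJ_COMMENT, OBJ_NAME, OBJ_VERTEX, OBJ_USEMTL, OBJ_OTHER } ObjKind;

typedef struct {
    ObjKind kind;
    float x, y, z;       /* valid when kind == OBJ_VERTEX */
    char text[128];      /* object or material name, when present */
} ObjRecord;

/* Hypothetical helper: classify a single line of a .obj file. */
ObjRecord parse_obj_line(const char *line)
{
    ObjRecord rec = { OBJ_OTHER, 0.0f, 0.0f, 0.0f, "" };
    char keyword[16] = "";
    int used = 0;

    /* Read the first token; %n records how many characters were consumed. */
    if (sscanf(line, " %15s%n", keyword, &used) != 1)
        return rec;                          /* blank line */

    const char *rest = line + used;

    if (keyword[0] == '#') {
        rec.kind = OBJ_COMMENT;              /* rest of the line is ignored */
    } else if (strcmp(keyword, "v") == 0) {
        if (sscanf(rest, "%f %f %f", &rec.x, &rec.y, &rec.z) == 3)
            rec.kind = OBJ_VERTEX;
    } else if (strcmp(keyword, "o") == 0) {
        if (sscanf(rest, " %127[^\r\n]", rec.text) == 1)
            rec.kind = OBJ_NAME;
    } else if (strcmp(keyword, "usemtl") == 0) {
        if (sscanf(rest, " %127[^\r\n]", rec.text) == 1)
            rec.kind = OBJ_USEMTL;
    }
    return rec;                              /* unknown keywords stay OBJ_OTHER */
}
```

    Each line could be fed to it from fgets, or from the lines of an
    NSString converted with UTF8String.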

    —Jens
  • Thanks for your responses.

    Looks like stringWithContentsOfFile:encoding:error: does what I need.

    Jens, at least I know not to pursue the NSScanner thing any further in
    this case.

    Thanks,

    Ian.

    On 23/07/2008, at 3:39 AM, Jens Alfke wrote:

    > This is beyond what NSScanner can do. You have a number of options,
    > like scanning the string character by character using a 'for' loop,
    > using a parser generator like ANTLR, or simply calling fscanf.
    > (There's nothing wrong with using C APIs, when appropriate.)
    >
    > —Jens