sscanf() for NSStrings?

  • Is there a sscanf() equivalent for NSStrings, or do I have to convert
    them to char* 's ?
  • On Oct 28, 2006, at 1:25 PM, Henry Skelton wrote:

    > Is there a sscanf() equivalent for NSStrings, or do I have to
    > convert them to char* 's ?

    NSScanner.

        - Scott
  • On 28 Oct 2006, at 13:25, Henry Skelton wrote:

    > Is there a sscanf() equivalent for NSStrings, or do I have to
    > convert them to char* 's ?

    Converting is simply [theString UTF8String] (the zero-terminated
    buffer it returns is essentially autoreleased).

    ------------
    David Dunham  <dunham...>  http://www.pensee.com/dunham/
    "No matter how far you have gone on a wrong road, turn back." -
    Turkish proverb
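
    A minimal sketch of the C side of this approach, assuming the char *
    came from [theString UTF8String] and holds text such as "640x480".
    The helper name parse_size and the input format are illustrative
    assumptions, not something from the thread; the only point is that
    sscanf()'s return value should be checked:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical helper: parses "WIDTHxHEIGHT" (e.g. "640x480") from a
       C string such as the one returned by -[NSString UTF8String].
       Returns 1 on success, 0 if the text did not match. */
    int parse_size(const char *text, int *width, int *height)
    {
        assert(text != NULL && width != NULL && height != NULL);

        /* sscanf() returns the number of conversions it completed, so
           anything other than 2 means the input was malformed. */
        return sscanf(text, "%dx%d", width, height) == 2;
    }
    ```

    From Objective-C this would be called along the lines of
    parse_size([theString UTF8String], &w, &h) while theString is still
    alive.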
  • Not quite--sscanf works on C strings in the default C locale, which
    isn't always equivalent to UTF8. In fact, on Mac OS X, I think the
    default C locale is MacRoman, which is definitely *not* equivalent to
    UTF8.

    However, if you know all your strings only use ASCII characters, then
    it's safe to pass [theString UTF8String] to sscanf(), at least on Mac
    OS X. This assumption may not be valid on other systems, such as
    those using EBCDIC encoding (e.g. IBM System/360).

    Not that you're worried about being cross-platform compatible if
    you're using NSString objects...

    -Jonathan Grynspan
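
    One way to act on the "only ASCII characters" condition is to test
    the buffer before handing it to sscanf(). This is a sketch, and
    is_pure_ascii is a made-up name; it relies on the fact that UTF-8
    encodes every non-ASCII character using bytes above 0x7f:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Returns 1 if every byte of s is 7-bit ASCII, 0 otherwise. */
    int is_pure_ascii(const char *s)
    {
        assert(s != NULL);

        for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
            if (*p > 0x7f)
                return 0;   /* part of a multi-byte UTF-8 sequence */
        }
        return 1;
    }
    ```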

    On 28-Oct-06, at 5:26 PM, David Dunham wrote:

    >
    > On 28 Oct 2006, at 13:25, Henry Skelton wrote:
    >
    >> Is there a sscanf() equivalent for NSStrings, or do I have to
    >> convert them to char* 's ?
    >
    > Converting being simply [theString UTF8String] (this zero-
    > terminated buffer is essentially autoreleased).
    >
    > ------------
    > David Dunham  <dunham...>  http://www.pensee.com/dunham/
    > "No matter how far you have gone on a wrong road, turn back." -
    > Turkish proverb
    >
    > _______________________________________________
    > MacOSX-dev mailing list
    > <MacOSX-dev...>
    > http://www.omnigroup.com/mailman/listinfo/macosx-dev
  • On Oct 28, 2006, at 11:12 PM, Jonathan Grynspan wrote:

    > Not quite--sscanf works on C strings in the default C locale, which
    > isn't always equivalent to UTF8. In fact, on Mac OS X, I think the
    > default C locale is MacRoman, which is definitely *not* equivalent
    > to UTF8.

    I think the default C locale is *ASCII*, not MacRoman.  If you try to
    convert a character code over 0x7f from wide to multibyte using e.g.
    wcstombs(), then you just get a '?' if it's below 0x100, or, if it's
    over that, the conversion stops without finishing.  At least, that's
    what seems to happen on my machine.

    (This seems a little silly given that Terminal is UTF-8, and that the
    BSD filesystem APIs accept and return UTF-8 filenames, but it's what
    happens.)

    Here's a short test program if anyone is interested (I wrote out the
    character codes by hand to demonstrate that we're definitely using
    Unicode):

    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    #include <locale.h>

    int
    main (void)
    {
      /* "Mac OS X is <U+00A9> Apple <U+F8FF> <U+263A>\n" as code points */
      wchar_t wide[] = {
        0x4d, 0x61, 0x63, 0x20, 0x4f, 0x53, 0x20, 0x58, 0x20, 0x69, 0x73, 0x20,
        0xa9, 0x20, 0x41, 0x70, 0x70, 0x6c, 0x65, 0x20, 0xf8ff, 0x20, 0x263a,
        0x0a, 0x0000
      };
      char narrow[80];

      /* Output goes through puts() only: mixing wprintf() and puts() on
         stdout is undefined once the stream has an orientation. */

      /* First pass: the default "C" locale */
      memset (narrow, 0, sizeof (narrow));
      wcstombs (narrow, wide, sizeof (narrow) - 1);
      puts (narrow);

      /* Second pass: a UTF-8 locale */
      setlocale (LC_CTYPE, "en_GB.UTF-8");

      memset (narrow, 0, sizeof (narrow));
      wcstombs (narrow, wide, sizeof (narrow) - 1);
      puts (narrow);

      return 0;
    }

    Kind regards,

    Alastair.

    --
    http://alastairs-place.net
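
    The same check can be made without reading terminal output, by
    looking at what wcstombs() returns. The following is only a sketch:
    narrow_in_c_locale is a hypothetical helper, and exactly how a given
    C library treats unrepresentable characters in the "C" locale
    (failing outright versus substituting something) is implementation
    dependent.

    ```c
    #include <assert.h>
    #include <locale.h>
    #include <stdlib.h>
    #include <wchar.h>

    /* Hypothetical helper: converts wide to multibyte under the plain "C"
       locale. Returns the number of bytes written (excluding the
       terminator), or (size_t)-1 if the library reports that a character
       cannot be represented. */
    size_t narrow_in_c_locale(const wchar_t *wide, char *out, size_t out_size)
    {
        assert(wide != NULL && out != NULL && out_size > 0);

        setlocale(LC_CTYPE, "C");

        /* Leave room for the terminator, which wcstombs() omits when the
           output fills the buffer exactly. */
        size_t n = wcstombs(out, wide, out_size - 1);
        if (n == (size_t)-1) {
            out[0] = '\0';
            return n;
        }
        out[n] = '\0';
        return n;
    }
    ```

    Plain ASCII input should convert byte for byte; a code point such as
    0x263a has no representation in an ASCII-only locale, so an error
    return is the outcome to expect there.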
  • On Oct 28, 2006, at 16:47, Alastair Houghton wrote:

    > On Oct 28, 2006, at 11:12 PM, Jonathan Grynspan wrote:
    >
    >> Not quite--sscanf works on C strings in the default C locale,
    >> which isn't always equivalent to UTF8. In fact, on Mac OS X, I
    >> think the default C locale is MacRoman, which is definitely *not*
    >> equivalent to UTF8.
    >
    > I think the default C locale is *ASCII*, not MacRoman.  If you try
    > to convert a character code over 0x7f from wide to multibyte using
    > e.g. wcstombs(), then you just get a '?' if it's below 0x100, or,
    > if it's over that, the conversion stops without finishing.  At
    > least, that's what seems to happen on my machine.
    >
    > (This seems a little silly given that Terminal is UTF-8, and that
    > the BSD filesystem APIs accept and return UTF-8 filenames, but it's
    > what happens.)

    Is +[NSString defaultCStringEncoding] at all relevant to the default
    C locale?  On my systems (US) it's MacRoman, and that may be what
    Jonathan referred to.

    Adam
  • The default locale goes by the names "C" and "POSIX", and yes it is
    basically ASCII.  But this is no accident; it is required by UNIX
    conformance, and we can't change it.

    Ed

    On Oct 28, 2006, at 4:47 PM, Alastair Houghton wrote:

    > On Oct 28, 2006, at 11:12 PM, Jonathan Grynspan wrote:
    >
    >> Not quite--sscanf works on C strings in the default C locale,
    >> which isn't always equivalent to UTF8. In fact, on Mac OS X, I
    >> think the default C locale is MacRoman, which is definitely *not*
    >> equivalent to UTF8.
    >
    > I think the default C locale is *ASCII*, not MacRoman.  If you try
    > to convert a character code over 0x7f from wide to multibyte using
    > e.g. wcstombs(), then you just get a '?' if it's below 0x100, or,
    > if it's over that, the conversion stops without finishing.  At
    > least, that's what seems to happen on my machine.
    >
    > (This seems a little silly given that Terminal is UTF-8, and that
    > the BSD filesystem APIs accept and return UTF-8 filenames, but it's
    > what happens.)
  • On 29 Oct 2006, at 01:22, Adam R. Maxwell wrote:

    > Is +[NSString defaultCStringEncoding] at all relevant to the
    > default C locale?  On my systems (US) it's MacRoman, and that may
    > be what Jonathan referred to.

    That's the default for NSString/CFString, not the C library AFAIK.

    I think you're right, though; that's probably why Jonathan thought
    the C library defaulted to MacRoman.

    Kind regards,

    Alastair.

    --
    http://alastairs-place.net
  • On 28-Oct-06, at 8:32 PM, Alastair Houghton wrote:

    > On 29 Oct 2006, at 01:22, Adam R. Maxwell wrote:
    >
    >> Is +[NSString defaultCStringEncoding] at all relevant to the
    >> default C locale?  On my systems (US) it's MacRoman, and that may
    >> be what Jonathan referred to.
    >
    > That's the default for NSString/CFString, not the C library AFAIK.
    >
    > I think you're right, though, I think that's why Jonathan thought
    > the C library defaulted to MacRoman.

    That's the encoding for a C string generated with -[NSString cString]
    (which is a deprecated method), and it's also the encoding for C
    strings in the default "C" locale on Mac OS X, including functions in
    the C library. It exists so that you can do something like (for
    example) the following:

    - (BOOL)blorkString: (NSString *)aString
    {
      const char *cStr = [aString cStringUsingEncoding:
                            [NSString defaultCStringEncoding]];

      gzFile f = gzopen( cStr, "wb" );
      if ( f ) {
        /* write some data */
        return ( Z_OK == gzclose( f ) );
      }

      return NO;
    }

    That is, +[NSString defaultCStringEncoding] returns the encoding that
    C library functions expect. This is also why -[NSString
    UTF8String] and -[NSString fileSystemRepresentation] both exist:
    they're not the same. +[NSString defaultCStringEncoding] currently
    returns MacRoman, but may change in the future; that's why you
    retrieve the value by calling a method rather than using a constant
    value.

    Consider this: how do you know the encoding of the constant string
    "Hello world!"? It's the default encoding for the "C" locale; to
    create an NSString with it, you have to use +[NSString
    stringWithCString:encoding:], and you have to pass +[NSString
    defaultCStringEncoding] as the second argument. You might be tempted
    to use +[NSString stringWithCString:], but that's deprecated.
    Besides, it just calls +[NSString stringWithCString:encoding:] for you.

    Terminal.app interprets its output as UTF-8 by default, but that's
    because most apps you run via Terminal.app are UNIX ports, and they
    rarely know about MacRoman. Other than MacRoman, UTF8 is often the
    safest bet for English locales when dealing with a C string of
    unknown encoding (such as the standard output of an application.) No
    UNIX or POSIX standard specifies the encoding for C strings--that's
    an implementation detail. Again, on Mac OS X, C strings are encoded
    by default as MacRoman.

    The best way to avoid this sort of confusion is to simply use
    NSStrings or CFStrings wherever possible, and to avoid APIs that
    expect C strings.

    -Jonathan Grynspan
  • > - (BOOL)blorkString: (NSString *)aString
    > {
    >   const char *cStr = [aString cStringUsingEncoding:
    >                         [NSString defaultCStringEncoding]];
    >
    >   gzFile f = gzopen( cStr, "wb" );
    >   if ( f ) {
    >     /* write some data */
    >     return ( Z_OK == gzclose( f ) );
    >   }
    >
    >   return NO;
    > }

    Actually, that's a bad example. I should have used -[NSString
    fileSystemRepresentation]. Pretend I used some other C functions! :)

    -Jonathan Grynspan
  • On Oct 29, 2006, at 5:23 AM, Jonathan Grynspan wrote:

    > On 28-Oct-06, at 8:32 PM, Alastair Houghton wrote:
    >
    >> On 29 Oct 2006, at 01:22, Adam R. Maxwell wrote:
    >>
    >>> Is +[NSString defaultCStringEncoding] at all relevant to the
    >>> default C locale?  On my systems (US) it's MacRoman, and that may
    >>> be what Jonathan referred to.
    >>
    >> That's the default for NSString/CFString, not the C library AFAIK.
    >>
    >> I think you're right, though, I think that's why Jonathan thought
    >> the C library defaulted to MacRoman.
    >
    > That's the encoding for a C string generated with -[NSString
    > cString] (which is a deprecated method),

    Yes, that's true.

    > and it's also the encoding for C strings in the default "C" locale
    > on Mac OS X, including functions in the C library.

    No, it isn't.

    > No UNIX or POSIX standard specifies the encoding for C strings--
    > that's an implementation detail.

    No.  Ed Moy mentioned this in his reply to my previous message, and
    SUSv3/POSIX.1 says quite specifically (in Base Definitions, section
    6.2):

    "The POSIX locale contains the characters in the Portable Character
    Set, which have the properties listed in LC_CTYPE. In other locales,
    the presence, meaning, and representation of any additional
    characters are locale-specific."

    The Portable Character Set is basically ASCII.

    Unless you set the locale to something different, this is the locale
    you're using.

    I happen to think it would be nice if Mac OS X defaulted the locale
    to something that specified UTF-8, because of Terminal and the
    filesystem APIs, but it doesn't.  Nor does it default to something
    that specifies MacRoman.

    > Again, on Mac OS X, C strings are encoded by default as MacRoman.

    That's not true.  Play with the sample program I posted if you don't
    believe me.  You'll find that the *C library* refuses to convert
    characters over 0x7f to MacRoman when using the C/POSIX locale.

    Generally speaking, it's true that C run time library functions that
    handle strings tend to be 8-bit clean and as a result will work
    equally well with UTF-8 or MacRoman.  The cases where things start to
    matter are those involving the multibyte and wide character
    functions, which is why my sample program uses them.

    Kind regards,

    Alastair.

    --
    http://alastairs-place.net
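
    To illustrate the "8-bit clean" point for the original question:
    the narrow sscanf() conversions generally copy bytes without
    decoding them, so UTF-8 text can pass through a %s untouched. This
    is a sketch under that assumption; the record format and the name
    parse_count_and_label are invented for the example.

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical record format: "<count> <label>", where the label may
       contain UTF-8 bytes. Returns 1 if both fields were read. */
    int parse_count_and_label(const char *utf8, int *count, char label[32])
    {
        assert(utf8 != NULL && count != NULL && label != NULL);

        /* %31s copies up to 31 non-whitespace bytes verbatim and adds the
           terminator; the bytes are not interpreted as characters. */
        return sscanf(utf8, "%d %31s", count, label) == 2;
    }
    ```

    Note that the width limit counts bytes, not characters, so a long
    label could be cut in the middle of a multi-byte sequence.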
  • On Oct 29, 2006, at 1:27 AM, Edward Moy wrote:

    > The default locale goes by the names "C" and "POSIX", and yes it is
    > basically ASCII.  But this is no accident; it is required by UNIX
    > conformance, and we can't change it.

    Sure.  You could have Mac OS X set an appropriate default---using
    UTF-8---for the native environment, though (i.e. the locale you get
    by doing setlocale(LC_ALL, "")).

    For instance, a simple solution might be to have Terminal start the
    shell with the LANG setting initialised based on System Preferences.
    Then users who don't want that can still set it explicitly in
    their .profile.

    Kind regards,

    Alastair.

    --
    http://alastairs-place.net
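
    For anyone who wants to see which character set a given locale
    actually selects, POSIX provides nl_langinfo(CODESET). A small
    sketch follows; codeset_for_locale is a hypothetical helper, and the
    codeset names returned differ between systems.

    ```c
    #include <assert.h>
    #include <langinfo.h>
    #include <locale.h>
    #include <stdio.h>

    /* Selects locale_name for LC_CTYPE and copies the name of its codeset
       into out. Passing "" asks for the native environment, i.e. whatever
       LANG / LC_* specify. Returns 0 on success, -1 if the locale is not
       available. */
    int codeset_for_locale(const char *locale_name, char *out, size_t out_size)
    {
        assert(locale_name != NULL && out != NULL && out_size > 0);

        if (setlocale(LC_CTYPE, locale_name) == NULL)
            return -1;

        snprintf(out, out_size, "%s", nl_langinfo(CODESET));
        return 0;
    }
    ```

    Calling it once with "C" and once with "" shows the difference
    between the conformance-mandated default and the environment's
    setting.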
  • Thanks. I ended up using sscanf() with [theString UTF8String].

    On a side note, I found NSScanner to be extremely difficult to use
    and lacking in flexibility, before I decided to just use sscanf(). Is
    NSScanner better than the C scanners for some situations, or is it an
    attempt to make something object oriented that probably shouldn't be?

    On Oct 28, 2006, at 5:26 PM, David Dunham wrote:

    >
    > On 28 Oct 2006, at 13:25, Henry Skelton wrote:
    >
    >> Is there a sscanf() equivalent for NSStrings, or do I have to
    >> convert them to char* 's ?
    >
    > Converting being simply [theString UTF8String] (this zero-
    > terminated buffer is essentially autoreleased).
    >
    > ------------
    > David Dunham  <dunham...>  http://www.pensee.com/dunham/
    > "No matter how far you have gone on a wrong road, turn back." -
    > Turkish proverb
  • On Oct 29, 2006, at 7:05 AM, Henry Skelton wrote:

    > Thanks. I ended up using sscanf() with [theString UTF8String].
    >
    > On a side note, I found NSScanner to be extremely difficult to use
    > and lacking in flexibility, before I decided to just use sscanf().
    > Is NSScanner better than the C scanners for some situations, or is
    > it an attempt to make something object oriented that probably
    > shouldn't be?

    What are you trying to do exactly? ...NSScanner and sscanf are
    similar in nature but they do have differences because they are
    attacking the problem space in slightly different ways.

    -Shawn
  • On 29 okt 2006, at 16.05, Henry Skelton wrote:

    > On a side note, I found NSScanner to be extremely difficult to use
    > and lacking in flexibility, before I decided to just use sscanf().

    NSScanner definitely takes getting used to, and could be improved
    with regard to both performance and API; it should, for example,
    support regexp queries. That said, I have found that it gets the
    job done.

    What is it that you're missing? What type of problem are you trying
    to solve?

    > Is NSScanner better than the C scanners for some situations, or is
    > it an attempt to make something object oriented that probably
    > shouldn't be?

    What would be the drawback of encapsulating string scanning
    functionality as a class with associated methods?

    j o a r
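
    For comparison, the difference in approach can be sketched in plain
    C. sscanf() matches a whole format in one call, whereas NSScanner
    keeps a scan location and advances it one token at a time. The
    struct and function names below are hypothetical, and this is a
    rough analogy rather than how NSScanner is implemented.

    ```c
    #include <assert.h>
    #include <stdlib.h>
    #include <string.h>

    /* A hypothetical cursor over a C string. */
    struct scanner {
        const char *text;
        size_t pos;
    };

    /* Scans a decimal integer at the cursor. On success stores it in
       *value, advances the cursor and returns 1; otherwise leaves the
       cursor where it was and returns 0. */
    int scanner_scan_int(struct scanner *sc, long *value)
    {
        assert(sc != NULL && value != NULL);

        const char *start = sc->text + sc->pos;
        char *end;
        long v = strtol(start, &end, 10);
        if (end == start)
            return 0;               /* no digits at this position */
        *value = v;
        sc->pos += (size_t)(end - start);
        return 1;
    }

    /* Consumes the literal token if it appears at the cursor. */
    int scanner_scan_literal(struct scanner *sc, const char *token)
    {
        assert(sc != NULL && token != NULL);

        size_t len = strlen(token);
        if (strncmp(sc->text + sc->pos, token, len) != 0)
            return 0;
        sc->pos += len;
        return 1;
    }
    ```

    Because each call reports success or failure separately, the caller
    can branch on what it finds at the current position, which a single
    sscanf() format cannot do.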