How to get an NSString from a non-terminated array of unicode chars (length is known)

  • My problem is that I receive a function call from a C library that
    gives me a wchar_t array and its length. The unicode array is _not_
    terminated.

    The library defines an XML_Char type, so my code below refers to
    that, but XML_Char is wchar_t (which, I believe is UTF8 on a Mac).

    I'm very weak with C, so please forgive my perhaps naive attempts here.

    I tried this approach:

    // as supplied, s is (const XML_Char *), and len is its length
    NSMutableData *data = [NSMutableData dataWithBytes:(void *)s
                                                length:len];
    //append a NULL unicode char
    XML_Char nullChar = 0;
    XML_Char *nullCharPtr = &nullChar;
    int nullCharLen = sizeof nullChar;
    [data appendBytes:nullCharPtr length:nullCharLen];
    NSString *str = [NSString stringWithUTF8String:(const char*)data];

    The idea being to append a NULL to the non-terminated library
    supplied wchar_t array, and then convert given its encoding.
    But that gives me nil (even though my test data is plain old ASCII
    characters, so all the XML_Chars are single bytes).

    So, in experimenting, I tried this:

    NSMutableString *ms = [NSMutableString stringWithCapacity:len];
    int i;
    for (i = 0; i<len; i++) {
      [ms appendFormat:@"%C", s[i]];
    }
    NSLog(@"string is: %@", ms);

    That works, but obviously is quite inefficient.

    There must be a sensible way to do this?

    Many thanks in advance for whatever help someone(s) can provide.
    --Stuart
  • Hi Stuart,

    You can simply use [NSString stringWithUTF8String:s] or [NSString
    stringWithCharacters:s length:len].

    > I tried this approach:
    >
    > NSMutableData *data = [NSMutableData dataWithBytes:(void *)s
    > length:len];  //as supplied, s is: (const XML_Char *), and len is
    > its length
    > //append a NULL unicode char
    > XML_Char nullChar = 0;
    > XML_Char *nullCharPtr = &nullChar;
    > int nullCharLen = sizeof nullChar;
    > [data appendBytes:nullCharPtr length:nullCharLen];
    > NSString *str = [NSString stringWithUTF8String:(const char*)data];

    NSData is an object. Unlike a plain C buffer, it does not expose its
    memory directly; it hides the storage and provides methods (such as
    -bytes and -length) to access the data. So you can't just cast an
    NSData object to a const char *.

    you could have used: [[NSString alloc] initWithData: data encoding:
    NSUTF8StringEncoding].

    Kind Regards
    Karsten

  • On Mar 3, 2008, at 9:19 PM, Karsten wrote:

    > you can simply use: [NSString stringWithUTF8String: s]. or
    > [NSString stringWithCharacters:s length: len].

    Can't use the former because s is not terminated.

    The latter (with a cast):
    NSString *str = [NSString stringWithCharacters:(const unichar *)s
    length:len];

    doesn't work (must be an encoding discrepancy?). Source ASCII "hello"
    becomes: 敨汬㱯�愾

    >> I tried this approach:
    >>
    >> NSMutableData *data = [NSMutableData dataWithBytes:(void *)s
    >> length:len];  //as supplied, s is: (const XML_Char *), and len is
    >> its length
    >> //append a NULL unicode char
    >> XML_Char nullChar = 0;
    >> XML_Char *nullCharPtr = &nullChar;
    >> int nullCharLen = sizeof nullChar;
    >> [data appendBytes:nullCharPtr length:nullCharLen];
    >> NSString *str = [NSString stringWithUTF8String:(const char*)data];
    >
    > NSData is an object, unlike in plain c this doesn't represent a
    > part of the memory in a given way, instead it hides this
    > information and provides methods to access the data. So you can't
    > just cast a NSData object into a const char*.

    Yeesh -- I'm embarrassed: but of course.

    > you could have used: [[NSString alloc] initWithData: data encoding:
    > NSUTF8StringEncoding].

    Yes - thanks - that works:

    NSMutableData *data = [NSMutableData dataWithBytes:(void *)s
    length:len];
    //append a NULL unicode char
    XML_Char nullChar = 0;
    XML_Char *nullCharPtr = &nullChar;
    int nullCharLen = sizeof nullChar;
    [data appendBytes:nullCharPtr length:nullCharLen];
    NSString *str = [[[NSString alloc] initWithData:data
    encoding:NSUTF8StringEncoding] autorelease];
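
The garbled result quoted earlier in this message is consistent with the bytes of "hello" being read two at a time as 16-bit units. Below is a minimal C sketch of that effect, assuming little-endian byte order (as on an Intel Mac). The helper name bytes_as_utf16le is hypothetical and not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reinterpret a byte buffer as little-endian 16-bit units, the way a
   UTF-16 reader on a little-endian machine would see it.
   Writes nbytes / 2 units to out and returns that count. */
size_t bytes_as_utf16le(const unsigned char *bytes, size_t nbytes,
                        uint16_t *out)
{
    assert(bytes != NULL && out != NULL);
    size_t n = nbytes / 2;
    for (size_t i = 0; i < n; i++) {
        /* Low byte comes first in memory, high byte second. */
        out[i] = (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
    }
    return n;
}
```

With the bytes 'h', 'e' the first unit is 0x6568 (敨), and 'l', 'l' gives 0x6C6C (汬). An 'o' followed by '<' gives 0x3C6F (㱯), which suggests the reader ran past the five bytes of "hello" into whatever followed in the buffer. That would fit stringWithCharacters:length: counting 16-bit units rather than bytes.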
  • On 3 Mar '08, at 11:00 PM, Stuart Malin wrote:

    > The library defines an XML_Char type, so my code below refers to
    > that, but XML_Char is wchar_t (which, I believe is UTF8 on a Mac).

    No. It's UTF-16. (UTF-8 is an 8-bit encoding, a superset of ASCII
    where characters >127 are encoded as multiple bytes.)

    Use -[NSString initWithCharacters:length:].

    —Jens
  • On Mar 3, 2008, at 11:00 PM, Stuart Malin wrote:

    > My problem is that I receive a function call from a C library that
    > gives me a wchar_t array and its length. The unicode array is _not_
    > terminated.
    >
    > The library defines an XML_Char type, so my code below refers to
    > that, but XML_Char is wchar_t (which, I believe is UTF8 on a Mac).

    You actually have two problems here:

    1) wchar_t on the Mac is a 4 byte per character container (32 bits).

    2) wchar_t is just a container, it does not define the encoding of the
    character it contains.

    So, you need to know exactly what the encoding used in your container
    is before you can get it converted to an NSString of a known encoding.

    NSString infers the width of the characters from the encoding. If you
    have a buffer of characters where the width does not match the
    encoding you will probably have to re-buffer the characters into the
    correct width before handing them to NSString.

    If you are correct in that you have a piece of code that has UTF8 in a
    wchar_t string (which would be horribly inefficient, wasting 3 bytes
    per character in the string) you might need to write some code that
    copies every 4th byte from the wchar_t string into a UTF8 buffer that
    you can then use as input to NSString.

    Dave
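
The re-buffering Dave describes can be sketched in C without copying "every 4th byte", by working on character values instead of raw bytes. This is only an illustration under the assumption that the wide buffer holds 7-bit ASCII values. The function name narrow_ascii is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Copy len wide characters into a byte buffer, one byte per character.
   Works on character values, so it does not depend on sizeof(wchar_t)
   or on byte order. Returns 0 on success, or -1 if any character is
   outside 7-bit ASCII (the caller then needs a real converter). */
int narrow_ascii(const wchar_t *s, size_t len, char *out)
{
    assert(s != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        /* The cast makes negative values (if wchar_t is signed) fail too. */
        if ((unsigned long)s[i] > 0x7FUL) {
            return -1;
        }
        out[i] = (char)s[i];
    }
    return 0;
}
```

The output is not NUL-terminated, so it would be handed on together with len, for example to a length-taking initializer such as initWithBytes:length:encoding:.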
  • On Mar 4, 2008, at 8:25 AM, Dave Camp wrote:
    >
    > You actually have two problems here:
    >
    > 1) wchar_t on the Mac is a 4 byte per character container (32 bits).

    Not quite correct. wchar_t, may, at this time, default to 4 bytes in
    an Xcode project, but it is *not* defined to be 4 bytes on the Mac. In
    fact, it is quite easy to make wchar_t be 2 bytes. Assumptions about
    the actual size of a wchar_t are probably a bug.

    --Brady
  • Brady Duga wrote:
    > On Mar 4, 2008, at 8:25 AM, Dave Camp wrote:
    >>
    >> You actually have two problems here:
    >>
    >> 1) wchar_t on the Mac is a 4 byte per character container (32 bits).
    >
    > Not quite correct. wchar_t, may, at this time, default to 4 bytes in
    > an Xcode project, but it is *not* defined to be 4 bytes on the Mac. In
    > fact, it is quite easy to make wchar_t be 2 bytes. Assumptions about
    > the actual size of a wchar_t are probably a bug.
    Yes and no.

    You can tell the compiler to treat wchar_t as 16-bit and it will happily
    comply.

    However, don't expect to be able to use any standard library calls in
    this mode. wcslen, wcscpy, wcscat? Roll your own. swprintf?
    Just forget about it.

    The limitations here make it difficult to leverage 16-bit wchar_t
    effectively, IMO.
  • On Mar 4, 2008, at 10:13 AM, John Stiles wrote:
    >
    > However, don't expect to be able to use any standard library calls
    > in this mode. wcslen, wcscpy, wcscat? Roll your own. swprintf?
    > Just forget about it.

    If you need to.

    >
    > The limitations here make it difficult to leverage 16-bit wchar_t
    > effectively, IMO.

    I use a fairly large library that is integrated with a Cocoa app that
    uses 16 bit wchar_t. It is an emulation environment, so having 16 bit
    wchar_t is important for testing purposes. Works fine. Also, any
    libraries that might be used across platforms need to be wchar_t size
    agnostic, though in that case you can probably get away with assuming
    returned wchar_ts are 32 bit when on the Mac. In any case, it is
    generally best not to assume you know the size of a wchar_t (and there
    is rarely a need to assume it).

    --Brady
  • To be clear, I'm not trying to say that 16-bit wchar_t has no value.

    Honestly, I think Apple should have much better support for it. If
    you're porting an app from Windows, you probably have to deal with
    WCHARs in some fashion. I'd like to say I have a radar for it but I'm
    honestly not sure. It's been a few years since the last time I had to
    deal with 16-bit wchar_ts. At the time I remember needing to borrow
    large chunks of MSL from CodeWarrior, and getting it to compile on
    Xcode… bleck.

  • On Tue, Mar 4, 2008 at 8:58 AM, Brady Duga <duga...> wrote:
    >
    > On Mar 4, 2008, at 8:25 AM, Dave Camp wrote:
    >>
    >> You actually have two problems here:
    >>
    >> 1) wchar_t on the Mac is a 4 byte per character container (32 bits).
    >
    > Not quite correct. wchar_t, may, at this time, default to 4 bytes in
    > an Xcode project, but it is *not* defined to be 4 bytes on the Mac.

    Actually, all of the standard C and C++ libraries on the Mac define
    wchar_t as 4 bytes. If you change it to be 2 bytes, then you will be
    unable to call any standard-library functions taking wchar_t (or a
    pointer thereto) as a parameter. For all intents and purposes, wchar_t
    is 4 bytes on the Mac (and will be on any platform that intends to put
    Unicode into wchar_t *and* support the C99 standard).

    > In fact, it is quite easy to make wchar_t be 2 bytes.

    If Apple were to change wchar_t to be 2 bytes instead of 4, it would
    break every single piece of software on the Mac that uses wchar_t.

    > Assumptions about the actual size of a wchar_t are probably a bug.

    There are some situations in which the C standard allows you to assume
    that wchar_t contains UTF-32/UCS-4:

    From C99 (with TC2 applied):

        "__STDC_ISO_10646__
        An integer constant of the form yyyymmL (for example, 199712L). If
        this symbol is defined, then every character in the "Unicode required
        set", when stored in an object of type wchar_t, has the same value as
        the short identifier of that character. The "Unicode required set"
        consists of all the characters that are defined by ISO/IEC 10646,
        along with all amendments and technical corrigenda, as of the
        specified year and month."
    -----

    Therefore (since 2001/11 was the month in which the Unicode character
    set grew beyond 16-bit):

    #if defined(__STDC_ISO_10646__) && __STDC_ISO_10646__ >= 200111L
    //wchar_t must be UTF-32/UCS-4.
    #endif

    Of course, gcc on the Mac doesn't yet define __STDC_ISO_10646__, so my
    point is mostly academic, but there may come a time when it *is* safe
    to make assumptions about the size of wchar_t.
    --
    Clark S. Cox III
    <clarkcox3...>
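
The preprocessor check in this message can be wrapped in a small function so that code can ask at run time whether the compiler makes the promise. This is a sketch only. The function name wchar_is_ucs4 is hypothetical, and whether the macro is defined depends entirely on the toolchain in use.

```c
#include <assert.h>
#include <wchar.h>

/* Returns 1 if the implementation promises (via __STDC_ISO_10646__) that
   wchar_t values are ISO 10646 code points as of 2001/11 or later,
   0 if it makes no such promise. */
int wchar_is_ucs4(void)
{
#if defined(__STDC_ISO_10646__) && __STDC_ISO_10646__ >= 200111L
    /* The promise implies wchar_t can represent code points above 0xFFFF. */
    assert(WCHAR_MAX > 0xFFFF);
    return 1;
#else
    return 0;
#endif
}
```

A return value of 0 does not mean wchar_t is narrow. It only means the compiler has not stated the guarantee, so the encoding still has to be established some other way.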
  • You don't need the NSMutableData:

    NSString *str = [[[NSString alloc] initWithBytes:s length:len
    encoding:NSUTF8StringEncoding] autorelease];

    ...assuming it's really UTF-8. If it's wchar_t and you know it's
    Unicode, use NSUTF32StringEncoding (only available on 10.5 or later).
    If it's UTF-16, use stringWithCharacters:length:.

    Deborah Goldsmith
    Apple Inc.
    <goldsmit...>

  • Thank you everybody for your thoughts and suggestions.
    I hope some of the conversation that was stirred up was useful.

    NSString's initWithBytes:length:encoding: worked.
    It did so because I had a misunderstanding -- in most cases my source
    string is UTF-8 (not UTF-16), and the source string is byte wide, not
    wchar_t.

    However, I do have cases where the content passed to me is not valid
    UTF-8: it contains a non-7-bit-ASCII byte that is actually a separator
    used to delimit a set of joined UTF-8 strings. So I wrote a routine to
    scan this byte sequence, identify the start and end points, and then
    use NSString's initWithBytes:length:encoding: to convert the sections.

    Again, many thanks for the contributions made to help me.
    --Stuart
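
A scanning routine of the kind described in this message might look roughly like the following C sketch. The actual separator byte is not given in the thread. The test values below use 0xFF purely as an assumption (it never occurs in well-formed UTF-8), and the names span and split_on_byte are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One section of the input: a pointer into the buffer plus a length. */
typedef struct {
    const unsigned char *start;
    size_t len;
} span;

/* Split buf[0..len) on the separator byte sep. Stores up to max_spans
   sections in out (empty sections included) and returns the total number
   of sections found, which may exceed max_spans. */
size_t split_on_byte(const unsigned char *buf, size_t len,
                     unsigned char sep, span *out, size_t max_spans)
{
    assert(buf != NULL && out != NULL);
    size_t count = 0;
    size_t begin = 0;
    for (size_t i = 0; i <= len; i++) {
        /* End of input closes the last section, just like a separator. */
        if (i == len || buf[i] == sep) {
            if (count < max_spans) {
                out[count].start = buf + begin;
                out[count].len = i - begin;
            }
            count++;
            begin = i + 1;
        }
    }
    return count;
}
```

Each resulting start/len pair could then be passed to initWithBytes:length:encoding: with NSUTF8StringEncoding, with no copying or terminator needed.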
