Odd NSData initWithContentsOfURL: issue

  • Hey all,

    I'm trying to grab XML files from various URLs and save the XML file to
    disk as a text file.  Here is the little snippet of code I'm using:

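    // feedURL is an NSString holding the address of the feed.
    NSURL *xmlURL = [[NSURL alloc] initWithString:feedURL];
    NSData *theXMLFile = [[NSData alloc] initWithContentsOfURL:xmlURL];
    // Write the bytes that came back to disk, exactly as received.
    [theXMLFile writeToFile:@"/tmp/blaa.xml" atomically:YES];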

    Ok, so all it should do, given the URL as the NSString 'feedURL', is
    grab the data using that URL, then write it out to /tmp/blaa.xml.

    It works for most URLs I try.  But there are some that result in binary
    garbage data in the blaa.xml file!  I'm wondering if anyone has any
    clue where I can look to start figuring this out.

    For example, this XML file works fine:
      http://www.coverville.com/index.xml
    But this one doesn't:
      http://vinylpodcast.com/wp-rss2.php

    If I load 'em up in Firefox or Safari and view source, they both look
    like valid XML text files.  The first is saved as a valid XML text
    file.  The second is saved as binary junk.  The .php extension
    shouldn't matter, since that is just a server-side script that sends
    back a valid XML file.

    Any ideas at all???

    Thanks a lot in advance!
    -Mike

  • Yes, I know exactly what's going on. The server is returning it as a
    gzipped file. initWithContentsOfURL: is evidently telling the server
    it can handle gzip, but it's not decompressing the response for you.
    This is surprising: either it shouldn't say it can handle gzip, or it
    should decompress it for you.

    I suggest passing any data you get back through a gzip decompressor
    (you could use libz directly, or you could see if anybody's written a
    Cocoa wrapper). If it's not gzip data, libz's gzFile reader (gzopen and
    gzread) returns it unmodified, and if it is, it returns the
    decompressed data.
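
    A minimal sketch of how one might tell the two cases apart before
    decompressing, assuming the downloaded bytes are already in memory.
    The function name looks_like_gzip is made up for illustration; the
    signature bytes (0x1f 0x8b, followed by 8 for deflate) are the ones
    the gzip file format defines. Plain C is used because it compiles
    unchanged inside an Objective-C source file.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the buffer starts with the gzip signature
   (ID1 = 0x1f, ID2 = 0x8b, CM = 8 for deflate), otherwise 0.
   Hypothetical helper, not part of Foundation or zlib. */
static int looks_like_gzip(const unsigned char *bytes, size_t length)
{
    assert(bytes != NULL || length == 0);
    return length >= 3 &&
           bytes[0] == 0x1f &&
           bytes[1] == 0x8b &&
           bytes[2] == 0x08;
}
```

    With the snippet from the original post, the arguments would be
    [theXMLFile bytes] (cast to const unsigned char *) and
    [theXMLFile length]. Data that fails the check can be written to
    disk as it is.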

    On Mar 13, 2005, at 10:27 PM, Michael J. Sherman wrote:

    > I'm trying to grab XML files from various URLs and save the XML file
    > to disk as a text file.  Here is the little snippet of code I'm using:
    >
    > NSURL *xmlURL = [[NSURL alloc] initWithString:feedURL];
    > NSData *theXMLFile = [[NSData alloc] initWithContentsOfURL:xmlURL];
    > [theXMLFile writeToFile:@"/tmp/blaa.xml" atomically:TRUE];
    >
    > Ok, so all it should do, given the URL as the NSString 'feedURL', is
    > grab the data using that URL, then write it out to /tmp/blaa.xml.
    >
    > It works for most URLs I try.  But there are some that result in
    > binary garbage data in the blaa.xml file!  I'm wondering if anyone
    > has any clue where I can look to start figuring this out.
    >
    > For example, this XML file works fine:
    > http://www.coverville.com/index.xml
    > But this one doesn't:
    > http://vinylpodcast.com/wp-rss2.php
    >
    > If I load 'em up in Firefox or Safari and view source, they both look
    > like valid XML text files.  The first is saved as a valid XML text
    > file.  The second is saved as binary junk.  The .php extension
    > shouldn't matter, since that is a server-side scripting that just
    > sends back a valid XML file.
    >
    > Any ideas at all???

    --
    Kevin Ballard
    <kevin...>
    http://www.tildesoft.com
    http://kevin.sb.org
  • On 14 Mar 2005, at 3:27, Michael J. Sherman wrote:
    > I'm trying to grab XML files from various URLs and save the XML file
    > to disk as a text file.
    ...
    > NSData *theXMLFile = [[NSData alloc] initWithContentsOfURL:xmlURL];
    ...
    > It works for most URLs I try.  But there are some that result in
    > binary garbage data in the blaa.xml file!  I'm wondering if anyone has
    > any clue where I can look to start figuring this out.

    I ran into the same problem the other day but with the
    initWithContentsOfURL: message in NSXMLParser instead of NSData.
    Another site that fails is the C|Net RSS feed at
    http://news.com.com/2547-1_3-0-5.xml.  At the time I didn't have enough
    time to test things fully, and then I went off on holiday, but seeing
    your mail made me come back and look at this again.

    It seems that there is a real bug in the initWithContentsOfURL: code
    for NSData and NSString and by extension every class that uses them.  A
    quick check of the headers sent with the request shows that they
    include:
    Accept-Encoding: gzip, deflate;q=1.0, identity;q=0.5, *;q=0
    The problem is that when the server does send back gzipped data it's not
    getting unpacked.  I just tried your code to download from the site you
    mentioned and from CNet and the resulting files unpack just fine on the
    command line using gunzip.
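
    To make that header concrete, here is a rough sketch of how a server
    might read it. This is a simplified parser written for this note, not
    code from any real HTTP stack, and the names accept_encoding_q and
    equal_ignoring_case are invented. It assumes the usual HTTP/1.1
    convention that an entry without a q parameter counts as q=1.0 and
    that "*" covers any coding not listed explicitly.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters of a and b. */
static int equal_ignoring_case(const char *a, const char *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Returns the q-value the header assigns to `coding`. An explicit entry
   wins; otherwise a "*" entry applies; otherwise -1.0 means "not listed".
   An entry without a q parameter counts as q=1.0. */
static double accept_encoding_q(const char *header, const char *coding)
{
    const char *p = header;
    size_t coding_len;
    double wildcard_q = -1.0;

    assert(header != NULL && coding != NULL);
    coding_len = strlen(coding);

    while (*p != '\0') {
        const char *end, *q_param;
        size_t name_len;
        double q = 1.0;

        while (*p == ' ' || *p == ',')   /* skip separators */
            p++;
        if (*p == '\0')
            break;

        end = p + strcspn(p, ",");       /* end of this entry */
        name_len = strcspn(p, ";, ");    /* name stops at ';', ',' or ' ' */

        /* Look for a "q=" parameter after the coding name. */
        for (q_param = p + name_len; q_param + 1 < end; q_param++) {
            if (q_param[0] == 'q' && q_param[1] == '=') {
                q = strtod(q_param + 2, NULL);
                break;
            }
        }

        if (name_len == 1 && p[0] == '*')
            wildcard_q = q;
        else if (name_len == coding_len &&
                 equal_ignoring_case(p, coding, name_len))
            return q;

        p = end;
    }
    return wildcard_q;
}
```

    For the header quoted above this gives 1.0 for gzip and 0.5 for
    identity, so a server that can compress is being told that gzip is
    preferred over sending the data as it is.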

    I will file a bug report with Apple.  Given that 10.3.9 is in seed
    already I guess it won't get fixed before Tiger :-(

    Nicko
  • On Mar 13, 2005, at 10:01 PM, Nicko van Someren wrote:

    > On 14 Mar 2005, at 3:27, Michael J. Sherman wrote:
    >> I'm trying to grab XML files from various URLs and save the XML
    >> file to disk as a text file.
    > ...
    >> NSData *theXMLFile = [[NSData alloc] initWithContentsOfURL:xmlURL];
    > ...
    >> It works for most URLs I try.  But there are some that result in
    >> binary garbage data in the blaa.xml file!  I'm wondering if anyone
    >> has any clue where I can look to start figuring this out.
    >
    > I ran into the same problem the other day but with the
    > initWithContentsOfURL: message in NSXMLParser instead of NSData.
    > Another site that fails is the C|Net RSS feed at:
    > http://news.com.com/2547-1_3-0-5.xml  At the time I didn't have enough
    > time to fully test things and then I went off on holiday but seeing
    > your mail made me come back and look at this again.
    >
    > It seems that there is a real bug in the initWithContentsOfURL:
    > code for NSData and NSString and by extension every class that uses
    > them.  A quick check of the headers sent with the request shows
    > that they include:
    > Accept-Encoding: gzip, deflate;q=1.0, identity;q=0.5, *;q=0
    > The problem is that when the server does send back gzipped data it's
    > not getting unpacked.  I just tried your code to download from the
    > site you mentioned and from CNet and the resulting files unpack
    > just fine on the command line using gunzip.
    >
    > I will file a bug report with Apple.  Given that 10.3.9 is in seed
    > already I guess it won't get fixed before Tiger :-(

    That would be a feature request, rather than a bug report.
    -initWithContentsOfURL: is supposed to give you whatever's at that
    URL, without any modifications.

    -jcr

    John C. Randolph <jcr...> (408) 974-8819
    Sr. Cocoa Software Engineer,
    Apple Worldwide Developer Relations
    http://developer.apple.com/cocoa/index.html
  • On Mar 15, 2005, at 11:46 AM, John C. Randolph wrote:

    >
    > On Mar 13, 2005, at 10:01 PM, Nicko van Someren wrote:
    >
    >> On 14 Mar 2005, at 3:27, Michael J. Sherman wrote:
    >>> I'm trying to grab XML files from various URLs and save the XML
    >>> file to disk as a text file.
    >> ...
    >>> NSData *theXMLFile = [[NSData alloc] initWithContentsOfURL:xmlURL];
    >> ...
    >>> It works for most URLs I try.  But there are some that result in
    >>> binary garbage data in the blaa.xml file!  I'm wondering if anyone
    >>> has any clue where I can look to start figuring this out.
    >>
    >> I ran into the same problem the other day but with the
    >> initWithContentsOfURL: message in NSXMLParser instead of NSData.
    >> Another site that fails is the C|Net RSS feed at:
    >> http://news.com.com/2547-1_3-0-5.xml  At the time I didn't have
    >> enough time to fully test things and then I went off on holiday but
    >> seeing your mail made me come back and look at this again.
    >>
    >> It seems that there is a real bug in the initWithContentsOfURL: code
    >> for NSData and NSString and by extension every class that uses them.
    >> A quick check of the headers sent with the request shows that they
    >> include:
    >> Accept-Encoding: gzip, deflate;q=1.0, identity;q=0.5, *;q=0
    >> The problem is that when the server does send back gzipped data it's
    >> not getting unpacked.  I just tried your code to download from the
    >> site you mentioned and from CNet and the resulting files unpack just
    >> fine on the command line using gunzip.
    >>
    >> I will file a bug report with Apple.  Given that 10.3.9 is in seed
    >> already I guess it won't get fixed before Tiger :-(
    >
    > That would be a feature request, rather than a bug report.
    > -initWithContentsOfURL: is supposed to give you whatever's at that
    > URL, without any modifications.

    I think most users would argue that the compression of the data is an
    artifact of the transport mechanism, and does not represent the actual
    data at that URL. And you should not be explicitly advertising gzip
    support to the server on the client's behalf if you aren't going to
    take responsibility for the actual gzip decompression.
  • On Mar 15, 2005, at 2:46 PM, John C. Randolph wrote:

    >
    > On Mar 13, 2005, at 10:01 PM, Nicko van Someren wrote:
    >
    >> On 14 Mar 2005, at 3:27, Michael J. Sherman wrote:
    >>> I'm trying to grab XML files from various URLs and save the XML file
    >>> to disk as a text file.
    >> ...
    >>> NSData *theXMLFile = [[NSData alloc] initWithContentsOfURL:xmlURL];
    >> ...
    >>> It works for most URLs I try.  But there are some that result in
    >>> binary garbage data in the blaa.xml file!  I'm wondering if anyone
    >>> has any clue where I can look to start figuring this out.
    >>
    >> I ran into the same problem the other day but with the
    >> initWithContentsOfURL: message in NSXMLParser instead of NSData.
    >> Another site that fails is the C|Net RSS feed at:
    >> http://news.com.com/2547-1_3-0-5.xml  At the time I didn't have
    >> enough time to fully test things and then I went off on holiday but
    >> seeing your mail made me come back and look at this again.
    >>
    >> It seems that there is a real bug in the initWithContentsOfURL: code
    >> for NSData and NSString and by extension every class that uses them.
    >> A quick check of the headers sent with the request shows that they
    >> include:
    >> Accept-Encoding: gzip, deflate;q=1.0, identity;q=0.5, *;q=0
    >> The problem is that when the server does send back gzipped data it's
    >> not getting unpacked.  I just tried your code to download from the
    >> site you mentioned and from CNet and the resulting files unpack just
    >> fine on the command line using gunzip.
    >>
    >> I will file a bug report with Apple.  Given that 10.3.9 is in seed
    >> already I guess it won't get fixed before Tiger :-(
    >
    > That would be a feature request, rather than a bug report.
    > -initWithContentsOfURL: is supposed to give you whatever's at that
    > URL, without any modifications.
    >

    But the library is advertising to the remote server that it can
    handle gzip, yet it never bothers to unzip the data when it gets it.
    What am I supposed to do with the gzipped return data?  Do you have a
    solution?
  • On Mar 15, 2005, at 7:06 PM, Michael J. Sherman wrote:

    >
    > On Mar 15, 2005, at 2:46 PM, John C. Randolph wrote:
    >
    >>
    >> On Mar 13, 2005, at 10:01 PM, Nicko van Someren wrote:
    >>
    >>> On 14 Mar 2005, at 3:27, Michael J. Sherman wrote:
    >>>> I'm trying to grab XML files from various URLs and save the XML
    >>>> file to disk as a text file.
    >>> ...
    >>>> NSData *theXMLFile = [[NSData alloc] initWithContentsOfURL:xmlURL];
    >>> ...
    >>>> It works for most URLs I try.  But there are some that result in
    >>>> binary garbage data in the blaa.xml file!  I'm wondering if
    >>>> anyone has any clue where I can look to start figuring this out.
    >>>
    >>> I ran into the same problem the other day but with the
    >>> initWithContentsOfURL: message in NSXMLParser instead of NSData.
    >>> Another site that fails is the C|Net RSS feed at:
    >>> http://news.com.com/2547-1_3-0-5.xml  At the time I didn't have enough
    >>> time to fully test things and then I went off on holiday but
    >>> seeing your mail made me come back and look at this again.
    >>>
    >>> It seems that there is a real bug in the initWithContentsOfURL:
    >>> code for NSData and NSString and by extension every class that
    >>> uses them.  A quick check of the headers sent with the request
    >>> shows that they include:
    >>> Accept-Encoding: gzip, deflate;q=1.0, identity;q=0.5, *;q=0
    >>> The problem is that when the server does send back gzipped data
    >>> it's not getting unpacked.  I just tried your code to download
    >>> from the site you mentioned and from CNet and the resulting files
    >>> unpack just fine on the command line using gunzip.
    >>>
    >>> I will file a bug report with Apple.  Given that 10.3.9 is in
    >>> seed already I guess it won't get fixed before Tiger :-(
    >>
    >> That would be a feature request, rather than a bug report.
    >> -initWithContentsOfURL: is supposed to give you whatever's at that
    >> URL, without any modifications.
    >>
    >
    > But the library is advertising to the remote server the fact that
    > it can handle gzip, but it never bothers to unzip it when it gets
    > it.  What am I supposed to do with the gzipped return data?  Do you
    > have a solution?
    >

    I would suggest getting zlib, (http://www.gzip.org/zlib/) and using
    the inflate() function, or using an NSTask to invoke gunzip.

    -jcr

  • > I would suggest getting zlib, (http://www.gzip.org/zlib/) and using
    > the inflate() function, or using an NSTask to invoke gunzip.

    Zlib should be installed by default. On my system, without any
    modifications, both zlib and bzlib are installed. The zlib library is
    pretty easy to learn, so I would think that invoking gunzip (using
    NSTask) would be unnecessary. The documentation for zlib is pretty
    limited (and bzlib's is even worse), but a little experimentation
    tends to reveal all that's necessary.

    I hope this helps,
    Will
  • On Mar 16, 2005, at 12:38 AM, Will Mason wrote:

    >
    >> I would suggest getting zlib, (http://www.gzip.org/zlib/) and using
    >> the inflate() function, or using an NSTask to invoke gunzip.
    >
    > Zlib should be installed by default. On my system without any
    > modifications, both zlib and bzlib are installed.

    Now that you mention it, I seem to recall that gunzip uses zlib, so I
    should have known it was there.

    -jcr

  • On 15 Mar 2005, at 20:01, John Stiles wrote:
    > On Mar 15, 2005, at 11:46 AM, John C. Randolph wrote:
    >> On Mar 13, 2005, at 10:01 PM, Nicko van Someren wrote:
    ...
    >>> It seems that there is a real bug in the initWithContentsOfURL: code
    >>> for NSData and NSString and by extension every class that uses them.
    >>> A quick check of the headers sent with the request shows that they
    >>> include:
    >>> Accept-Encoding: gzip, deflate;q=1.0, identity;q=0.5, *;q=0
    ...
    >>> I will file a bug report with Apple.  Given that 10.3.9 is in seed
    >>> already I guess it won't get fixed before Tiger :-(
    >>
    >> That would be a feature request, rather than a bug report.
    >> -initWithContentsOfURL: is supposed to give you whatever's at that
    >> URL, without any modifications.

    The problem is that it does not do that.  There is no compressed data
    at that URL.

    > I think most users would argue that the compression of the data is an
    > artifact of the transport mechanism, and does not represent the actual
    > data at that URL. And you should not be explicitly advertising gzip
    > support to the server on the client's behalf if you aren't going to
    > take responsibility for the actual gzip decompression.

    Absolutely.  The data is not compressed on the server and it is only
    being compressed in transit at the (erroneous) request of the client.
    If you make an access to the same URL using, say, wget, you get the raw
    data.  If you access the URL with Safari and display the source you get
    the raw data.  If you access the URL with initWithContentsOfURL: you
    get the compressed transport stream and not the data at the URL.

    The compression of the data is an un-requested artefact of the manner
    in which initWithContentsOfURL asks for the data.  There is no way for
    the user to know in advance if the server supports compression and
    therefore will be affected by the Accept-Encoding: header line that
    gets sent.  Do you honestly think that this is a feature, let alone a
    useful one, and not an honest-to-goodness bug?

    Nicko