Create NSStrings from a mapped NSData object - safe?

  • Salutations!

    I'm parsing a rather large text-file (usually >20MB) and in doing so
    I'm iterating over its lines with [String getParagraphStart::::]. I've
    found a rather noticeable speed-up in the parsing operation if I
    create the string in question from an NSData object (created via
    initWithContentsOfMappedFile) using [String initWithData:encoding:].

    Now to the questions:
    1) Is this safe if the file in question is being moved / deleted /
    edited during parsing?

    2) Are substrings created from the original string (e.g.
    substringWithRange etc.) still backed properly after the original
    string and the NSData object are released?

    Thanks for any pointers,
    Daniel.
  • On May 13, 2008, at 12:38 AM, Daniel Vollmer wrote:

    > Salutations!
    >
    > I'm parsing a rather large text-file (usually >20MB) and in doing so
    > I'm iterating over its lines with [String getParagraphStart::::].
    > I've found a rather noticeable speed-up in the parsing operation if
    > I create the string in question from an NSData object (created via
    > initWithContentsOfMappedFile) using [String initWithData:encoding:].
    >
    > Now to the questions:
    > 1) Is this safe if the file in question is being moved / deleted /
    > edited during parsing?

    No, the system might not map the entire file (depending on its size,
    available resources, or implementation choices), so when you read a
    not-yet-mapped or previously discarded page, it comes from the
    current state of the file, be it untouched, modified, or deleted.
    Basically there's no guarantee you'll get the same data that was in
    the file when you first mapped it if something else modifies or
    destroys it while it's mapped.

    > 2) Are substrings created from the original string (e.g.
    > substringWithRange etc.) still backed properly after the original
    > string and the NSData object are released?

    Yes, they're newly created and individual string objects.
  • On 12 May '08, at 11:38 PM, Daniel Vollmer wrote:

    > I'm parsing a rather large text-file (usually >20MB) and in doing so
    > I'm iterating over its lines with [String getParagraphStart::::].
    > I've found a rather noticeable speed-up in the parsing operation if
    > I create the string in question from an NSData object (created via
    > initWithContentsOfMappedFile) using [String initWithData:encoding:].

    It sounds like you're creating a single NSString containing the entire
    contents of the file, then?

    > Now to the questions:
    > 1) Is this safe if the file in question is being moved / deleted /
    > edited during parsing?

    The string initializer you're using copies the data. This might just
    involve calling -copy on the NSData instead of copying the bytes into
    a new buffer; I'm not sure.

    If the NSString made its own copy of the bytes, then you're totally
    safe; the data from the mapped file isn't being used at all anymore.

    If it's using the bytes in the NSData, you're "mostly" safe. Moving or
    deleting the mapped file won't break the mapping (a deleted file isn't
    actually deleted until all open file descriptors close.) A typical
    "safe-save" won't alter the data either, since it creates a new file
    and then deletes the old one. The only problem would be if something
    overwrote the file in place, in which case the overwritten data would
    suddenly show up in the NSData.

    > 2) Are substrings created from the original string (e.g.
    > substringWithRange etc.) still backed properly after the original
    > string and the NSData object are released?

    Yes. Even if the NSString is still using the NSData's contents for its
    buffer, it retained them, so releasing the NSData won't make it go
    away until the string is done with it.

    —Jens
  • On 13 May '08, at 12:15 AM, Michael Vannorsdel wrote:

    > Basically there's no guarantee you'll get the same data that was in
    > the file when you first mapped it if something else modifies or
    > destroys it while it's mapped.

    You're correct about modifications, but not about deletions. An open
    file descriptor counts as a link to a file, so the unlink(2) system
    call will not actually delete the file from disk because there's still
    a link to it. Once you close the mapped file, the last link goes away
    and then the file is actually deleted.

    This is sometimes used as a technique for creating anonymous temporary
    files — create a temp file with open(2), then unlink(2) it so it
    doesn't exist in the directory tree anymore, then it's yours to write/
    read until you close(2) it.

    While we're on the topic, it's worth noting that memory-mapping files
    on removable or network filesystems can be dangerous. If you read/
    write a mapped memory location, and the kernel has to page it in, but
    the file's filesystem is no longer accessible due to a network issue
    or a yanked USB cable ... you get a bus-error. I've seen discussion of
    ways to handle this by installing a signal handler whenever you access
    mapped memory, but it would be pretty tricky to pull off. The
    conclusion was that it's only safe to memory-map files that are either
    (a) on the boot filesystem, or (b) in the user's home directory. (The
    latter might be on a networked filesystem, but if it ever gets
    disconnected, most of the upper layers of the OS and applications will
    be hosed anyway...)

    —Jens
  • On May 13, 2008, at 8:08 AM, Jens Alfke wrote:

    > While we're on the topic, it's worth noting that memory-mapping
    > files on removable or network filesystems can be dangerous. If you
    > read/write a mapped memory location, and the kernel has to page it
    > in, but the file's filesystem is no longer accessible due to a
    > network issue or a yanked USB cable ... you get a bus-error. I've
    > seen discussion of ways to handle this by installing a signal
    > handler whenever you access mapped memory, but it would be pretty
    > tricky to pull off. The conclusion was that it's only safe to memory-
    > map files that are either (a) on the boot filesystem, or (b) in the
    > user's home directory. (The latter might be on a networked
    > filesystem, but if it ever gets disconnected, most of the upper
    > layers of the OS and applications will be hosed anyway...)

    I would add, (c) files that are on the same volume as the executable;
    for example, files in your application bundle or one of your framework
    bundles should be safe to map in.  The reasoning here is that if a
    volume containing an executable your application is currently using
    goes away, the application is dead anyway--in fact, the executables
    themselves are usually mapped.  I'm not so certain about (b); it may
    be possible to survive the loss or temporary absence of the user's
    home directory, depending on what your application is doing.  However,
    the files I typically have wanted to map in have been fixed data files
    that are part of the system, the application, or one of its
    frameworks, for which the home directory would be an unlikely location
    anyway.

    Douglas Davidson
  • On May 13, 2008, at 17:00, Jens Alfke wrote:

    >
    > On 12 May '08, at 11:38 PM, Daniel Vollmer wrote:
    >
    >> I'm parsing a rather large text-file (usually >20MB) and in doing
    >> so I'm iterating over its lines with [String
    >> getParagraphStart::::]. I've found a rather noticeable speed-up in
    >> the parsing operation if I create the string in question from an
    >> NSData object (created via initWithContentsOfMappedFile) using
    >> [String initWithData:encoding:].
    >
    > It sounds like you're creating a single NSString containing the
    > entire contents of the file, then?

    Yes. Is that something I shouldn't do? I mean, I feel a tiny bit silly
    creating such huge strings but I didn't find a nice alternative (e.g.
    like Ruby's each-line iterator on file objects).

    >> 2) Are substrings created from the original string (e.g.
    >> substringWithRange etc.) still backed properly after the original
    >> string and the NSData object are released?
    >
    > Yes. Even if the NSString is still using the NSData's contents for
    > its buffer, it retained them, so releasing the NSData won't make it
    > go away until the string is done with it.

    But now that means that the strings are "endangered" from in-place
    file modification for the lifetime of my objects created during
    parsing, not just the initial parsing itself, correct?
    Also, it feels a bit silly to have a retain on the 20MB NSData object
    while I still hold references to about 5KB of string bytes from
    various places in the file. Usually all this "behind-the-scenes"
    storage retaining doesn't matter much, but I'd quite like to make sure
    I drop most of the 20MB once I'm done parsing. This question of course
    also applies if I'm not mapping the file and creating a String from it
    directly.

    FWIW, my current iteration looks like this (String being the big 20MB
    one):

    NSUInteger length = [String length];
    NSUInteger paraStart = 0, paraEnd = 0, contentsEnd = 0;

    while (paraEnd < length)
    {
        [String getParagraphStart:&paraStart end:&paraEnd
                      contentsEnd:&contentsEnd forRange:NSMakeRange(paraEnd, 0)];
        line = [String substringWithRange:NSMakeRange(paraStart, contentsEnd - paraStart)];
        // do lots of menial parsing of line
    }

    If I leave the mmapped reading in, it sounds like a sensible idea to
    check whether the file is on the same drive as the app. So thanks for
    that suggestion.

    Thanks for any further insight,
    Daniel.
  • On 13 May '08, at 12:55 PM, Daniel Vollmer wrote:

    >> It sounds like you're creating a single NSString containing the
    >> entire contents of the file, then?
    >
    > Yes. Is that something I shouldn't do? I mean, I feel a tiny bit
    > silly creating such huge strings but I didn't find a nice
    > alternative (e.g. like the Ruby for each line iterators on file
    > objects).

    Unfortunately streams are not Foundation's strong suit. You can use
    NSStream or NSFileHandle to read incrementally from a file, but the
    API's pretty low-level and you'll have to do things like decoding
    UTF-8 and parsing for line ends by yourself.

    > But now that means that the strings are "endangered" from in-place
    > file modification for the lifetime of my objects created during
    > parsing, not just the initial parsing itself, correct?

    The big 20MB string might be, yes. If you created any new NSStrings as
    substrings of it, I am pretty sure those have their own copies of the
    character data, so they should be immune.

    Note that even if you used a stream to read the file incrementally,
    you wouldn't be immune to something else modifying the file while you
    were reading it. So the effect isn't all that different. Just be sure
    to release and stop using the big 20MB string right after you finish
    scanning it.

    —Jens
  • I actually tested this a month back and not all operations/programs
    respect link counts nor does the system appear to enforce them.  For
    instance an rm -f will destroy the file regardless of link count, as
    well as some obscure APIs.  After the file was removed the mapping
    program crashed when trying to read more of the file; it failed to
    load new map data.

    On May 13, 2008, at 9:08 AM, Jens Alfke wrote:

    > You're correct about modifications, but not about deletions. An open
    > file descriptor counts as a link to a file, so the unlink(2) system
    > call will not actually delete the file from disk because there's
    > still a link to it. Once you close the mapped file, the last link
    > goes away and then the file is actually deleted.
  • Also I should add, I've yet to find a way to protect a file from
    editing or deletion on OS X that can't just be ignored by something
    else.  Things like flock appear to be optionally supported and not
    globally enforced.  As long as a way exists to get around any kind of
    file lock there's no way to guarantee a specific file will be
    unchanged during usage.  This is partially done to prevent file lock
    abuse I'm guessing.

    As Mr. Davidson pointed out, the best you can do is only map files you
    can be reasonably sure no one will want to modify unless purposely
    trying to corrupt your program.

    On May 13, 2008, at 9:57 PM, Michael Vannorsdel wrote:

    > I actually tested this a month back and not all operations/programs
    > respect link counts nor does the system appear to enforce them.  For
    > instance an rm -f will destroy the file regardless of link count, as
    > well as some obscure APIs.  After the file was removed the mapping
    > program crashed when trying to read more of the file; it failed to
    > load new map data.
    >
    >
    > On May 13, 2008, at 9:08 AM, Jens Alfke wrote:
    >
    >> You're correct about modifications, but not about deletions. An
    >> open file descriptor counts as a link to a file, so the unlink(2)
    >> system call will not actually delete the file from disk because
    >> there's still a link to it. Once you close the mapped file, the
    >> last link goes away and then the file is actually deleted.
    >
    I suggest reading the entire file into your NSData with
    initWithContentsOfFile: if there's a significant chance of file
    modification.  I know this sounds like huge memory usage but this
    way you can know your data is static and the system is designed to
    handle high memory usage programs.  If there's some pages of your data
    you haven't touched in a while the system will swap those out and use
    the physical pages for something that needs them.

    When making substrings from the data you can avoid redundant copies
    by using initWithBytesNoCopy:length:encoding:freeWhenDone: for
    string objects that only reference the characters from the data.  They
    depend on the NSData object so it needs to be valid for these strings
    to work.  If you want permanent copies of some substrings you can use
    initWithString: to make them; I'm not sure if using copy will just
    make another object only referencing the data like the original.  If
    you make the substrings with any of NSString's substring* methods, the
    returned strings will be independent and won't become invalid if the
    NSData or base string are released.

    On May 13, 2008, at 1:55 PM, Daniel Vollmer wrote:

    > On May 13, 2008, at 17:00, Jens Alfke wrote:
    >
    >>
    >> On 12 May '08, at 11:38 PM, Daniel Vollmer wrote:
    >> It sounds like you're creating a single NSString containing the
    >> entire contents of the file, then?
    >
    > Yes. Is that something I shouldn't do? I mean, I feel a tiny bit
    > silly creating such huge strings but I didn't find a nice
    > alternative (e.g. like the Ruby for each line iterators on file
    > objects).
    >
    >> Yes. Even if the NSString is still using the NSData's contents for
    >> its buffer, it retained them, so releasing the NSData won't make it
    >> go away until the string is done with it.
    >
    > But now that means that the strings are "endangered" from in-place
    > file modification for the lifetime of my objects created during
    > parsing, not just the initial parsing itself, correct?
    > Also, it feels a bit silly to have a retain on the 20MB NSData
    > object while I still hold references to about 5KB of string bytes
    > from various places in the file. Usually all this "behind-the-
    > scenes" storage retaining doesn't matter much, but I'd quite like to
    > make sure I drop most of the 20MB once I'm done parsing. This
    > question of course also applies if I'm not mapping the file and
    > creating a String from it directly
    >
    >
    > FWIW, my current iteration looks like this (String being the big
    > 20MB one);
    >
    > NSUInteger length = [String length];
    > NSUInteger paraStart = 0, paraEnd = 0, contentsEnd = 0;
    >
    > while (paraEnd < length)
    > {
    > [String getParagraphStart:&paraStart end:&paraEnd
    > contentsEnd:&contentsEnd forRange:NSMakeRange(paraEnd, 0)];
    > line = [String substringWithRange:NSMakeRange(paraStart,
    > contentsEnd - paraStart)];
    > // do lots of menial parsing of line
    > }
    >
    > If I leave the mmaped reading in, it sounds like a sensible idea to
    > check whether the file is on the same drive as the app. So thanks
    > for that suggestion.
  • On 13 May '08, at 9:39 PM, Michael Vannorsdel wrote:

    > If there's some pages of your data you haven't touched in a while
    > the system will swap those out and use the physical pages for
    > something that needs them.

    Yes, but it's less efficient than a mapped file, which doesn't have to
    be swapped out at all.

    The OS may have virtual memory, but swapping when the system is under
    memory pressure is the chief performance problem in OS X; when I
    worked at Apple, the performance people drilled it into us that the
    most important optimization is saving memory. (For example, that's why
    Release builds use -Os by default instead of -O2.)

    > When making substrings from the data you can save making redundant
    > copies using initWithBytesNoCopy:length:encoding:freeWhenDone: for
    > string objects that only reference the characters from the data.
    > They depend on the NSData object so it needs to be valid for these
    > strings to work.

    This is tricky and dangerous. It's very difficult to predict object
    lifespans in a ref-counted or GC'd environment, and if any of those
    little strings are still being retained by something when you free the
    big string, they all turn into land mines that will crash the app the
    next time they're referenced.

    It's possible to make this work, but I would only try it as a last-
    ditch optimization if the sheer volume of copied strings was choking
    performance.

    (Or -initWithBytesNoCopy: might just copy the bytes anyway. It's only
    a hint, not a guarantee. I believe it can only use the raw bytes
    without copying if they're in UTF-16, ASCII, or MacRoman encoding.)

    —Jens
  • On 13 May '08, at 8:57 PM, Michael Vannorsdel wrote:

    > I actually tested this a month back and not all operations/programs
    > respect link counts nor does the system appear to enforce them.  For
    > instance an rm -f will destroy the file regardless of link count, as
    > well as some obscure APIs.  After the file was removed the mapping
    > program crashed when trying to read more of the file; it failed to
    > load new map data.

    I'm surprised to hear that. If it's true, that would be some kind of
    OS bug. You should file a bug report, or at least bring it up on
    filesystem-dev, if you haven't already.

    —Jens
  • Obviously this is less efficient than mapping which is why mapping is
    still around.  But it's kinda the only other option when mapping and
    FS operations won't protect your mapped file.  My point was that the
    program and system will still run under heavy memory consumption,
    albeit slowly.  Just another trade-off among shortcuts, reliability,
    and performance.

    Indeed making string references can be tricky and requires attention
    to detail.  But it can help make up for the heavy memory consumption
    of the in-memory data, which was the intent of my suggestion (bulk up
    here, trim some there).  If you know what you're doing this can
    perform well, reliably, and be a fair trade off between performance
    and predictability.

    Until file protection locks are guaranteed, I would expect the
    workaround to be less efficient and less elegant.

    On May 13, 2008, at 11:32 PM, Jens Alfke wrote:

    > Yes, but it's less efficient than a mapped file, which doesn't have
    > to be swapped out at all.
    >
    > The OS may have virtual memory, but swapping when the system is
    > under memory pressure is the chief performance problem in OS X; when
    > I worked at Apple, the performance people drilled it into us that
    > the most important optimization is saving memory. (For example,
    > that's why Release builds use -Os by default instead of -O2.)
    >
    >>
    > This is tricky and dangerous. It's very difficult to predict object
    > lifespans in a ref-counted or GC'd environment, and if any of those
    > little strings are still being retained by something when you free
    > the big string, they all turn into land mines that will crash the
    > app the next time they're referenced.
    >
    > It's possible to make this work, but I would only try it as a last-
    > ditch optimization if the sheer volume of copied strings was choking
    > performance.
    >
    I was going to, with sample code to show it, but I didn't see any
    documentation that said file locks and link counts were guaranteed.

    The Finder (or its underlying API) was smart enough to check link
    counts and gave notice the file was in use.  But rm just wiped it off
    the FS and left nothing to map from.  This tells me link counts and
    locks are not checked by every FS API nor is it enforced at the FS
    level.

    On May 13, 2008, at 11:33 PM, Jens Alfke wrote:

    > I'm surprised to hear that. If it's true, that would be some kind of
    > OS bug. You should file a bug report, or at least bring it up on
    > filesystem-dev, if you haven't already.
  • Am 14.05.2008 um 15:32 schrieb Michael Vannorsdel <mikevann...>:

    > I was going to with sample code to show it, but I didn't see any
    > documentation that said file locks and link counts were guaranteed.

    See man unlink:
    "[...] If that decrement reduces the link count of the file to zero,
    and no process has the file open, then all resources associated with
    the file are reclaimed. If one or more process have the file open
    when the last link is removed, the link is removed, but the removal
    of the file is delayed until all references to it have been closed."

    > The Finder (or its underlying API) was smart enough to check link
    > counts and gave notice the file was inuse.  But rm just wiped it off
    > the FS and left nothing to map from.  This tells me link counts and
    > locks are not checked by every FS API nor is it enforced at the FS
    > level.

    If rm actually removed an open file (not the directory entry but the
    file itself) then that would be a serious bug and should be reported.

    Mike
    --
    Mike Fischer    Softwareentwicklung, EDV-Beratung
                                    Schulung, Vertrieb
    Note:            I read this list in digest mode!
          Send me a private copy for faster responses.
  • Le 14 mai 08 à 16:32, Mike Fischer a écrit :

    > Am 14.05.2008 um 15:32 schrieb Michael Vannorsdel
    > <mikevann...>:
    >
    >> I was going to with sample code to show it, but I didn't see any
    >> documentation that said file locks and link counts were guaranteed.
    >
    > See man unlink:
    > "[...] If that decrement reduces the link count of the file to zero,
    > and no process has the file open, then all resources associated with
    > the file are reclaimed. If one or more process have the file open
    > when the last link is removed, the link is removed, but the removal
    > of the file is delayed until all references to it have been closed."
    >
    >
    >
    >> The Finder (or its underlying API) was smart enough to check link
    >> counts and gave notice the file was inuse.  But rm just wiped it off
    >> the FS and left nothing to map from.  This tells me link counts and
    >> locks are not checked by every FS API nor is it enforced at the FS
    >> level.
    >
    > If rm actually removed an open file (not the directory entry but the
    > file itsel) then that would be a serious bug and should be reported.

    At least on an HFS+ volume, if a process has the file open, rm does
    not remove it but moves it somewhere nobody can find it until it is
    no longer in use (there is a folder that cannot be accessed by path
    on HFS volumes). Then it is deleted.
    It doesn't move the file -- it removes the entry for it in the
    directory.  Once the reference count for it goes to 0, it gets
    "removed" from the filesystem -- i.e., its space on the filesystem
    gets marked as being available.

    dennis

    On Wed, May 14, 2008 at 11:05 AM, Jean-Daniel Dupas
    <devlists...> wrote:
    > At least on HFS+ volume, if one process uses a file, rm does not remove it
    > but move it somewhere nobody can find it until it is no longer used (there
    > is some folder that cannot be access using path on HFS volumes). Then it is
    > deleted.
  • On 15 May 08, at 09:42, Dennis Munsie wrote:
    > It doesn't move the file -- it removes the entry for it in the
    > directory.  Once the reference count for it go to 0, then it gets
    > "removed" from the filesystem -- i.e, it's space on the filesystem
    > gets marked as being available.

    On a standard UNIX filesystem, this is the case. HFS and HFS+ are a
    little weird, though - files don't exist separately from their
    directory entries, so hardlinked files and deleted-but-still-open
    files are stored in an inaccessible directory ("\0\0\0\0HFS+ Private
    Data").