Reading Very Large Files (300 MB to 3 GB)

  • I need to process very large files, and below is the program I use to do the work. I have run this program on data files ranging from very small up to over 3 GB in length. Much of my testing has been done with files in the 200 to 300 MB range, and the program works fine at that size.

    However, when I move up to files in the 2 to 4 GB range, the behavior changes. The program starts consuming great amounts of virtual memory, around 14 GB, and takes more than half an hour to run. After the functional part of the program is over, it takes another half hour for the program to give back much of the virtual memory, and once the program does fully quit, the operating system thrashes for another 10 minutes or so before the final amount of virtual memory is returned and the hard drive finally calms down.

    I've never processed such massive files before, and I am surprised by the behavior. As you will see, I'm using memory-mapped NSData, and once I start processing the data I simply proceed through it from beginning to end, separating the data into newline-separated lines and processing the lines. That processing is simple: just breaking each line into vertical-bar separated fields and putting some of those field values into dictionaries.

    If I am simply reading through memory-mapped data like this, why does the program use about six times as much virtual memory as the file itself needs? Why does the virtual memory accumulate in the first place, since I never return to memory pages I have already read through? And why does it take three quarters of an hour for the system to calm down again after the processing has finished?

    I hope someone with experience dealing with very large files might see something pretty silly in this code and have a pointer or two to share.

    Thanks,

    Tom Wetmore,
    Chief Bottle Washer, DeadEnds Software
    ------------------------------------------------

    #import <Foundation/Foundation.h>

    static void processLine (NSString*);

    int main(int argc, const char * argv[])
    {
        @autoreleasepool {

            NSError* error = nil;
            NSString* path = @"/Volumes/Iomega HDD/Data/data";
            NSData* data = [NSData dataWithContentsOfFile: path
                                                  options: NSDataReadingMappedAlways | NSDataReadingUncached
                                                    error: &error];
            if (data == nil) {
                NSLog(@"Could not map file %@: %@", path, error);
                return 1;
            }
            NSUInteger length = [data length];
            const Byte* bytes = [data bytes];

            NSUInteger start = 0;
            NSUInteger end = 0;
            NSString* line;
            while (YES) {
                if (start >= length) break;
                while (end < length && bytes[end] != '\n') {
                    end++;
                }
                line = [[NSString alloc] initWithBytes: bytes + start length: end - start encoding: NSUTF8StringEncoding];
                processLine(line);
                start = end + 1;
                end = start;
            }
        }
        return 0;
    }

    void processLine (NSString* line)
    {
            // ... break line into 74 vertical-bar separated fields ... and do simple things
    }
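
    The body of processLine is elided above. For concreteness, here is a minimal sketch of what such a function might look like; the field positions, the fieldTable name, and the choice of key are invented for illustration and are not from the original program.

        static NSMutableDictionary* fieldTable = nil;

        static void processLine (NSString* line)
        {
            // Hypothetical sketch: split the line on vertical bars and
            // store one field in a dictionary keyed by another.
            NSArray* fields = [line componentsSeparatedByString: @"|"];
            if ([fields count] < 2) return;    // skip malformed lines
            if (fieldTable == nil)
                fieldTable = [[NSMutableDictionary alloc] init];
            [fieldTable setObject: [fields objectAtIndex: 1]
                           forKey: [fields objectAtIndex: 0]];
        }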
  • On Jul 26, 2012, at 8:20 PM, Thomas Wetmore <ttw4...> wrote:
    > I need to process very large files, and below is the program I use to do the work. [...]

    You should use the Allocations instrument to see what is hogging your memory.

    My guess is that the memory-mapped NSData is fine, but that your NSString and other code inside processLine() is allocating objects and not freeing them.

    One simple possibility is that you are creating lots of autoreleased objects but never draining an autorelease pool, so they don't get deallocated until you are all done. Try this:

          while (YES) {
            @autoreleasepool {
              if (start >= length) break;
              while (end < length && bytes[end] != '\n') {
                  end++;
              }
              line = [[NSString alloc] initWithBytes: bytes + start length: end - start encoding: NSUTF8StringEncoding];
              processLine(line);
              start = end + 1;
              end = start;
            }
          }

    (Also, if you are not using ARC then that NSString is leaking, which will also cost lots of memory.)
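
    For anyone not using ARC, a sketch of the manual-reference-counting version of the loop body, with the explicit release that ARC would otherwise insert for you:

          line = [[NSString alloc] initWithBytes: bytes + start
                                          length: end - start
                                        encoding: NSUTF8StringEncoding];
          processLine(line);
          [line release];   // required under MRC; not permitted under ARC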

    --
    Greg Parker    <gparker...>    Runtime Wrangler
  • Greg,

    Thanks for the INSTANT answer! I added the autorelease pool inside the read loop and ran the program on the largest data file I have, 3.46 GB. The program ran perfectly in just under nine minutes and never built up any virtual memory.

    In hindsight I am embarrassed I did not come to the answer myself, as I have a fairly good understanding of all the supported memory management models. ARC tends to make one stop worrying, which tends to make one stop thinking. No excuses, though. I was too dim to see it.

    Thanks again. You nailed it for me.

    Tom Wetmore

    On Jul 26, 2012, at 11:29 PM, Greg Parker wrote:

    > On Jul 26, 2012, at 8:20 PM, Thomas Wetmore <ttw4...> wrote:
    >> I need to process very large files, and below is the program I use to do the work. [...]
    >
    > You should use the Allocations instrument to see what is hogging your memory. [...]
    >
    > --
    > Greg Parker    <gparker...>    Runtime Wrangler