Garbage collection, Core Data, and tight loops

  • Greetings -

    I'm trying to process an enormous XML file that should ultimately fill
    a Core Data sqlite store with millions of records.  This should only
    happen once, so i'm willing to let the app go non-responsive; i'm
    doing all the parsing in an extremely tight loop.  My first time
    through, i did things with manual memory management, but the memory
    use kept going up.  So, i switched to mandatory Garbage Collection,
    leaving my autorelease pool statements in place, since they should be
    converted to no-ops.  I confirmed that a collector thread existed when
    i launched the app.

    It still ran out of memory:

    malloc: *** auto malloc[333]: Zone::Can not allocate 0x8200000 bytes

    I'm assuming the problem is an interaction between the tight loop and
    either Core Data or the Garbage Collector:

      1. The GC doesn't recognize which objects to collect while the loop
         is still chugging away.
      2. The sqlite store can't send objects out to disk and release their
         memory while the loop is still chugging away.

    I'm leaning towards the latter explanation.

    I'm going to redesign things so that the main runloop has a chance to
    act sporadically, but i was just curious as to which of these is a
    problem so that i can design specifically to avoid this in the future.

    Thanks,

    John
  • On Oct 30, 2007, at 2:03 PM, John R. Timmer wrote:
    > I'm trying to process an enormous XML file that should ultimately
    > fill a Core Data sqlite store with millions of records.  This should
    > only happen once, so i'm willing to let the app go non-responsive;
    > i'm doing all the parsing in an extremely tight loop.  My first time
    > through, i did things with manual memory management, but the memory
    > use kept going up.  So, i switched to mandatory Garbage Collection,
    > leaving my autorelease pool statements in place, since they should
    > be converted to no-ops.  I confirmed that a collector thread existed
    > when i launched the app.
    >
    > It still ran out of memory:
    >
    > malloc: *** auto malloc[333]: Zone::Can not allocate 0x8200000 bytes
    >
    > I'm assuming the problem is an interaction between the tight loop
    > and either Core Data or the Garbage Collector:
    > The GC doesn't recognize which objects to collect while the loop is
    > still chugging away.
    > The sqlite store can't send objects out to disk and release their
    > memory while the loop is still chugging away.
    >
    > I'm leaning towards the latter explanation.
    >
    > I'm going to redesign things so that the main runloop has a chance
    > to act sporadically, but i was just curious as to which of these is
    > a problem so that i can design specifically to avoid this in the
    > future.

    This sounds like you might have a situation where allocations are
    outrunning the collector (a known issue that is being worked on).
    Specifically, if you are allocating tons and tons of memory in a tight
    loop, the collector can't scan and purge the garbage quickly enough
    and, thus, you can cause this kind of behavior (there is more
    information on this in the documentation).

    However, that could easily not be the case -- I would need more
    information to know.

    Specifically -- are you periodically saving changes in your loop?

    Are you asking Core Data to purge objects?

    Does your object graph allow partial subgraphs to exist in memory
    without eventually faulting everything into memory?

    ... do you have any kind of caches or hashes that are effectively
    keeping the objects around?

    b.bum
  • On Oct 30, 2007, at 5:56 PM, Bill Bumgarner wrote:
    > This sounds like you might have a situation where allocations are
    > outrunning the collector (a known issue that is being worked on).
    > Specifically, if you are allocating tons and tons of memory in a
    > tight loop, the collector can't scan and purge the garbage quickly
    > enough and, thus, you can cause this kind of behavior (there is more
    > information on this in the documentation).
    >
    > However, that could easily not be the case -- I would need more
    > information to know.
    >
    > Specifically -- are you periodically saving changes in your loop?

    I have been periodically saving changes in the loop.

    > Are you asking Core Data to purge objects?

    I neglected to manually purge the objects - i assume that would
    involve using the ManagedObjectContext's -reset method?

    > Does your object graph allow partial subgraphs to exist in memory
    > without eventually faulting everything into memory?

    At this point, I'm only importing a single type of managed object to
    see how well things work with millions of records, and to make sure
    the project's feasible as designed before going further, so this is a
    non-issue.

    > ... do you have any kind of caches or hashes that are effectively
    > keeping the objects around?

    No, nothing should be retaining the objects - i'm setting values and
    moving on.

    I'll give it another shot with resetting the ManagedObjectContext and
    see if that does the trick.

    Thanks,

    John

  • On Oct 31, 2007, at 5:37 AM, John R. Timmer wrote:
    > I have been periodically saving changes in the loop.
    >> Are you asking Core Data to purge objects?
    > I neglected to manually purge the objects - i assume that would
    > involve using the ManagedObjectContext's -reset method?
    >> Does your object graph allow partial subgraphs to exist in memory
    >> without eventually faulting everything into memory?
    > At this point, I'm only importing a single type of managed object to
    > see how well things work with millions of records, and to make sure
    > the project's feasible as designed before going further, so this is
    > a non-issue.
    >> ... do you have any kind of caches or hashes that are effectively
    >> keeping the objects around?
    > No, nothing should be retaining the objects - i'm setting values and
    > moving on.
    > I'll give it another shot with resetting the ManagedObjectContext
    > and see if that does the trick.

    Ben will hopefully pipe up w/some info on how to do that.

    I would also suggest that you move to a SAX parser, regardless, if you
    haven't already done so.  It'll be much faster and more memory
    efficient than DOM.

    b.bum
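
    To illustrate the SAX suggestion: a minimal sketch using NSXMLParser,
    Foundation's event-driven parser, which calls a delegate per element
    instead of building a document tree.  The thread does not say which
    parser was used, so NSXMLParser is an assumption here.  The class
    name RecordCounter, the function CountRecords, and the element name
    "record" are hypothetical.  It assumes manual retain/release (not
    ARC), as elsewhere in the thread.

    ```objc
    #import <Foundation/Foundation.h>

    // Hypothetical delegate that counts <record> elements as they stream by.
    @interface RecordCounter : NSObject <NSXMLParserDelegate> {
        NSUInteger recordCount;
    }
    - (NSUInteger)recordCount;
    @end

    @implementation RecordCounter

    // Called once per opening tag; no tree of the whole document is kept.
    - (void)parser:(NSXMLParser *)parser
    didStartElement:(NSString *)elementName
      namespaceURI:(NSString *)namespaceURI
     qualifiedName:(NSString *)qName
        attributes:(NSDictionary *)attributeDict
    {
        if ([elementName isEqualToString:@"record"]) {
            recordCount++;  // a real importer might create a managed object here
        }
    }

    - (NSUInteger)recordCount { return recordCount; }

    @end

    // Parses the given XML bytes and returns the number of <record> elements.
    static NSUInteger CountRecords(NSData *xml)
    {
        NSXMLParser *parser = [[NSXMLParser alloc] initWithData:xml];
        RecordCounter *counter = [[RecordCounter alloc] init];
        [parser setDelegate:counter];
        [parser parse];
        NSUInteger count = [counter recordCount];
        [parser release];
        [counter release];
        return count;
    }
    ```

    Note that -initWithData: still needs the raw bytes in memory; what the
    event-driven style avoids is the much larger in-memory node tree.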
  • At 8:31 AM -0700 10/31/07, Bill Bumgarner wrote:
    > On Oct 31, 2007, at 5:37 AM, John R. Timmer wrote:
    >> I have been periodically saving changes in the loop.
    >>> Are you asking Core Data to purge objects?
    >> I neglected to manually purge the objects - i assume that would
    >> involve using the ManagedObjectContext's -reset method?
    >>> Does your object graph allow partial subgraphs to exist in memory
    >>> without eventually faulting everything into memory?
    >> At this point, I'm only importing a single type of managed object
    >> to see how well things work with millions of records, and to make
    >> sure the project's feasible as designed before going further, so
    >> this is a non-issue.
    >>> ... do you have any kind of caches or hashes that are
    >>> effectively keeping the objects around?
    >> No, nothing should be retaining the objects - i'm setting values
    >> and moving on.
    >> I'll give it another shot with resetting the ManagedObjectContext
    >> and see if that does the trick.
    >
    >
    > Ben will hopefully pipe up w/some info on how to do that.

    You can find information about memory management with Core Data in
    the Core Data Programming Guide.

    You will want to disable the MOC's undo manager for both background
    and batch operations.

    Objects with relationships are often caught in retain cycles.  These
    can be broken using -refreshObject:mergeChanges:NO or -reset (which
    invalidates all managed objects that context is observing) or
    releasing (deallocating) the MOC (which also invalidates all the
    managed objects it's observing).  This problem goes away under GC as
    both the MO and its MOC have, effectively, weak references to each
    other.

    However, there is a complication.  Although we discourage it, under
    some circumstances it is permissible to pass a MOC to another thread.
    So the deallocation (or finalization) of managed objects must be
    thread safe.  Because these objects are wired into a graph of
    objects, and have framework resources, efficiently managing this
    thread safety is quite complicated.

    A side effect of thread safe deallocation of managed objects is that
    deallocation is deferred until the MOC can safely clean up state.
    The MOC maintains a queue of pending deallocations, which it handles
    during various operations like fetching, saving, the end of the
    event, etc.  Of course, resetting or releasing the MOC also purges
    any pending deallocations.

    You can manually force the MOC to poll with -processPendingChanges.

    Under the retain model, objects fetched are always in the autorelease
    pool, as well.
    --

    -Ben
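
    The calls mentioned above can be combined roughly as follows.  This
    is only a sketch, assuming manual retain/release; the function names
    PrepareContextForImport and TrimRegisteredObjects are hypothetical.
    Objects with unsaved changes are skipped, because
    -refreshObject:mergeChanges:NO would discard those changes.

    ```objc
    #import <Foundation/Foundation.h>
    #import <CoreData/CoreData.h>

    // Hypothetical helper: switch off undo bookkeeping before a batch or
    // background operation.
    static void PrepareContextForImport(NSManagedObjectContext *moc)
    {
        [moc setUndoManager:nil];
    }

    // Hypothetical helper: turn every clean, realized object back into a
    // fault so that relationship retain cycles are broken, then let the
    // context work through its pending state.  Returns the number of
    // objects that were turned back into faults.
    static NSUInteger TrimRegisteredObjects(NSManagedObjectContext *moc)
    {
        NSUInteger trimmed = 0;
        // Snapshot the set so refreshing cannot disturb the enumeration.
        NSArray *objects = [[moc registeredObjects] allObjects];
        for (NSManagedObject *object in objects) {
            BOOL dirty = [object isInserted] || [object isUpdated]
                         || [object isDeleted];
            if (!dirty && ![object isFault]) {
                [moc refreshObject:object mergeChanges:NO];
                trimmed++;
            }
        }
        [moc processPendingChanges];
        return trimmed;
    }
    ```

    When nothing in the context needs to survive, calling -reset is the
    blunter alternative to trimming objects one at a time.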
  • Okay, so for completeness, i thought i'd share the resolution of this
    with the list, in case someone finds this thread in the future.

    I set the app up to create small batches of ManagedObjects (several
    hundred at a time) and added a small pause (0.01 sec.) between batches
    via a delayed selector.

    Before creating any ManagedObjects, i cancelled undo registration.

    Once a batch was created, i processed pending changes, saved, and
    reset the managedObjectContext.

    Incidentally, resetting appears to create a brand new undo manager
    for the context, so i disabled undo registration each time; there was
    no need to re-enable when done.

    Using an autorelease pool for each batch worked well, keeping memory
    use extremely low.

    Using garbage collection resulted in a significant memory gain, but
    nowhere near bad enough to crash the program.  Oddly, the memory use
    did not subside after the loop had finished.

    So I'll probably stick to manually handling memory, at least for the
    import app.

    JT
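
    For anyone finding this later, the batching described above might
    look roughly like the sketch below.  It is not the actual import
    code: the function name ImportRecords, its parameters, and the idea
    of feeding it dictionaries from an NSEnumerator are all assumptions.
    Undo is switched off by setting the undo manager to nil, and manual
    retain/release is assumed.

    ```objc
    #import <Foundation/Foundation.h>
    #import <CoreData/CoreData.h>

    // Hypothetical importer.  `records` yields NSDictionary objects whose
    // keys match the attribute names of `entityName`.  Each batch gets its
    // own autorelease pool and is saved, then the context is reset.
    static BOOL ImportRecords(NSManagedObjectContext *moc,
                              NSString *entityName,
                              NSEnumerator *records,
                              NSUInteger batchSize,
                              NSError **outError)
    {
        if (batchSize == 0) batchSize = 1;   // avoid an endless loop
        [moc setUndoManager:nil];            // no undo bookkeeping

        BOOL more = YES;
        while (more) {
            NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
            NSUInteger inBatch = 0;
            NSDictionary *record = nil;

            while (inBatch < batchSize
                   && (record = [records nextObject]) != nil) {
                NSManagedObject *object = [NSEntityDescription
                    insertNewObjectForEntityForName:entityName
                             inManagedObjectContext:moc];
                [object setValuesForKeysWithDictionary:record];
                inBatch++;
            }
            more = (inBatch == batchSize);   // a short batch means we ran dry

            [moc processPendingChanges];
            NSError *error = nil;
            BOOL saved = [moc save:&error];
            [moc reset];                     // forget the objects just saved

            if (!saved) [error retain];      // keep the error past the drain
            [pool drain];
            if (!saved) {
                [error autorelease];
                if (outError) *outError = error;
                return NO;
            }
        }
        return YES;
    }
    ```

    Pulling records from an enumerator rather than an array keeps the
    input side from holding every record in memory at once.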

    On Oct 31, 2007, at 10:41 PM, Ben Trumbull wrote:

    > At 8:31 AM -0700 10/31/07, Bill Bumgarner wrote:
    >> On Oct 31, 2007, at 5:37 AM, John R. Timmer wrote:
    >>> I have been periodically saving changes in the loop.
    >>>> Are you asking Core Data to purge objects?
    >>> I neglected to manually purge the objects - i assume that would
    >>> involve using the ManagedObjectContext's -reset method?
    >>>> Does your object graph allow partial subgraphs to exist in memory
    >>>> without eventually faulting everything into memory?
    >>> At this point, I'm only importing a single type of managed object
    >>> to see how well things work with millions of records, and to make
    >>> sure the project's feasible as designed before going further, so
    >>> this is a non-issue.
    >>>> ... do you have any kind of caches or hashes that are
    >>>> effectively keeping the objects around?
    >>> No, nothing should be retaining the objects - i'm setting values
    >>> and moving on.
    >>> I'll give it another shot with resetting the ManagedObjectContext
    >>> and see if that does the trick.
    >>
    >>
    >> Ben will hopefully pipe up w/some info on how to do that.
    >
    > You can find information about memory management with Core Data in
    > the Core Data Programming Guide.
    >
    > You will want to disable the MOC's undo manager for both background
    > and batch operations.
    >
    > Objects with relationships are often caught in retain cycles.  These
    > can be broken using -refreshObject:mergeChanges:NO or -reset (which
    > invalidates all managed objects that context is observing) or
    > releasing (deallocating) the MOC (which also invalidates all the
    > managed objects it's observing).  This problem goes away under GC as
    > both the MO and its MOC have, effectively, weak references to each
    > other.
    >
    > However, there is a complication.  Although we discourage it, under
    > some circumstances it is permissible to pass a MOC to another
    > thread. So the deallocation (or finalization) of managed objects
    > must be thread safe.  Because these objects are wired into a graph
    > of objects, and have framework resources, efficiently managing this
    > thread safety is quite complicated.
    >
    > A side effect of thread safe deallocation of managed objects is that
    > deallocation is deferred until the MOC can safely clean up state.
    > The MOC maintains a queue of pending deallocations, which it handles
    > during various operations like fetching, saving, the end of the
    > event, etc.  Of course, resetting or releasing the MOC also purges
    > any pending deallocations.
    >
    > You can manually force the MOC to poll with -processPendingChanges.
    >
    > Under the retain model, objects fetched are always in the
    > autorelease pool, as well.
    > --
    >
    > -Ben
  • > Okay, so for completeness, i thought i'd share the resolution of this
    > with the list, in case someone finds this thread in the future.
    >
    > I set the app up to create small batches of ManagedObjects (several
    > hundred at a time) and added a small pause (0.01 sec.) between batches
    > via a delayed selector.

    You shouldn't need a pause, and you should have good results with
    batches in the thousands.

    > Before creating any ManagedObjects, i cancelled undo registration.
    >
    > Once a batch was created, i processed pending changes, saved, and
    > reset the managedObjectContext.
    > incidentally, resetting appears to create a brand new undo manager
    > for the context, so i disabled each time; there was no need to re-
    > enable when done.

    To disable the undo manager on NSManagedObjectContext, set it to nil.
    -reset does not recreate an undo manager.

    > Using an autorelease pool for each batch worked well, keeping memory
    > use extremely low.
    >
    > Using garbage collection resulted in a significant memory gain, but
    > nowhere near bad enough to crash the program.  Oddly, the memory use
    > did not subside after the loop had finished.

    How did you measure that?  'heap' will provide more useful
    information than 'top'.

    - Ben
  • On Nov 2, 2007, at 12:01 PM, John R. Timmer wrote:

    > Using an autorelease pool for each batch worked well, keeping memory
    > use extremely low.

    One other tip:  Switch from using [pool release] to using [pool drain]
    for your NSAutoreleasePool instances, and see how that affects the
    version of your application running under garbage collection.  It will
    look "wrong" at first (since you'll no longer be balancing the
    [[NSAutoreleasePool alloc] init] with a -release) but the -drain
    message will *not* be consumed by the runtime under GC as -release
    will.  Instead, it actually provides a hint to the collector that
    it may be a good time to pick up some newly-generated garbage.

      -- Chris
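
    A small sketch of that pattern, with hypothetical function names
    (ProcessChunk, ProcessAllChunks) and a made-up workload.  Under
    retain/release, -drain behaves like -release; under GC it is the
    call that still reaches the collector.  This does not apply to ARC,
    where NSAutoreleasePool is unavailable.

    ```objc
    #import <Foundation/Foundation.h>

    // Hypothetical unit of work that creates many temporary objects.
    static NSUInteger ProcessChunk(NSUInteger chunkIndex)
    {
        NSMutableArray *scratch = [NSMutableArray array];
        for (NSUInteger i = 0; i < 1000; i++) {
            [scratch addObject:[NSString stringWithFormat:@"%lu-%lu",
                                (unsigned long)chunkIndex, (unsigned long)i]];
        }
        return [scratch count];
    }

    // Runs every chunk inside its own pool and returns the total number
    // of temporary objects created.
    static NSUInteger ProcessAllChunks(NSUInteger chunkCount)
    {
        NSUInteger total = 0;
        for (NSUInteger chunk = 0; chunk < chunkCount; chunk++) {
            NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
            total += ProcessChunk(chunk);
            // -drain rather than -release: it frees the pool under
            // retain/release and hints the collector under GC.
            [pool drain];
        }
        return total;
    }
    ```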
  • On Nov 3, 2007, at 8:26 PM, Ben Trumbull wrote:

    >> Okay, so for completeness, i thought i'd share the resolution of this
    >> with the list, in case someone finds this thread in the future.
    >>
    >> I set the app up to create small batches of ManagedObjects (several
    >> hundred at a time) and added a small pause (0.01 sec.) between
    >> batches
    >> via a delayed selector.
    >
    > You shouldn't need a pause, and you should have good results with
    > batches in the thousands.

    Batches of about 350 just turned out to be what i wound up with when
    reading the input file a megabyte at a time.  The pause just helped me
    avoid a recursive call and set me up for threading things.

    >
    >> Before creating any ManagedObjects, i cancelled undo registration.
    >>
    >> Once a batch was created, i processed pending changes, saved, and
    >> reset the managedObjectContext.
    >> incidentally, resetting appears to create a brand new undo manager
    >> for the context, so i disabled each time; there was no need to re-
    >> enable when done.
    >
    > To disable the undo manager on NSManagedObjectContext, set it to
    > nil.  -reset does not recreate an undo manager.

    Interesting.  I disabled it using disableUndoRegistration, and then
    tried to re-enable it after resetting the managedObjectContext, and
    the app logged a complaint that suggested it had already been re-
    enabled.

    >
    >
    >> Using an autorelease pool for each batch worked well, keeping memory
    >> use extremely low.
    >>
    >> Using garbage collection resulted in a significant memory gain, but
    >> nowhere near bad enough to crash the program.  Oddly, the memory use
    >> did not subside after the loop had finished.
    >
    > How did you measure that ?  'heap' will provide more useful
    > information than 'top'

    I had been using top; for the GC version, rsize wound up more than
    double that of the manually managed version, while vsize was up about
    .4GB (vsize didn't budge with manual memory management).  I'm partway
    through rewriting things to thread, so i'm not really in a position to
    go back and make a comparison using heap.

    JT