NSOperationQueue broken?

  • I hate to blame an OS bug but I see no other explanation, so here we go.

    I have a program which uses NSOperationQueue heavily. It uses lots of
    different queues each of which has a max concurrent operation count
    set to 1. In this way, the NSOperationQueue functions as a
    serialization mechanism, like a lock but easier to use. This is
    extremely handy.

    One unusual thing it does, which I think may be the thing that kills
    it, is that it often enqueues a new operation from inside of an
    existing one being run on the same queue. This is to provide more
    granularity for cancellation or priorities. For example, when there is
    a lot of data to process, the NSOperation I enqueue just processes the
    first piece of it. Then when it's done processing, it enqueues a
    second NSOperation which will continue the processing. This way if I
    want to cancel processing or do some higher-priority operation, it
    doesn't have to wait for everything.

    Anyway, I very rarely get this exception:

    *** -[NSInvocationOperation start]: receiver has already started or finished

    And this then crashes the app, because it's happening on an internal
    NSOperationQueue thread which doesn't have an exception handler. The
    rarity made this really difficult to debug, but I finally twigged to
    the problem and wrote a test case which reproduces the exception
    easily... on some hardware. This is that test case:

    #import <Foundation/Foundation.h>

    @interface Tester : NSObject
    {
        NSOperationQueue *_queue;
    }

    - (void)test;

    @end

    @implementation Tester

    - (id)init
    {
        if((self = [super init]))
        {
            _queue = [[NSOperationQueue alloc] init];
            [_queue setMaxConcurrentOperationCount:1];
        }
        return self;
    }

    - (void)test
    {
        NSInvocationOperation *op = [[NSInvocationOperation alloc]
    initWithTarget:self selector:_cmd object:nil];
        [_queue addOperation:op];
        [op release];
    }

    @end

    int main(int argc, char **argv)
    {
        [NSAutoreleasePool new];

        NSMutableArray *testers = [NSMutableArray array];
        int i;
        for(i = 0; i < 10; i++)
            [testers addObject:[[[Tester alloc] init] autorelease]];

        for(Tester *tester in testers)
            [tester test];

        while(1) sleep(1000);
    }

    Compile and run and wait. On my Mac Pro it throws that exception
    crashes inside of ten seconds with this backtrace:

    #0  0x96480ff4 in ___TERMINATING_DUE_TO_UNCAUGHT_EXCEPTION___ ()
    #1  0x9207ee3b in objc_exception_throw ()
    #2  0x92db74de in -[NSOperation start] ()
    #3  0x92db7112 in __runop ()
    #4  0x902ae1f7 in _pthread_wqthread ()
    #5  0x902ae0aa in start_wqthread ()

    I've had a few other people test as well. Most of them said that they
    experienced the same crash. Some said they saw no crash. This is not a
    big surprise for a threading bug but it's a bit odd.

    Anyway, I would really love for this to be my bug, because then I
    could fix it and stop having this problem. If anyone sees anything I'm
    doing wrong then please tell me what it is. Failing that, if anyone
    happens to know of a workaround for this problem, I would love to hear
    it. Otherwise it's off to write an NSOperationQueue replacement, a
    task I'm steeling myself for but not looking forward to.

    For the Apple types among us, I have filed this as rdar://6332143 .

    Any tips, tricks, corrections, or criticisms are most welcome!

    Mike
  • On Oct 30, 2008, at 10:40, Michael Ash wrote:

    > [_queue addOperation:op];

    Have you tried using performSelectorOnMainThread: to force serializing
    of the queuing? It might be an acceptable workaround, if it works.
  • On Thu, Oct 30, 2008 at 2:45 PM, Quincey Morris
    <quinceymorris...> wrote:
    > On Oct 30, 2008, at 10:40, Michael Ash wrote:
    >
    >> [_queue addOperation:op];
    >
    > Have you tried using performSelectorOnMainThread: to force serializing of
    > the queuing? It might be an acceptable workaround, if it works.

    I haven't, simply because this stuff is performance critical and
    having it get blocked because the main thread is busy processing user
    input is unacceptable. So alas, while that may get around whatever bug
    I'm encountering, it's not a usable workaround for me.
  • Looks like you're hosing memory because the subsequent invocation
    operations are being released prior to their completion as seen in
    this call stack:

    #0    0x00001a0c in -[MyOp dealloc] at NSOp-Test.m:27
    #1    0x94fba20f in NSPopAutoreleasePool
    #2    0x9504f3a8 in -[NSOperation start]
    #3    0x00001ad6 in -[MyOp start] at NSOp-Test.m:39

    As a guess the NSOperationQueue places the operation in an
    autoreleasepool, not really retaining it itself. The nearest pool
    happens to be constructed in NSOperation -start. Upon exit boom!

    By reorganizing the problem you should be able to avoid this apparent
    bug. As I understand it you have a task that can be broken down into a
    serializable set of operations. Decompose the task into a Queue with N
    +1 operations. Each of the N operations are configured with their
    chunk of the data to process, the +1 is simply an operation whose
    dependent upon the complete set of N and serves to trigger a
    notification that the task is complete. If an ordering is required
    configure the dependency appropriately.

    To support canceling tell the queue to -cancelAllOperations.

    Jamie

    On Oct 30, 2008, at 11:51 AM, Michael Ash wrote:

    > On Thu, Oct 30, 2008 at 2:45 PM, Quincey Morris
    > <quinceymorris...> wrote:
    >> On Oct 30, 2008, at 10:40, Michael Ash wrote:
    >>
    >>> [_queue addOperation:op];
    >>
    >> Have you tried using performSelectorOnMainThread: to force
    >> serializing of
    >> the queuing? It might be an acceptable workaround, if it works.
    >
    > I haven't, simply because this stuff is performance critical and
    > having it get blocked because the main thread is busy processing user
    > input is unacceptable. So alas, while that may get around whatever bug
    > I'm encountering, it's not a usable workaround for me.
  • On Thu, Oct 30, 2008 at 6:04 PM, Jamie Johnson <jamiejj...> wrote:
    >
    > Looks like you're hosing memory because the subsequent invocation operations
    > are being released prior to their completion as seen in this call stack:
    >
    > #0      0x00001a0c in -[MyOp dealloc] at NSOp-Test.m:27
    > #1      0x94fba20f in NSPopAutoreleasePool
    > #2      0x9504f3a8 in -[NSOperation start]
    > #3      0x00001ad6 in -[MyOp start] at NSOp-Test.m:39
    >
    > As a guess the NSOperationQueue places the operation in an autoreleasepool,
    > not really retaining it itself. The nearest pool happens to be constructed
    > in NSOperation -start. Upon exit boom!

    It's a good theory. Alas, I don't think it's the case.

    To prove this, take my original test case and replace the -test method
    with this group of methods:

    - (void)test
    {
        [NSThread detachNewThreadSelector:@selector(_enqueueThread)
    toTarget:self withObject:nil];
    }

    - (void)_enqueueThread
    {
        while(1)
        {
            NSAutoreleasePool *pool = [NSAutoreleasePool new];
            NSInvocationOperation *op = [[NSInvocationOperation alloc]
    initWithTarget:self selector:@selector(_operationTarget) object:nil];
            [_queue addOperation:op];
            [op release];
            [pool release];
        }
    }

    - (void)_operationTarget
    {
    }

    This avoids the pitfall you mention but still (on my computer) throws
    the same exception as before.

    Based on the state of the program when it crashes, it appears that the
    problem is caused by a race condition which occasionally causes two of
    the worker threads that NSOperationQueue spawns to dequeue and execute
    the same NSOperation. Since an NSOperation is only supposed to run
    once, things fall down go boom. This is just a theory, mind, and I'm
    not sure of it yet.

    Mike
  • On Thu, Oct 30, 2008 at 5:30 PM, Michael Ash <michael.ash...> wrote:
    > Based on the state of the program when it crashes, it appears that the
    > problem is caused by a race condition which occasionally causes two of
    > the worker threads that NSOperationQueue spawns to dequeue and execute
    > the same NSOperation. Since an NSOperation is only supposed to run
    > once, things fall down go boom. This is just a theory, mind, and I'm
    > not sure of it yet.

    I'm not sure it would help, but it might. You could try using the
    dependency mechanism in addition to setting maxConcurrentOperations to
    1. Keep track of the last NSOperation in your for loop and assign in
    as you go.

    If that fixes it, it could be a race in maxConcurrentOperations.

    This is all highly speculative.
    -Colin
  • By overriding -start of NSInvocationOperation and wrapping the call to
    super in an try/catch block:

      [Switching to process 8541 local thread 0x2d03]
      Running…
      2008-10-30 16:13:08.446 NSOp-Test[8541:1703]
    NSInvalidArgumentException - *** -[MyOp start]: receiver has already
    started or finished, 0x113040
      2008-10-30 16:13:08.913 NSOp-Test[8541:1603]
    NSInvalidArgumentException - *** -[MyOp start]: receiver has already
    started or finished, 0x115e00
      2008-10-30 16:13:12.683 NSOp-Test[8541:3403]
    NSInvalidArgumentException - *** -[MyOp start]: receiver has already
    started or finished, 0x116d70
      2008-10-30 16:13:16.183 NSOp-Test[8541:2e03]
    NSInvalidArgumentException - *** -[MyOp start]: receiver has already
    started or finished, 0x110fe0
      objc[8541]: FREED(id): message start sent to freed object=0x115ca0
      [Switching to process 8541 thread 0x1403]
      Program received signal:  “EXC_BAD_INSTRUCTION”.
      [Switching to process 8541 thread 0x1403]
      kill
      quit

    There's no guarantee the diagnostic code I've added isn't introducing
    other problems, multiprocessing is a pita that way, but the above
    information leads me to conclude an operation can be freed in an
    unexpected manor. You're probably correct about a race condition. Also
    in this method I have tests to check isExecuting, isFinished, !
    isReady, and isCancelled and log accordingly; note you see none of
    that here but on occasion I have seen runs where these came back as
    true unexpectedly. Breaking on those cases it looked like memory was
    hosed.

    Unfortunately (or fortunately in this case) my project requires 10.4.
    But to support easy migration later I implement a compatible
    interfaces to NSOperation etal (only implemented what the project
    requires). Looks like you'll be doing similar or restructuring your
    solution.

    Please share what you find out.

    Jamie

    On Oct 30, 2008, at 5:30 PM, Michael Ash wrote:

    > On Thu, Oct 30, 2008 at 6:04 PM, Jamie Johnson <jamiejj...>
    > wrote:
    >>
    >> Looks like you're hosing memory because the subsequent invocation
    >> operations
    >> are being released prior to their completion as seen in this call
    >> stack:
    >>
    >> #0      0x00001a0c in -[MyOp dealloc] at NSOp-Test.m:27
    >> #1      0x94fba20f in NSPopAutoreleasePool
    >> #2      0x9504f3a8 in -[NSOperation start]
    >> #3      0x00001ad6 in -[MyOp start] at NSOp-Test.m:39
    >>
    >> As a guess the NSOperationQueue places the operation in an
    >> autoreleasepool,
    >> not really retaining it itself. The nearest pool happens to be
    >> constructed
    >> in NSOperation -start. Upon exit boom!
    >
    > It's a good theory. Alas, I don't think it's the case.
    >
    > To prove this, take my original test case and replace the -test method
    > with this group of methods:
    >
    > - (void)test
    > {
    > [NSThread detachNewThreadSelector:@selector(_enqueueThread)
    > toTarget:self withObject:nil];
    > }
    >
    > - (void)_enqueueThread
    > {
    > while(1)
    > {
    > NSAutoreleasePool *pool = [NSAutoreleasePool new];
    > NSInvocationOperation *op = [[NSInvocationOperation alloc]
    > initWithTarget:self selector:@selector(_operationTarget) object:nil];
    > [_queue addOperation:op];
    > [op release];
    > [pool release];
    > }
    > }
    >
    > - (void)_operationTarget
    > {
    > }
    >
    > This avoids the pitfall you mention but still (on my computer) throws
    > the same exception as before.
    >
    > Based on the state of the program when it crashes, it appears that the
    > problem is caused by a race condition which occasionally causes two of
    > the worker threads that NSOperationQueue spawns to dequeue and execute
    > the same NSOperation. Since an NSOperation is only supposed to run
    > once, things fall down go boom. This is just a theory, mind, and I'm
    > not sure of it yet.
    >
    > Mike
  • On Thu, Oct 30, 2008 at 10:16 PM, Colin Barrett <dogcow...> wrote:
    > On Thu, Oct 30, 2008 at 5:30 PM, Michael Ash <michael.ash...> wrote:
    >> Based on the state of the program when it crashes, it appears that the
    >> problem is caused by a race condition which occasionally causes two of
    >> the worker threads that NSOperationQueue spawns to dequeue and execute
    >> the same NSOperation. Since an NSOperation is only supposed to run
    >> once, things fall down go boom. This is just a theory, mind, and I'm
    >> not sure of it yet.
    >
    > I'm not sure it would help, but it might. You could try using the
    > dependency mechanism in addition to setting maxConcurrentOperations to
    > 1. Keep track of the last NSOperation in your for loop and assign in
    > as you go.
    >
    > If that fixes it, it could be a race in maxConcurrentOperations.
    >
    > This is all highly speculative.

    Thanks for the suggestion. I tried it out and it still crashes with
    that same exception. I even tried it without setting the
    maxConcurrentOperations and it still crashed. When I tried it in the
    case where the operations get enqueued in a separate thread instead of
    from the operation method, then it still crashed but took a long time
    (5-10 minutes, didn't count exactly) to do so. So it would seem that
    this helps somehow but not quite enough.

    This last case is particularly scary to me. I'm not doing anything
    that NSOperationQueue isn't explicitly set up for. I'm just posting
    NSOperations to queues that have been created and are sitting in
    memory. I'm not fiddling with the maxConcurrentOperations, I'm not
    doing anything weird like enqueueing a new operation from the middle
    of an existing one, I'm just spawning some threads which post
    operations to queues and let them run. And it *still* crashes.

    I hate to make this sort of sweeping pronouncement because it seems
    like a great way to end up looking like a fool, but at this point I
    can only conclude that NSOperationQueue is fundamentally broken and
    should not be used until this problem is fixed. Of course I would love
    to be wrong.

    Mike
  • [Reposted by request on this thread for the archives]

    At 13:40 -0400 30/10/08, Michael Ash wrote:
    > I hate to blame an OS bug but I see no other explanation, so here we go.
    >
    > I have a program which uses NSOperationQueue heavily. It uses lots of
    > different queues each of which has a max concurrent operation count
    > set to 1. In this way, the NSOperationQueue functions as a
    > serialization mechanism, like a lock but easier to use. This is
    > extremely handy.
    >
    > One unusual thing it does, which I think may be the thing that kills
    > it, is that it often enqueues a new operation from inside of an
    > existing one being run on the same queue. This is to provide more
    > granularity for cancellation or priorities. For example, when there is
    > a lot of data to process, the NSOperation I enqueue just processes the
    > first piece of it. Then when it's done processing, it enqueues a
    > second NSOperation which will continue the processing. This way if I
    > want to cancel processing or do some higher-priority operation, it
    > doesn't have to wait for everything.
    >
    > Anyway, I very rarely get this exception:
    >
    > *** -[NSInvocationOperation start]: receiver has already started or finished
    >
    > And this then crashes the app, because it's happening on an internal
    > NSOperationQueue thread which doesn't have an exception handler. The
    > rarity made this really difficult to debug, but I finally twigged to
    > the problem and wrote a test case which reproduces the exception
    > easily... on some hardware. This is that test case:

    I looked through this a bit further and am sadly coming to the
    conclusion that you're right, under some (quite easy to duplicate)
    circumstances, the NSInvocationOperation appears to be executed by
    two worker threads and chaos ensues.

    As near as I can tell, it appears safe if (and perhaps only if)
    addOperation only happens on the main thread (or perhaps only happens
    on a single thread).

    As soon as addOperation happens on multiple threads, even if the
    queue is set to  setMaxConcurrentOperationCount = 1, and even if no
    queue ever executes more than a single NSInvocationOperation, then
    the problem appears.

    I adjusted Michael's test such that a new Tester object (with new
    NSOperationQueue) is created each time so only a single operation is
    ever executed for any given NSOperationQueue and it still results in
    [NSInvocationOperation start] being called twice for the same
    operation.

    The only saving grace is that
    performSelectorOnMainThread:@selector(addOperation:), or only calling
    the addOperations from the main thread appears to resolves the
    problem, which means my recent use of NSOperation in soon to be
    shipping code should be safe, but I certainly would not be
    recommending NSOperationQueue be used as it stands and would not have
    used if I knew what I know now.

    Note that I am not at all convinced that the
    performSelectorOnMainThread guarantees the problem cannot happen,
    simply that it appears to resolve the problem.  I would not recommend
    using NSOperationQueue for any purpose until this is resolved.

    Very sad.
        Peter.

    --
                  Keyboard Maestro 3 Now Available!
                    Now With Status Menu triggers!

    Keyboard Maestro <http://www.keyboardmaestro.com/> Macros for your Mac
    <>          <<A href="http://download.stairways.com/">http://download.stairways.com/>
previous month october 2008 next month
MTWTFSS
    1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31    
Go to today