Core Data performance advice... creating relationships.

  • Hi. I'm converting a large database over to Core Data and I'm running
    into some performance problems. I've read the performance part of the
    Core Data docs but I'm still not sure what to do to speed up my code.
    My basic problem is as follows:

    I have created all my entities and am in the process of creating the
    relationships. I have two entities, 'Foo' and 'Bar', each with around
    400,000 entries. They reference each other using a common 'ID'
    integer. I have created a one-to-many relationship from Foo to Bar
    (rel), along with the corresponding inverse.

    'Foo" <------>> 'Bar'

    My algorithm for creating these relationships is to fetch every entry
    in 'Foo', and enumerate through the resulting array building a fetch
    request for 'Bar' in the form of 'All entities in Bar where ID == x'.
    Then when I get that result, I set 'Foo.rel' to the NSArray returned
    by that fetch request.
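
    In code, the loop looks roughly like this (simplified; 'moc' is my
    managed object context and 'allFoos' is the array from the first
    fetch; I convert the array to a set, since 'rel' is to-many):

    NSFetchRequest * barRequest = [[NSFetchRequest alloc] init];
    [barRequest setEntity:[NSEntityDescription entityForName:@"Bar"
                                      inManagedObjectContext:moc]];
    for (NSManagedObject * foo in allFoos) {
        // One fetch per Foo: all Bars whose ID matches this Foo's ID.
        [barRequest setPredicate:[NSPredicate predicateWithFormat:
            @"ID == %@", [foo valueForKey:@"ID"]]];
        NSError * error = nil;
        NSArray * bars = [moc executeFetchRequest:barRequest error:&error];
        [foo setValue:[NSSet setWithArray:bars] forKey:@"rel"];
    }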

    This technique has been working OK for smaller data sets, but now that
    I am linking two very large tables together I am seeing terrible
    performance - my code is only creating around 20 relationships per
    second on my new MacBook Pro.

    Rather than blindly stumble around trying to find some performance
    enhancements I thought I would ask the good people on this list for
    some advice. Can anyone point me in the right direction?

    PS - I'm using an SQLite store, which when populated with these two
    tables is around 100MB on disk. I'm also a bit of a Core Data newbie
    so go easy on me 8)....

    Thank you.
  • On Jan 14, 2008 1:43 PM, Martin Linklater <mslinklater...> wrote:
    > This technique has been working OK for smaller data sets, but now that
    > I am linking two very large tables together I am seeing terrible
    > performance - my code is only creating around 20 relationships per
    > second on my new MacBook Pro.

    I would start with Shark/Instruments to see where your time is going,
    then optimize.

    One thought: how often are you calling save?  If you save after every
    transaction, I can see your performance going through the floor.
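
    For example, batching the save might look something like this (just a
    sketch; 'moc' and 'allFoos' are made-up names for your context and
    your fetched Foo objects):

    NSUInteger i = 0;
    for (NSManagedObject * foo in allFoos) {
        // ... establish foo's relationships here ...
        if (++i % 1000 == 0) {          // save once per thousand objects,
            NSError * error = nil;      // not once per relationship
            if (![moc save:&error])
                NSLog(@"save failed: %@", error);
        }
    }
    [moc save:NULL];                    // pick up the final partial batch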
  • > My algorithm for creating these relationships is to fetch every entry
    > in 'Foo', and enumerate through the resulting array building a fetch
    > request for 'Bar' in the form of 'All entities in Bar where ID == x'.
    > Then when I get that result, I set 'Foo.rel' to the NSArray returned
    > by that fetch request.

      If I understand your post correctly, it looks like you're trying to
    force Core Data into your RDB world and it's not intended for that.
    Core Data is *not* a relational database. It's an object graph
    management and persistence framework. If this is the case, you're
    making things way harder than they actually are. I suggest going
    through the Core Data Programming Guide, but in a nutshell:

      If you want to get all the Bars for Foo:

    id foo = //... assume you have a reference to one instance of foo
    NSSet * fooBars = [foo valueForKey:@"bars"];

      ... now you have a set of bars - only the bars that belong to foo
    (via its to-many "bars" relationship key).

      If you want to associate a bunch of bars to a Foo instance:

    id foo = // ... again, assume an instance of Foo
    NSArray * bars = //... assume an array of Bar instances
    [[foo mutableSetValueForKey:@"bars"] addObjectsFromArray:bars];

      ... now foo's bars contain what was in the "bars" array. If you ask
    each individual Bar instance in that array what its "foo" is, they'll
    all tell you it's your "foo" instance (the one to which you added
    them).

      Coming from the other direction, if you set an individual bar's foo
    relationship (since it's to-one):

    id foo = // ... same as before
    id bar = // ... some bar instance

    [bar setValue:foo forKey:@"foo"];

      ... now if you ask "bar" for its "foo", you get the foo you
    specified above. Bonus: If you ask the bar's foo who all its bars are,
    you'll get a set including the bar you specified above.

      I hope this is clear and not fubar'd. ;-)

    --
    I.S.
  • > If I understand your post correctly

      You know what? Scratch that. It's clear I didn't understand. :-D ...
    and I had *just* sent that message to Erik about having the guts to be
    wrong. ;-) I'm just leaving work, too, so I can't even blame alcohol
    yet. (sigh)

      Scott's Shark (or Instruments) suggestion is a good one. In addition
    to his "how often are you saving" question:

    - Are you using garbage collection? (see its notes about creating a
    bunch of short-lived objects at once)
    - If you're not using garbage collection, are you at least creating
    an autorelease pool and draining it every thousand objects or so, as
    in the sketch below? (there was a recent post on this list regarding
    that very approach with Core Data)
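
      Something like this, for the non-GC case (a sketch; 'records' stands
    in for your raw import data):

    NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
    NSUInteger count = 0;
    for (id record in records) {
        // ... create and relate managed objects here ...
        if (++count % 1000 == 0) {
            [pool drain];   // release the batch's temporary objects
            pool = [[NSAutoreleasePool alloc] init];
        }
    }
    [pool drain];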

      Hopefully this post is more helpful than my last. ;-)

    --
    I.S.
  • I am using GC, yes. I'll take a look at the docs about short-lived
    objects.

    I've run my code through 'Instruments' and I'm not sure what to make
    of the results. I'm seeing all my fetch requests but it's not telling
    me what is taking the time. I need to work with it more and see what I
    can find.

    Thanks.

  • Using 'Instruments' my CPU usage breaks down pretty much like this:

    60% NSManagedObject setValue:forKey:
    20% NSManagedObjectContext executeFetchRequest:
    16% NSManagedObjectContext save:
    4% (misc)

    So it is setting the relationships themselves which seems to take most
    of the time.


  • > Using 'Instruments' my CPU usage breaks down pretty much like this:
    >
    > 60% NSManagedObject setValue:forKey:
    > 20% NSManagedObjectContext executeFetchRequest:
    > 16% NSManagedObjectContext save:
    > 4% (misc)
    >
    > So it is setting the relationships themselves which seems to take
    > most of the time.
    >

    Of course, setting a relationship loads the entire graph for that
    entity (e.g., foo.bar = newBar would load all of newBar's foos, just
    to update the set called "foos"). You could:

    - Forget relationships and use fetched properties if the set is really
    large;
    - Increase the stalenessInterval (-[NSManagedObjectContext
    setStalenessInterval:], see the sketch below), avoiding some faults
    but increasing RAM usage;
    - Ask for more help, because I'm out of ideas.
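
    For the staleness interval, it's one line (a sketch; 'moc' is your
    NSManagedObjectContext):

    [moc setStalenessInterval:300.0];   // e.g. five minutes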

    :: marcelo.alves
  • Martin,

    There is a Core Data Instruments template (i.e. not CPU Sampler).  If
    you use that, does it tell you there is a lot of file system access
    during faulting?  This template is a high-level overview of stuff
    Core Data thinks is kinda expensive, like disk I/O.  Of course,
    Instruments lets you use both sets of instruments simultaneously.

    You can also use SQL logging by passing command line arguments:
    -com.apple.CoreData.SQLDebug 1

    (and if you're running within Terminal)
    -com.apple.CoreData.SyntaxColoredLogging 1

    In your last message, you note that nearly 60% of the time is spent
    in setValue:forKey:.  What is the "Heaviest Stack Trace" that
    Instruments has in the extended detail view for that?
    --

    -Ben
  • On Jan 14, 2008, at 5:12 PM, I. Savant wrote:

    >> If I understand your post correctly
    >
    > You know what? Scratch that. It's clear I didn't understand. :-D ...
    > and I had *just* sent that message to Erik about having the guts to be
    > wrong. ;-) I'm just leaving work, too, so I can't even blame alcohol
    > yet. (sigh)

    Could you elaborate on why you changed your position?  In reading your
    initial reply, I was inclined to agree that Core Data probably wasn't
    the best solution in this instance (i.e. it's a decent-sized data set
    and what was being described appeared to be a simple table join
    scenario with no clear OO requirements.)  What did you see in the
    original post that caused you to believe that, in fact, this really
    was a Core Data usage scenario?  Understanding your thinking might
    help people like myself who are still struggling with where to draw
    the line on using Core Data vs. a relational approach.

    Thanks,
    Phil
  • On Tuesday, January 15, 2008, at 04:22AM, "Ben Trumbull" <trumbull...> wrote:

    > In your last message, you note that nearly 60% of the time is spent
    > in setValue:forKey:.  What is the "Heaviest Stack Trace" that
    > Instruments has in the extended detail view for that?

    As I drill down the stack, about 15 layers of Core Data nibble away nearly half of that 60%; it then enters libsqlite3.0.dylib, which accounts for over half of that 60%.

    I have uploaded a screen grab of Instruments into my public iDisk - username 'mslinklater', filename 'Trace.jpg'.

  • On Tuesday, January 15, 2008, at 09:15AM, "Phil" <pbpublist...> wrote:

    > Could you elaborate on why you changed your position?  In reading your
    > initial reply, I was inclined to agree that Core Data probably wasn't
    > the best solution in this instance (i.e. it's a decent-sized data set
    > and what was being described appeared to be a simple table join
    > scenario with no clear OO requirements.)  What did you see in the
    > original post that caused you to believe that, in fact, this really
    > was a Core Data usage scenario?  Understanding your thinking might
    > help people like myself who are still struggling with where to draw
    > the line on using Core Data vs. a relational approach.

    I can't see why Core Data would not be applicable in this instance. My code basically parses a load of SQL table create commands and recreates the data within Core Data. Since Core Data supports an SQLite backing store I don't see why data size or amount of relational information would factor... My plan is also to leverage Bindings and Core Data to simplify my UI code. Core Data's ability to deal with complex fetch predicates is also a win for me since I can let Core Data do a lot of the heavy lifting within my dataset.

    In fact the only downside I can see of using Core Data over a 'proper' relational database is that Core Data fetch requests return the entire entity rather than specific entity attributes, but since I'm only dealing with a single-user application and relatively small objects I can't see this minor performance problem being an issue at all. Once my backing store is created I'm only really going to be accessing it in a read-only manner. My plan is to store all dynamically created user data in a separate object model, so as to keep my data relationships as simple as possible.

    Core Data seems like the perfect fit for me. Plus it's a good learning exercise 8).

    I must add the caveat that I'm in no regard a 'Database Programmer', so my opinions may be pretty naive about this...

    Thanks.
  • On Jan 14, 2008, at 1:43 PM, Martin Linklater wrote:

    > I have created all my entities and am in the process of creating
    > the relationships. I have two entities, each with around 400,000
    > entries. 'Foo' and 'Bar' are the two entities. They reference
    > each other using a common 'ID' integer.

    What is this "common 'ID' integer" -- is it a critical part of the
    model for your data, or is it something that you just thought you
    should put in due to your experience with other frameworks?

    If it's something you can possibly avoid having in your data model, do
    so.  Core Data's relationship management handles things like object
    IDs for you.  Maintaining your own parallel IDs is just duplicating
    work, in a way that's almost guaranteed to be sub-optimal.

    > I have created a one-to-many relationship from Foo to Bar (rel),
    > along with the corresponding inverse.
    >
    > 'Foo' <------>> 'Bar'
    >
    > My algorithm for creating these relationships is to fetch every
    > entry in 'Foo', and enumerate through the resulting array building a
    > fetch request for 'Bar' in the form of 'All entities in Bar where ID
    > == x'. Then when I get that result, I set 'Foo.rel' to the NSArray
    > returned by that fetch request.

    That fetch request has to perform a full table scan for the instances
    (not "entities") of Bar whose ID property equals x, because unless
    you've told it to do so in your data model (and you're running
    Leopard), it won't know to create an index on that property.

    Furthermore, that table scan has to be against both data cached in
    memory *and* against the SQLite database on disk, in case any other
    users of that SQLite database (in the same process or in a different
    one) changed data that would match the query.

    My first instinct is that, instead of importing your data by first
    creating all instances and then establishing all relationships between
    them, you should establish relationships as you create instances.  If
    you need to improve the performance of that, it's sometimes possible
    to do so by keeping a cache of the instances you've already created so
    you can relate things to them without constantly issuing fetch
    requests that scan the database.

    As an example -- and these numbers depend on your data -- imagine
    keeping references to the last 100 instances your import process
    created in a dictionary, keyed by the ID you mentioned above.  As you
    create a Bar instance, you can relate it instantly if its
    corresponding Foo is in the cache; if not, you can pull its Foo into
    the cache, or (if it doesn't exist) create a placeholder Foo that will
    be populated with real data at the appropriate point in the import.
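
    A sketch of that idea (simplified; 'moc', 'barRecords', and
    'fetchFooWithID' are invented names, I'm assuming the Bar-to-Foo
    inverse is called "foo", and this caches every Foo rather than only
    the last 100 -- bounding the dictionary is left as an exercise):

    NSMutableDictionary * fooCache = [NSMutableDictionary dictionary];
    for (NSDictionary * record in barRecords) {
        NSNumber * fooID = [record objectForKey:@"ID"];
        NSManagedObject * foo = [fooCache objectForKey:fooID];
        if (foo == nil) {
            foo = fetchFooWithID(moc, fooID);  // one fetch on a cache miss...
            if (foo == nil) {                  // ...or a placeholder Foo
                foo = [NSEntityDescription
                          insertNewObjectForEntityForName:@"Foo"
                                   inManagedObjectContext:moc];
                [foo setValue:fooID forKey:@"ID"];
            }
            [fooCache setObject:foo forKey:fooID];
        }
        NSManagedObject * bar = [NSEntityDescription
                                    insertNewObjectForEntityForName:@"Bar"
                                             inManagedObjectContext:moc];
        [bar setValue:fooID forKey:@"ID"];
        [bar setValue:foo forKey:@"foo"];  // setting the to-one side keeps
                                           // the inverse up to date
    }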

    Can you explain what your data model is in slightly more concrete
    terms than you have so far?  I think that'll ultimately help clarify a
    lot.

      -- Chris
  • On Tuesday, January 15, 2008, at 09:56AM, "Chris Hanson" <cmh...> wrote:
    > On Jan 14, 2008, at 1:43 PM, Martin Linklater wrote:

    > What is this "common 'ID' integer" -- is it a critical part of the
    > model for your data, or is it something that you just thought you
    > should put in due to your experience with other frameworks?

    The data model is not mine. It is the SQL data dump for the game 'Eve Online'. Details can be found here:

    http://games.chruker.dk/eve_online/datadump.php

    The ID numbers are simply how the data dump defines its cross-entity linkage. I have not created them at all - I'm just using the data I'm provided with. I'm creating the relationships to replace having to explicitly fetch based on ID matches. I'm going to have to have a good think about whether I can eliminate the ID numbers from the import though - the data dependencies are pretty complex.

    > If it's something you can possibly avoid having in your data model, do
    > so.  Core Data's relationship management handles things like object
    > IDs for you.  Maintaining your own parallel IDs is just duplicating
    > work, in a way that's almost guaranteed to be sub-optimal.

    Understood - I will be looking at getting rid of redundancies soon.

    >
    > That fetch request has to perform a full table scan for the instances
    > (not "entities") of Bar whose ID property equals x, because unless
    > you've told it to do so in your data model (and you're running
    > Leopard), it won't know to create an index on that property.

    BINGO. I set the ID attributes to be indexed and it's going MUCH faster now. Don't I feel like an idiot. Is this 'indexed' flag documented at all? I've had a good look at the docs and it doesn't stand out to me... but I'm probably missing something.

    > Can you explain what your data model is in slightly more concrete
    > terms than you have so far?  I think that'll ultimately help clarify a
    > lot.

    See the above hyperlink.

    Thanks for your suggestions Chris. Much appreciated !

    Cheers.
  • On Jan 15, 2008, at 4:37 AM, Martin Linklater wrote:

    >
    > On Tuesday, January 15, 2008, at 09:15AM, "Phil" <pbpublist...>
    >> wrote:
    >
    >> Could you elaborate on why you changed your position?  In reading
    >> your
    >> initial reply, I was inclined to agree that Core Data probably wasn't
    >> the best solution in this instance (i.e. it's a decent-sized data set
    >> and what was being described appeared to be a simple table join
    >> scenario with no clear OO requirements.)  What did you see in the
    >> original post that caused you to believe that, in fact, this really
    >> was a Core Data usage scenario?  Understanding your thinking might
    >> help people like myself who are still struggling with where to draw
    >> the line on using Core Data vs. a relational approach.
    >
    > I can't see why Core Data would not be applicable in this instance.
    > My code basically parses a load of SQL table create commands and
    > recreates the data within Core Data. Since Core Data supports an
    > SQLite backing store I don't see why data size or amount of
    > relational information would factor... My plan is also to leverage
    > Bindings and Core Data to simplify my UI code. Core Data's ability
    > to deal with complex fetch predicates is also a win for me since I
    > can let Core Data do a lot of the heavy lifting within my dataset.

    You appear to be describing a traditional SQL database application
    (your latest post with the schema seems to confirm it.)  The first
    half of what you wrote has me thinking 'probably not a great
    application for Core Data.'  The second half re: UI makes sense if you
    understand the overhead you are going to incur and that the benefits
    provided are worth it.  Core Data is not just an OO wrapper to a
    relational database.  It uses SQLite as a persistent object store but
    don't be fooled into thinking that this is the same thing.  If your
    application naturally translates to an OO environment (i.e. if you
    were actually using an existing SQL database to persist objects which
    in reality have some associated behavior), then by all means go for
    it.  If not, you may be disappointed (i.e. you think you're having
    problems with a simple 2 table join now?  Just wait until you start
    trying to do multi-table joins with complex predicates) as you're
    piling on considerable overhead for what may end up being a small
    amount of benefit.

    > In fact the only downside I can see of using Core Data over a
    > 'proper' relational database is that Core Data fetch requests return
    > the entire entity rather than specific entity attributes, but since
    > I'm only dealing with a single user application and relatively small
    > objects I can't see this minor performance problem being an issue at
    > all. Once my backing store is created I'm only really going to be
    > accessing it in a read-only manner. My plan is to store all
    > dynamically created user data in a separate object model, so as to
    > keep my data relationships as simple as possible.

    There is much more to it than that as you are throwing a lot of work
    'over the wall' to Core Data which has a non-trivial cost in terms of
    performance and memory.  If you are used to SQL with other relational
    databases, you give up a whole lot of control and may find that you
    need to alter your approach (i.e. it is not the spoon that bends, but
    rather you) to certain problems because you can't necessarily tell
    Core Data how to act on your requests or get it to see the most
    optimal approach.  So in the end you have to trust it and, like many
    SQL engines from not too many years ago, it's not very bright.  You
    can use Core Data to provide much of the same functionality, but it
    will not begin to approach the performance of a relational database
    for traditional SQL workloads (even single-user with moderate sized
    data sets) in my experience.

    > Core Data seems like the perfect fit for me. Plus it's a good
    > learning exercise 8).

    Having said all this, if it's just something for your own use and you
    also want to learn Core Data... that makes sense ;-)

    > I must add the caveat that I'm in no regard a 'Database Programmer',
    > so my opinions may be pretty naive about this...

    Similar caveat on my end: I have quite a bit of experience in the
    relational database world but am still learning Core Data.  (hence my
    original question... so I'd appreciate enlightenment on where I'm off-
    base in this reply.)

    Thanks,
    Phil
  • > Could you elaborate on why you changed your position?  In reading
    > your initial reply, I was inclined to agree that Core Data probably
    > wasn't the best solution in this instance (i.e. it's a decent-sized
    > data set and what was being described appeared to be a simple table
    > join scenario with no clear OO requirements.)

      Sure. Originally, I thought the OP was trying, in the data modeler,
    to add a property called "ID" (or something like it) and not use Core
    Data's relationships mechanism. I realized that wasn't the case after
    I'd sent my response.

      The schema he described:  Foo <------>> Bar  ... is perfectly
    suited to Core Data. The number of instances he mentioned, as
    discussed many, many times before on this list, is acceptable as well.

      It may be that there's simply no way to make this initial import of
    800,000 objects go any faster, but there are a few effective tuning
    tricks he could try first.

      One may be garbage collection (creating that many short-lived
    objects in a tight loop is ill-advised, for example). As another
    poster suggested, it could have to do with how often he calls "save"
    on the context.

      I firmly believe that, once imported, this particular store will
    run very quickly (with regards to fetching, relationship traversal,
    etc.); Core Data is pretty quick there. It's just that building the
    object graph is pretty intensive.

    --
    I.S.
  • On Jan 15, 2008, at 6:46 AM, Phil wrote:

    > Similar caveat on my end: I have quite a bit of experience in the
    > relational database world but am still learning Core Data.  (hence
    > my original question... so I'd appreciate enlightenment on where I'm
    > off-base in this reply.)

      As in my last e-mail, there is positively nothing wrong with a) the
    schema he proposed and, b) the number of instances.

      The problem (as I'm gathering from the details that have emerged in
    this thread) is his approach: He's creating everything first, then
    establishing the relationships.

    --
    I.S.
  • On 1/15/08 2:31 AM, Martin Linklater said:

    >> That fetch request has to perform a full table scan for the instances
    >> (not "entities") of Bar whose ID property equals x, because unless
    >> you've told it to do so in your data model (and you're running
    >> Leopard), it won't know to create an index on that property.
    >
    > BINGO. I set the ID attributes to be indexed and it's going MUCH faster
    > now. Don't I feel like an idiot. Is this 'indexed' flag documented at
    > all? I've had a good look at the docs and it doesn't stand out to me...
    > but I'm probably missing something.

    I think the 'indexed' flag is new in 10.5, and the docs are lagging.
    The "Core Data Programming Guide" contains the word "indexed" once, and
    it's got nothing to do with it.

    <http://developer.apple.com/documentation/Cocoa/Conceptual/CoreData/CoreData.pdf>

    Anyone know of any docs on "indexed"?

    --
    ____________________________________________________________
    Sean McBride, B. Eng                <sean...>
    Rogue Research                        www.rogue-research.com
    Mac Software Developer              Montréal, Québec, Canada
  • On Jan 15, 2008, at 7:14 AM, I. Savant wrote:

    > On Jan 15, 2008, at 6:46 AM, Phil wrote:
    >
    >> Similar caveat on my end: I have quite a bit of experience in the
    >> relational database world but am still learning Core Data.  (hence
    >> my original question... so I'd appreciate enlightenment on where
    >> I'm off-base in this reply.)
    >
    > As in my last e-mail, there is positively nothing wrong with a) the
    > schema he proposed and, b) the number of instances.
    >
    > The problem (as I'm gathering from the details that have emerged in
    > this thread) is his approach: He's creating everything first, then
    > establishing the relationships.

    I appreciate your followup messages.  I had tried a schema of similar
    complexity (granted, it was a larger data set) and had fairly
    unfavorable results under Tiger (re: memory usage and performance.)
    It sounds like it might be worth another look under Leopard.

    Thanks,
    Phil
  • > I appreciate your followup messages.  I had tried a schema of similar
    > complexity (granted, it was a larger data set) and had fairly
    > unfavorable results under Tiger (re: memory usage and performance.)
    > It sounds like it might be worth another look under Leopard.

      With a statement like that, you've *got* to provide more detail. :-D
    What was your schema? What was the size of the data set (approximate
    number of each entity's instances, if you know it)? What did you find
    to be slow/memory-intensive, specifically? How did you implement that
    which you found to be slow/memory-intensive?

      Inquiring minds want to know! I'm not saying "I don't believe it,
    prove it!" ... merely curious. I too have come across a number of
    inefficiencies but most of them had fairly simple work-arounds once
    the problem area was identified.

    --
    I.S.
  • > Inquiring minds want to know! I'm not saying "I don't believe it,
    > prove it!" ... merely curious. I too have come across a number of
    > inefficiencies but most of them had fairly simple work-arounds once
    > the problem area was identified.

      With a statement like *that*, I feel *I* have to be more specific.
    :-) By inefficiencies, I mean that most of them were design problems
    with my model or "angle of approach" to getting to the data I'm
    interested in. One was specifically due to a Core Data limitation
    (compound ANY/NOT ANY/NONE predicates with a SQLite store), but it
    turns out the feature that necessitated that particular predicate was
    a "gee-whiz" feature that wasn't really needed anyway, so dropping it
    entirely made that problem go away completely. ;-)

    --
    I.S.
  • Briefly for now...

    On Jan 15, 2008, at 3:46 AM, Phil wrote:
    > You appear to be describing a traditional SQL database application
    > (your latest post with the schema seems to confirm it.)  The first
    > half of what you wrote has me thinking 'probably not a great
    > application for Core Data.'  The second half re: UI makes sense if
    > you understand the overhead you are going to incur and that the
    > benefits provided are worth it.  Core Data is not just an OO wrapper
    > to a relational database.  It uses SQLite as a persistent object
    > store but don't be fooled into thinking that this is the same thing.
    >
    I don't believe he is, and I.S. has already made the point that it's
    an object-graph management and persistence framework.  If in your
    application you leverage that feature, it probably is a good fit.

    > There is much more to it than that as you are throwing a lot of work
    > 'over the wall' to Core Data which has a non-trivial cost in terms
    > of performance and memory.
    >
    It's not clear on what basis you make this assertion.

    > Similar caveat on my end: I have quite a bit of experience in the
    > relational database world but am still learning Core Data.  (hence
    > my original question... so I'd appreciate enlightenment on where I'm
    > off-base in this reply.)
    >
    See <http://developer.apple.com/documentation/Cocoa/Conceptual/CoreData/Articles/cdFAQ.html#//apple_ref/doc/uid/TP40001802-DontLinkElementID_27>

    mmalc
  • On Jan 15, 2008, at 11:22 AM, I. Savant wrote:

    >> I appreciate your followup messages.  I had tried a schema of similar
    >> complexity (granted, it was a larger data set) and had fairly
    >> unfavorable results under Tiger (re: memory usage and performance.)
    >> It sounds like it might be worth another look under Leopard.
    >
    > With a statement like that, you've *got* to provide more detail. :-D
    > What was your schema? What was the size of the data set (approximate
    > number of each entity's instances, if you know it)? What did you find
    > to be slow/memory-intensive, specifically? How did you implement that
    > which you found to be slow/memory-intensive?
    >
    > Inquiring minds want to know! I'm not saying "I don't believe it,
    > prove it!" ... merely curious. I too have come across a number of
    > inefficiencies but most of them had fairly simple work-arounds once
    > the problem area was identified.

    No problem providing more detail...

    When I was first looking at Core Data a couple of years ago, I
    attempted to migrate a moderate-sized data set (~300 Meg in a vanilla
    SQLite database over about two dozen tables) for an application that
    was very data-processing oriented (i.e. very little GUI, mostly an ETL
    job).  After loading everything and setting up relationships and
    predicate templates that were as selective as possible, it still
    appeared to be performing full table scans most, if not all, of the
    time, because Core Data apparently didn't see the value of creating
    indexes on the critically important columns.  (I confirmed this via
    the command-line SQLite interface.  I had read, and was told, that
    Core Data would take care of such things automatically.  I'm sure it
    did, but I have no idea what criteria it used, since it didn't seem to
    take any of the hints I was giving it... was it using relationships in
    the model?  Predicate templates?  Database stats?  Phase of the moon?)
    And I couldn't see a (supported) way to force it to create the indexes
    myself.

    For example, one of the key tables involved was ~10 million rows of
    daily stock quote data (i.e. ticker, date_id, open, high, low, close,
    etc.) which had relationships to the security (ticker, companyName,
    etc.) and date (date_id, date, yearMonth, etc.) tables.  For the life
    of me, I could never get Core Data to create the appropriate indexes,
    so when I was after all daily quotes where companyName='Apple
    Computer' and yearMonth='2007-11' (i.e. highly selective criteria) it
    would always want to fetch back 10 million managed objects even
    though I really only cared, at most, about a few thousand.  It seemed
    like the more selective my query got, the worse Core Data would
    perform.  I assumed that this was because it was deciding that it
    needed to fetch everything back and apply the filtering criteria
    itself, which makes sense if that is what it was actually doing.
    This assumption appeared to be supported by watching my app's memory
    usage shoot up to 500+ Meg at times, even though I was never working
    with more than a fraction of the data set at a time and was being
    quite liberal in my use of autorelease pools.  (This was mostly
    speculation on my part, because Core Data was fairly opaque about
    what was going on with the data store, so I was never really certain.)

    This was all quite some time ago and I didn't bother revisiting it for
    anything nearly that large, since there were other things about Core
    Data that turned me off to using it for anything that was primarily
    database-oriented in nature.  As some of these issues appear to be
    addressed in Leopard, it looks like it's worth taking at least another
    look and possibly expanding the cases where I should consider it as a
    solution.

    Thanks,
    Phil
  • Martin,

    Looking at the Instruments snapshot, one can see most of the time is
    spent maintaining the inverse relationship as that is triggering
    faulting of unloaded data.

    When you dirty a managed object for the first time, we need its
    current data to establish a snapshot.  The snapshot is used for
    several features including undo, merge policies & conflict detection,
    and the committedValuesForKeys/changedValues methods.  This includes
    the identities (but not the full data) of the object's current
    relationships.

    For to-many relationships, this can trigger additional faulting (in
    database terms, a fetch against the join table).

    You can address the problem by prefetching Foo's to-many
    relationships during your initial fetch.  If Bar has other to-many
    relationships, you'll want to prefetch those in the fetch request for
    "bar where ID = X"

    The Instruments set you used doesn't include all the Core Data
    instruments from the main template, specifically "Core Data Cache
    Misses", so Instruments isn't highlighting the specific problem as
    well as it could.

    If you add that instrument, you should see both the entity and
    relationships that are being faulted.
    --

    -Ben
    > Looking at the Instruments snapshot, one can see most of the time is
    > spent maintaining the inverse relationship as that is triggering
    > faulting of unloaded data.
    > [...]
    > If you add that instrument, you should see both the entity and
    > relationships that are being faulted.

    Thanks Ben. I'll take your comments on board and see if I can't speed
    things up.

    Cheers.
  • Phil,

    On Tiger, Core Data only creates indices for relationships (joins).
    On Leopard, in the modeling tool, you can also specify that an
    attribute should have a binary index.

    Binary indices can't be used for string matching predicates
    (contains, like, etc) or case/diacritic insensitivity.  Literal
    string equality and prefix searching can use binary indices.  As of
    Leopard, Core Data does not support custom collations.

    For queries that cannot use an index (or an index does not exist),
    Core Data relies on SQLite.  Core Data does not fetch everything into
    memory (instantiate objects) to perform a table scan.  Obviously,
    SQLite walks the table's entire b-tree for a table scan.

    Complex text queries (like case insensitive LIKE) can be very
    expensive as Core Data handles unicode and locale issues.  It can be
    helpful to denormalize some data to make it eligible for a literal
    string comparison.  Some cleverness with literal string operations
    such as = and < and >= upon a canonical literal string (i.e. a search
    column vs. a display column) can produce performance improvements of
    several orders of magnitude.
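
    For example, something like this, assuming you've denormalized into a
    hypothetical indexed 'searchName' attribute that holds a lowercased
    copy of the display string ('query' is the user's search string):

    // Canonicalize the query the same way the search column was built.
    NSString * prefix = [query lowercaseString];
    // Every string with this prefix falls in the half-open range
    // [prefix, prefix + U+FFFF), so literal >= and < can use the index.
    NSString * limit = [prefix stringByAppendingString:@"\uFFFF"];
    NSPredicate * p = [NSPredicate predicateWithFormat:
        @"searchName >= %@ AND searchName < %@", prefix, limit];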

    It can also be helpful to organize your predicate in order of maximum
    discrimination to reduce the number of rows SQLite examines.  That
    is, for compound where clauses, put the simplest or most
    discriminating subexpressions first, and expensive text expressions
    last.
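
    Using your quote-table example, that might mean writing the predicate
    as (a sketch; the values are just placeholders):

    // Cheap, highly selective date comparison first; expensive
    // case/diacritic-insensitive text match last.
    NSPredicate * p = [NSPredicate predicateWithFormat:
        @"yearMonth == %@ AND companyName CONTAINS[cd] %@",
        @"2007-11", @"Apple"];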

    Both SQLite and Core Data have made significant performance
    improvements since Tiger (3 years ago).  SQLite now has a (primitive)
    query optimizer.  Core Data fetches are much faster, large fetches
    are implicitly multi-threaded, and memory usage is significantly
    reduced.

    Also, the array controller has a "lazy fetching" option in entity
    mode, which is a bit like a cursor.  This only fetches the identity
    (PK) of the rows for the entire fetch request, and the actual data
    for the rows in use (displayed) by the array controller.

    Finally, the tools on Leopard should make debugging performance
    problems easier.  Instruments has a standard Core Data template, and
    the SQL logging includes annotations for durations.

    Please file bug reports with bugreport.apple.com early and often.
    Performance problems are bugs.  Deciding not to use a technology
    because it crashes or it's too slow amounts to the same thing.
    --

    -Ben
  • Ben,

    Thanks for your chock-full-of-information reply.  Several of the items
    listed in your message (and my not being aware of them) were likely
    contributing factors to my experiences, as I pretty much made my mind
    up about where Core Data fit in the time period between using the
    Tiger betas and the first 6 months or so of the Tiger release.
    Performance issues, and what I perceived as my inability to do much of
    anything to address them for lack of information about what Core Data
    was actually doing, were among the reasons I closed the book on it for
    many applications.  Progress has clearly been made and it's time for
    me to dive back in and re-evaluate.

    Is the information from your message (and hopefully more) available in
    a TechNote or some other public Apple document on optimization and
    performance tuning Core Data?  If not, would a bug report be the
    appropriate way to request that one be created?

    Finally, I noticed that the Core Data template is not available by
    default in Xcode via Run->Start with Performance Tool... is this by
    design or just an oversight?  It's a trivial matter to create a new
    template of my own which appears there and just wanted to know if I
    should file a bug report on this.

    Thanks,
    Phil
  • On Tuesday, January 15, 2008, at 10:51PM, "Ben Trumbull" <trumbull...> wrote:
    > You can address the problem by prefetching Foo's to-many
    > relationships during your initial fetch.  If Bar has other to-many
    > relationships, you'll want to prefetch those in the fetch request for
    > "bar where ID = X"
    >
    > The Instruments set you used doesn't include all the Core Data
    > instruments from the main template, specifically "Core Data Cache
    > Misses", so Instruments isn't high lighting the specific problem as
    > well as it could.
    >
    > If you add that instrument, you should see both the entity and
    > relationships that are being faulted.

    Hi Ben - I have done what you suggested. My original code was indeed firing faults for every iteration through my loop - as shown when I ran my application through Instruments with the CoreData template.

    I have now changed my code to use -[NSFetchRequest setRelationshipKeyPathsForPrefetching:] as detailed in the Core Data Performance documentation, and my code is no longer firing faults during the main loop. Excellent.

    But the problem I'm seeing is that enabling the prefetching has actually slowed my application down by around 10%. I have run a few tests with different subsets of my data, and they all show this performance effect. I can only surmise that the increased time needed for the initial fetch outweighs the overhead of all the smaller fault triggers.

    Does this surprise you at all? Have I simply hit the limit of Core Data performance for inserting relationships into a data set? Since the data I am inserting relationships into has only just been created, does this mean that the entire dataset is already in RAM, so the fault overheads were minimal to begin with?

    Thanks.