Optimizing Core Data for large time series

  • Hi All,

    I'm experimenting with Core Data for working with very large data
    sets of neurological recordings and I'm looking for any suggestions
    on what others think the best way to tackle the problems of database
    performance/data storage and retrieval might be for this type of system.

    The data model is very simple: Recording entity <-->> DataStream
    entities <-->> DataPoint entities. A recording can have up to 100s of
    datastreams, and the datapoints have just two parameters (voltage and
    timePoint) but can number into the billions of points.
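
    A rough sketch of that model as NSManagedObject subclasses -- only
    voltage and timePoint are attribute names taken from the description
    above; the relationship names (stream, points, recording) are
    assumptions for illustration:

      #import <CoreData/CoreData.h>

      // Sketch only: voltage and timePoint come from the post, everything
      // else (relationship names, NSNumber boxing) is assumed.
      @interface DataPoint : NSManagedObject
      @property (nonatomic, retain) NSNumber *voltage;       // sample value
      @property (nonatomic, retain) NSNumber *timePoint;     // sample time
      @property (nonatomic, retain) NSManagedObject *stream; // to-one back to DataStream
      @end

      @implementation DataPoint
      @dynamic voltage, timePoint, stream;
      @end

      @interface DataStream : NSManagedObject
      @property (nonatomic, retain) NSSet *points;               // to-many DataPoint
      @property (nonatomic, retain) NSManagedObject *recording;  // to-one Recording
      @end

      @implementation DataStream
      @dynamic points, recording;
      @end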

    I need to do three types of operations on this data:
    - import it by parsing the original raw data file and inserting it
    into my data model
    - display it (converting datapoints to 2D points in bezier paths)
    - analyze it (running algorithms that operate on each data point
    within a certain time range, or comparing streams against other streams)
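
    For the analysis case, a time-range fetch over that model would look
    something like the following sketch (moc, stream, t0, and t1 are
    placeholders, and the "stream" key is the assumed relationship name
    from the sketch above):

      // Fetch every DataPoint of one stream inside [t0, t1], sorted by time.
      NSFetchRequest *request = [[[NSFetchRequest alloc] init] autorelease];
      [request setEntity:[NSEntityDescription entityForName:@"DataPoint"
                                     inManagedObjectContext:moc]];
      [request setPredicate:[NSPredicate predicateWithFormat:
          @"stream == %@ AND timePoint >= %f AND timePoint <= %f",
          stream, t0, t1]];
      [request setSortDescriptors:[NSArray arrayWithObject:
          [[[NSSortDescriptor alloc] initWithKey:@"timePoint"
                                       ascending:YES] autorelease]]];

      NSError *error = nil;
      NSArray *points = [moc executeFetchRequest:request error:&error];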

    In my initial attempt I just did the simplest thing and kept them all
    in a single MOC and SQL store. This is ok for small recordings, but
    quickly became unwieldy for more realistic data sets (store read/
    writes become very slow). After looking at the Apple docs on CD
    performance and reading some of the posts here, I started coding the
    next version.

    These are the changes I am experimenting with today to try and speed
    things up:

    - Splitting each stream off to its own context and store

    - Converting the DataPoints into BLOBs (and removing them from the
    data model) and keeping them in binary files which are referenced by
    a new entity, DataChunk, which has parameters: fileURL, timeBegin,
    timeEnd, numPoints. This creates other issues because I might take a
    performance hit accessing individual time points for processing,
    especially for non-sequential groups of points, but opening a file
    and moving a file pointer should be faster than fetching (am I right
    on this?)
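
    For the DataChunk idea, the file-pointer access described in the second
    item would look roughly like this. The layout (two native doubles per
    point, written in time order) and the idea that a time can be turned
    into a point index are assumptions for the sketch:

      // Read pointCount points starting at firstPoint out of a chunk's
      // binary file, assuming each point is stored as (timePoint, voltage).
      static NSData *ReadPointRange(NSString *path,
                                    unsigned long long firstPoint,
                                    unsigned pointCount)
      {
          const unsigned bytesPerPoint = 2 * sizeof(double);
          NSFileHandle *fh = [NSFileHandle fileHandleForReadingAtPath:path];
          if (fh == nil)
              return nil;

          [fh seekToFileOffset:firstPoint * bytesPerPoint];  // move the file pointer
          NSData *raw = [fh readDataOfLength:pointCount * bytesPerPoint];
          [fh closeFile];
          return raw;
      }

    For scattered, non-sequential points the per-seek cost adds up, so
    reading a whole chunk once and indexing it in memory is probably
    cheaper than many small seeks.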

    So I'm curious if anyone has experimented with this type of setup for
    large data sets, if you have any opinion on the tack I'm taking, and
    if there are any bugaboos to watch out for that aren't covered in the
    CD performance docs.

    Also - is there a strong enough incentive to give up Core Data so
    that I can use a faster database than SQLite (possibly Valentina)?
    I'm loath to do this because of the tight integration CD has with
    Cocoa and IB, but I'm wondering if the scale of the data I'm using
    just requires a higher-performance database.

    I'm also looking for advice on graphical display of these types of
    time series data, e.g. fast loading into bezier paths, rapid scaling,
    averaging of data points for best display, scrolling animation, etc.,
    if anyone would like to share or point me to the right references.
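
    One common approach to the "averaging of data points for best display"
    part is per-pixel min/max reduction. A minimal sketch, assuming a plain
    C buffer of voltages, a fixed pointsPerPixel, and drawing inside a
    view's drawRect: method (none of which come from this thread):

      // One vertical segment per horizontal pixel, so the path never holds
      // more elements than the view is wide, however many raw samples exist.
      NSBezierPath *path = [NSBezierPath bezierPath];
      unsigned px, i;
      for (px = 0; px < pixelWidth; px++) {
          unsigned first = px * pointsPerPixel;
          double lo = voltages[first];
          double hi = voltages[first];
          for (i = 1; i < pointsPerPixel; i++) {
              double v = voltages[first + i];
              if (v < lo) lo = v;
              if (v > hi) hi = v;
          }
          [path moveToPoint:NSMakePoint(px, lo * yScale + yOffset)];
          [path lineToPoint:NSMakePoint(px, hi * yScale + yOffset)];
      }
      [path stroke];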

    Thanks for looking,
    Peter Passaro
  • On 8 May 07, at 13:37, Peter Passaro wrote:

    > Hi All,
    >
    > I'm experimenting with Core Data for working with very large data
    > sets of neurological recordings and I'm looking for any suggestions
    > on what others think the best way to tackle the problems of
    > database performance/data storage and retrieval might be for this
    > type of system.
    >
    > The data model is very simple, it is: Recording entity <-->>
    > DataStream entities <-->> DataPoint entities. The recording can
    > have up to 100s of datastreams, and the datapoints have just two
    > parameters  (voltage and timePoint) but can number into the
    > billions of points.
    >
    > I need to do three types of operations on this data:
    > - import it by parsing the original raw data file and inserting it
    > into my data model
    > - display it (converting datapoints to 2D points in bezier paths)
    > - analyze it (running algorithms that operate on each data point
    > within a certain time range, or comparing streams against other
    > streams)
    >
    > In my initial attempt I just did the simplest thing and kept them
    > all in a single MOC and SQL store. This is ok for small recordings,
    > but quickly became unwieldy for more realistic data sets (store
    > read/writes become very slow). After looking at the Apple docs on
    > CD performance and reading some of the posts here, I started coding
    > the next version

    As you can see from my various posts, I'm sceptical about CD
    performance, and I've been surprised by it several times myself. But
    one thing I'm now pretty sure of is that incremental changes are very
    fast: inserting one element (and saving the context) into a small
    store (one with only a few objects already written to disk) is as fast
    as inserting one element into a very big store (millions of objects).
    So I'm surprised when you say "store read/writes *become* very slow".
    Do you mean that it gets slower and slower as you insert data?
    I won't say CD is fast, but at least it is smooth and fast in this
    incremental sense.

    Remember that saving can be slow because of your hard disk, since Core
    Data waits for the complete buffer flush to disk before returning from
    a save (see various posts on this point, in particular Bill
    Bumgarner's).

    I hope you're not in a loop that inserts a DataPoint and then saves,
    because CD will then only go as fast as your disk :)
    Inserting in a loop and then saving once is of course the thing to do.
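
    Something along these lines, i.e. many inserts per save. The batch size,
    the autorelease pool handling, and the plain C buffers voltages/times
    are assumptions for the sketch:

      // Insert in a loop, save once per large batch instead of once per
      // DataPoint. moc, voltages, times, and sampleCount are assumed.
      NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
      NSError *error = nil;
      unsigned i;
      for (i = 0; i < sampleCount; i++) {
          NSManagedObject *point = [NSEntityDescription
              insertNewObjectForEntityForName:@"DataPoint"
                       inManagedObjectContext:moc];
          [point setValue:[NSNumber numberWithDouble:voltages[i]]
                   forKey:@"voltage"];
          [point setValue:[NSNumber numberWithDouble:times[i]]
                   forKey:@"timePoint"];

          if ((i + 1) % 50000 == 0) {        // one save per batch, not per point
              [moc save:&error];
              [pool release];                // drop objects accumulated so far
              pool = [[NSAutoreleasePool alloc] init];
          }
      }
      [moc save:&error];                     // pick up the final partial batch
      [pool release];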

  • On May 8, 2007, at 4:37 AM, Peter Passaro wrote:

    > - Converting the DataPoints into BLOBs (and removing them from the
    > data model) and keeping them in binary files which are referenced
    > by a new entity, DataChunk, which has parameters: fileURL,
    > timeBegin, timeEnd, numPoints. This creates other issues because I
    > might take a performance hit accessing individual time points for
    > processing, especially for non-sequential groups of points, but
    > opening a file and moving a file pointer should be faster than
    > fetching (am I right on this?)

    Another option you might consider is an entity which chunks together
    several samples in an NSData field, but which is still stored as part
    of the Core Data store. Each stream would contain one or more chunk
    entities. This should yield better performance without requiring you
    to roll your own infrastructure for managing BLOB data in separate
    files.
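
    Concretely, a sketch of that approach -- the entity name "DataChunk",
    the "samples" and "stream" attribute names, and packing the samples as
    raw doubles are assumptions; timeBegin/timeEnd/numPoints echo the
    earlier description:

      // Pack a run of samples into one NSData and store it on a chunk
      // entity that lives in the normal SQLite store, with its time bounds.
      NSMutableData *samples =
          [NSMutableData dataWithCapacity:pointCount * sizeof(double)];
      [samples appendBytes:voltages length:pointCount * sizeof(double)];

      NSManagedObject *chunk = [NSEntityDescription
          insertNewObjectForEntityForName:@"DataChunk"
                   inManagedObjectContext:moc];
      [chunk setValue:samples forKey:@"samples"];    // binary data attribute
      [chunk setValue:[NSNumber numberWithDouble:firstTime] forKey:@"timeBegin"];
      [chunk setValue:[NSNumber numberWithDouble:lastTime] forKey:@"timeEnd"];
      [chunk setValue:[NSNumber numberWithUnsignedInt:pointCount]
               forKey:@"numPoints"];
      [chunk setValue:stream forKey:@"stream"];      // to-one back to DataStream

    Fetching chunks on timeBegin/timeEnd and unpacking the NSData then gives
    ranged access without a row per point.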

    HTH,

    -- Kaelin
  • I wanted to report back after a little more experimentation. I have
    found the optimum BLOB size for my application to be about 1Mb - the
    import and data access speed is reasonable at this size. What is
    concerning me now is that the storage overhead seems to be pretty
    hefty for placing these BLOBs inside an SQL persistent store. For
    each 1Mb BLOB placed in the store, I am adding roughly 20Mb to the
    SQL file. Can anybody give me some direction as to what all that
    overhead is? Is this typical for NSData objects stored as BLOBs in
    SQL stores?

    Peter Passaro
    Please ignore the previous message in this thread.  It was my own
    stupidity in not resetting my NSData objects - they were growing
    exponentially on each save to the store.
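
    The shape of the bug, roughly -- the buffer, entity, and attribute
    names are assumptions, since the original code isn't shown here:

      // If the accumulation buffer is never emptied, every chunk written to
      // the store contains all samples seen so far, and the store balloons.
      [chunkBuffer appendBytes:newSamples length:newLength];
      if ([chunkBuffer length] >= targetChunkSize) {       // e.g. ~1 MB
          [chunk setValue:[NSData dataWithData:chunkBuffer] forKey:@"samples"];
          NSError *error = nil;
          [moc save:&error];
          [chunkBuffer setLength:0];  // the fix: reset before the next chunk
      }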

    I wrote:
    > I wanted to report back after a little more experimentation. I have
    > found the optimum BLOB size for my application to be about 1Mb -
    > the import and data access speed is reasonable at this size. What
    > is concerning me now is that the storage overhead seems to be
    > pretty hefty for placing these BLOBs inside an SQL persistent
    > store. For each 1Mb BLOB placed in the store, I am adding roughly
    > 20Mb to the SQL file. Can anybody give me some direction as to what
    > all that overhead is? Is this typical for NSData objects stored as
    > BLOBs in SQL stores?

    Peter Passaro