Does initWithHTML:dataUsingEncoding:documentAttributes: run an event loop?

  • Dear all,

    I am wondering:
    By any chance, could a call to [[NSAttributedString alloc] initWithHTML:dataUsingEncoding:documentAttributes:] lead to the event loop being run before the call returns?

    Jean
    -----------
    Jean Suisse
    Institut de Chimie Moléculaire de l’Université de Bourgogne
    (ICMUB) — UMR 6302
  • On 7 May 2013, at 23:37, Jean Suisse <jean.lists...> wrote:

    > Dear all,
    >
    > I am wondering:
    > By any chance, could a call to [[NSAttributedString alloc] initWithHTML:dataUsingEncoding:documentAttributes:] lead to the event loop being run before the call returns?

    I believe so, yes. It's currently implemented using WebKit, which would generally require spinning the runloop while any asynchronous work is performed.
  • Thank you for this quick response.
    I suspected so. Unfortunately,
    I just spent five hours straight tracking down a random bug – not even remotely related to strings – that seemed to occur when a single thread triggered, in quick succession, two events handled by a callback tied to an input source on the event loop (it took me some time to work that out). Of course the callback isn't thread safe… and was never designed to be, since it is tied to the run loop. Until now...

    Jean

    On 8 May 2013, at 00:40, Mike Abdullah wrote:

    >
    > On 7 May 2013, at 23:37, Jean Suisse wrote:
    >
    >> Dear all,
    >>
    >> I am wondering:
    >> By any chance, could a call to [[NSAttributedString alloc] initWithHTML:dataUsingEncoding:documentAttributes:] lead to the event loop being run before the call returns?
    >
    > I believe so, yes. It's currently implemented using WebKit, which would generally require spinning the runloop while any asynchronous work is performed.
    >

  • On May 7, 2013, at 5:37 PM, Jean Suisse wrote:

    > By any chance, could a call to [[NSAttributedString alloc] initWithHTML:dataUsingEncoding:documentAttributes:] lead to the event loop being run before the call returns?

    Yes, it can.  Under the hood, NSAttributedString is using WebKit for HTML rendering.  In part that means that, when invoked from a background thread, it has to shunt the work to the main thread.  But it also means the main thread may have to run the run loop during the call.  It's a nuisance, but it's necessary since HTML can have references to external resources that need to be loaded.

    Consider the other method, -[NSAttributedString initWithHTML:options:documentAttributes:] and its options dictionary, which can have keys like NSTimeoutDocumentOption and NSWebResourceLoadDelegateDocumentOption.  Those imply a pretty involved process under the hood.  I'm fairly certain that this method is used for the implementation of the method you're using.

    You may be able to use those options to minimize the use of the run loop, but probably not eliminate it.

    Regards,
    Ken
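
As a rough illustration of Ken's suggestion, the sketch below passes an options dictionary with a short NSTimeoutDocumentOption to -initWithHTML:options:documentAttributes:. The helper name AttributedStringFromHTML is an assumption made for this example and does not come from the thread, and whether a shorter timeout actually reduces run-loop activity has not been verified here. It assumes ARC and a call made on the main thread.

```objc
#import <AppKit/AppKit.h>

// Hypothetical helper (not from the thread): converts HTML to an attributed
// string while bounding how long the loader may wait for resources.
// Call it on the main thread; the run loop may still be serviced inside.
static NSAttributedString *AttributedStringFromHTML(NSString *html,
                                                    NSTimeInterval timeout)
{
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    NSDictionary *options = @{
        // Tell the importer how the bytes are encoded.
        NSCharacterEncodingDocumentOption : @(NSUTF8StringEncoding),
        // Seconds to wait for the document to finish loading.
        NSTimeoutDocumentOption           : @(timeout),
    };
    return [[NSAttributedString alloc] initWithHTML:data
                                            options:options
                                 documentAttributes:NULL];
}
```

A call such as AttributedStringFromHTML(@"<p>Hello <b>world</b></p>", 2.0) would be typical; the two-second value is arbitrary. As Ken notes, this may shorten the time spent in the nested run loop, but it probably does not eliminate it.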
  • Thanks for this suggestion. Actually, once identified, I fixed the bug fairly quickly by postponing the task (dispatched a block on the main thread for that).
    The bug was caused by two successive events arriving within a short time frame, so that the stack looked like this:

    0 my callback
    1 __CFSocketPerformV0
    ------------------------------------
    9 initWithHTML:dataUsingEncoding:documentAttributes:
    10 my function 1
    11 my function 2
    12 my callback
    13 __CFSocketPerformV0
    -----------------------------------
    23 NSApplicationMain
    24 main
    25 Start

    and my callback was never designed to be called a second time before the first call had finished. Dispatching a block to handle my function 1 fixed the issue.
    However, information about initWithHTML:dataUsingEncoding:documentAttributes: running an event loop could be worth mentioning in the docs…

    Jean
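
The fix Jean describes can be sketched roughly as follows: the run-loop callback only enqueues the slow work on the main queue and returns, so the HTML conversion no longer runs inside the callback's own stack frame. The names DeferEventWork and EventWork are invented for this illustration; Jean's actual code is not shown in the thread.

```objc
#import <Foundation/Foundation.h>

// Hypothetical block type for the slow work ("my function 1" in the trace).
typedef void (^EventWork)(NSData *payload);

// Hypothetical helper: call this from the run-loop callback instead of doing
// the work inline. The callback returns at once; the work runs later on the
// main queue, after the callback has left the stack.
static void DeferEventWork(NSData *payload, EventWork work)
{
    NSData *snapshot = [payload copy];   // the caller may reuse its buffer
    dispatch_async(dispatch_get_main_queue(), ^{
        work(snapshot);
    });
}
```

If the callback fires again while the deferred work is inside the HTML initializer, it only enqueues a second block. The main queue is serial, so that second block should not start until the first one has returned, although this relies on libdispatch behavior rather than on anything documented for the HTML initializer.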

  • On May 7, 2013, at 3:52 PM, Ken Thomases <ken...> wrote:

    > Yes, it can.  Under the hood, NSAttributedString is using WebKit for HTML rendering.  In part that means that, when invoked from a background thread, it has to shunt the work to the main thread.  But it also means the main thread may have to run the run loop during the call.  It's a nuisance, but it's necessary since HTML can have references to external resources that need to be loaded.

    I’ve had trouble with this method in the past, for exactly that reason — you can get weird reentrancy problems from runloop sources like timers being invoked while in the middle of the call. (It’s also pretty slow.)

    IMHO it’s best to avoid this method if you can. For example, the last time this came up all I needed was the plain text, so I wrote a little string transformer to strip out HTML tags and expand HTML entities. For more involved work you could use NSXMLDocument (with the “tidy” option) to parse the HTML into a DOM and then walk through that.

    —Jens
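
For readers who only need plain text, a string transformer of the kind Jens mentions might look roughly like the sketch below. It is not Jens's code: the function name PlainTextFromHTML and the short entity table are assumptions for this example. It is deliberately naive: it ignores comments, script and style bodies, numeric entities, and a '>' inside attribute values, and it maps &nbsp; to an ordinary space.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: drops everything between '<' and '>' and expands a
// handful of named entities. It never touches WebKit or the run loop.
static NSString *PlainTextFromHTML(NSString *html)
{
    NSMutableString *out = [NSMutableString stringWithCapacity:html.length];
    BOOL inTag = NO;
    for (NSUInteger i = 0; i < html.length; i++) {
        unichar c = [html characterAtIndex:i];
        if (c == '<') {
            inTag = YES;                    // start skipping a tag
        } else if (c == '>' && inTag) {
            inTag = NO;                     // tag finished
        } else if (!inTag) {
            [out appendFormat:@"%C", c];    // ordinary text
        }
    }
    // "&amp;" is expanded last so that "&amp;lt;" yields "&lt;", not "<".
    NSArray<NSArray<NSString *> *> *entities = @[
        @[@"&lt;", @"<"], @[@"&gt;", @">"], @[@"&quot;", @"\""],
        @[@"&nbsp;", @" "], @[@"&amp;", @"&"],
    ];
    for (NSArray<NSString *> *pair in entities) {
        [out replaceOccurrencesOfString:pair[0]
                             withString:pair[1]
                                options:NSLiteralSearch
                                  range:NSMakeRange(0, out.length)];
    }
    return out;
}
```

Because this is pure string processing, it can run on any thread and cannot re-enter run-loop callbacks.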
  • On May 8, 2013, at 3:14 PM, Jens Alfke <jens...> wrote:
    > On May 7, 2013, at 3:52 PM, Ken Thomases <ken...> wrote:
    >
    >> Yes, it can.  Under the hood, NSAttributedString is using WebKit for HTML rendering.  In part that means that, when invoked from a background thread, it has to shunt the work to the main thread.  But it also means the main thread may have to run the run loop during the call.  It's a nuisance, but it's necessary since HTML can have references to external resources that need to be loaded.
    >
    > I’ve had trouble with this method in the past, for exactly that reason — you can get weird reentrancy problems from runloop sources like timers being invoked while in the middle of the call. (It’s also pretty slow.)
    >
    > IMHO it’s best to avoid this method if you can. For example, the last time this came up all I needed was the plain text, so I wrote a little string transformer to strip out HTML tags and expand HTML entities. For more involved work you could use NSXMLDocument (with the “tidy” option) to parse the HTML into a DOM and then walk through that.

    Yup. I had edge-case crashes too (fortunately reproducible once I knew the right edge case), and spent hours tracking it down to reentrancy problems in initWithHTML. Fortunately I could count on getting well-formed XML, and like Jens all I needed was to extract plain text, so I changed my solution to use NSXMLDocument and the crash went away.

    Here's the link to the NSAttributedString class reference. I just now went to the bottom of the page and submitted feedback requesting that the docs warn about this pitfall. Perhaps others could do the same, on the chance that it will help some future programmer track down the problem a little sooner.

    https://developer.apple.com/library/mac/#documentation/Cocoa/Reference/ApplicationKit/Classes/NSAttributedString_AppKitAdditions/Reference/Reference.html


    I referenced this email thread in my comments:

    http://lists.apple.com/archives/cocoa-dev/2013/May/msg00117.html

    --Andy
  • On May 8, 2013, at 6:25 PM, Andy Lee <aglee...> wrote:

    > Yup. I had edge-case crashes too (fortunately reproducible once I knew the right edge case), and spent hours tracking it down to reentrancy problems in initWithHTML. Fortunately I could count on getting well-formed XML, and like Jens all I needed was to extract plain text, so I changed my solution to use NSXMLDocument and the crash went away.

    You actually don’t need well-formed X[H]TML to use NSXMLDocument. One of the option flags to the -init method tells it to run the ‘htmltidy’ preprocessor over the input, which will correct even the gnarliest hand-written tag-soup HTML into something the XML parser can handle. It’s extremely useful for handling random web content.

    —Jens
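
The NSXMLDocument route Jens describes could be sketched as below, using the NSXMLDocumentTidyHTML option with -initWithData:options:error:. The function name PlainTextViaXMLDocument is an assumption for this example. Note that the next reply in the thread reports crashes when the input is sufficiently non-XML, so this should be treated as a starting point rather than a hardened parser.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: lets NSXMLDocument tidy the HTML into XML, then
// returns the concatenated text of the whole tree. That text also includes
// the contents of elements such as <title>, so real code would probably
// walk the tree and pick the nodes it wants.
static NSString *PlainTextViaXMLDocument(NSData *htmlData, NSError **error)
{
    NSXMLDocument *doc =
        [[NSXMLDocument alloc] initWithData:htmlData
                                    options:NSXMLDocumentTidyHTML
                                      error:error];
    if (doc == nil) {
        return nil;                          // parsing failed; see *error
    }
    return [[doc rootElement] stringValue];
}
```

Passing the UTF-8 bytes of a fragment such as "<p>Hello <b>world</p>" (note the unclosed tag) is the kind of input the tidy option is meant to repair.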
  • On 5/9/13 6:26 PM, Jens Alfke wrote:
    >> Yup. I had edge-case crashes too (fortunately reproducible once I knew the
    >> right edge case), and spent hours tracking it down to reentrancy problems
    >> in initWithHTML. Fortunately I could count on getting well-formed XML, and
    >> like Jens all I needed was to extract plain text, so I changed my solution
    >> to use NSXMLDocument and the crash went away.
    >
    > You actually don’t need well-formed X[H]TML to use NSXMLDocument. One of the
    > option flags to the -init method tells it to run the ‘htmltidy’ preprocessor
    > over the input, which will correct even the gnarliest hand-written tag-soup
    > HTML into something the XML parser can handle. It’s extremely useful for
    > handling random web content.

    Well, that's not entirely true, unfortunately. Although the documentation
    suggests you can, NSXMLDocument -init.... will crash if the content you're
    trying to feed it is sufficiently non-XML (say an ASCII text file).

    We get this all the time and it's a major pain.

    Regards
    Markus
    --
    __________________________________________
    Markus Spoettl
  • On May 9, 2013, at 1:53 PM, Markus Spoettl <ms_lists...> wrote:

    > On 5/9/13 6:26 PM, Jens Alfke wrote:
    >>> Yup. I had edge-case crashes too (fortunately reproducible once I knew the
    >>> right edge case), and spent hours tracking it down to reentrancy problems
    >>> in initWithHTML. Fortunately I could count on getting well-formed XML, and
    >>> like Jens all I needed was to extract plain text, so I changed my solution
    >>> to use NSXMLDocument and the crash went away.
    >>
    >> You actually don’t need well-formed X[H]TML to use NSXMLDocument. One of the
    >> option flags to the -init method tells it to run the ‘htmltidy’ preprocessor
    >> over the input, which will correct even the gnarliest hand-written tag-soup
    >> HTML into something the XML parser can handle. It’s extremely useful for
    >> handling random web content.

    Good to know.

    > Well, that's not entirely true, unfortunately. Although the documentation suggests you can, NSXMLDocument -init.... will crash if the content you're trying to feed it is sufficiently non-XML (say an ASCII text file).
    >
    > We get this all the time and it's a major pain.

    And good to know.

    Thanks!

    --Andy
  • On May 9, 2013, at 10:53 AM, Markus Spoettl <ms_lists...> wrote:

    > Well, that's not entirely true, unfortunately. Although the documentation suggests you can, NSXMLDocument -init.... will crash if the content you're trying to feed it is sufficiently non-XML (say an ASCII text file).

    Well, that’s bad, especially since one of the purposes of tidy is to make it safe to read untrusted XML/HTML input.
    Have you reported this to Radar?

    —Jens
  • I'd strongly recommend two great tools from the DTCoreText GitHub project: DTHTMLWriter and DTHTMLReader. They are designed to work with HTML documents and turn them into XML or the like (the author uses them for NSAttributedStrings).

    I've been using this project very heavily and it works extremely well. For example I use it to convert HTML to ENML for Evernote. It works with loads of poorly formatted HTML.

  • Thanks for your replies.
    Unfortunately, I can't easily avoid initWithHTML:dataUsingEncoding:documentAttributes:
    But I can postpone it long enough to move its execution to another thread (serial dispatch queue).
    That solves the issue.

    Jean

  • On May 15, 2013, at 8:58 AM, Jean Suisse wrote:

    > Thanks for your replies.
    > Unfortunately, I can't easily avoid initWithHTML:dataUsingEncoding:documentAttributes:
    > But I can postpone it long enough to move its execution to another thread (serial dispatch queue).
    > That solves the issue.

    I doubt it will.  If that method is invoked from a background thread, it will shunt the work to the main thread anyway.  In other words, it always does its work on the main thread.

    Regards,
    Ken
  • Right. I dispatch the block on the main thread to solve the reentrancy issues as stated below. The serial queue is involved elsewhere.

    On 8 May 2013, at 01:06, Jean Suisse wrote:

    > Thanks for this suggestion. Actually, once identified, I fixed the bug fairly quickly by postponing the task (dispatched a block on the main thread for that).
    > The bug was caused by two successive events arriving within a short time frame, so that the stack looked like this:
    >
    > 0 my callback
    > 1 __CFSocketPerformV0
    > ------------------------------------
    > 9 initWithHTML:dataUsingEncoding:documentAttributes:
    > 10 my function 1
    > 11 my function 2
    > 12 my callback
    > 13 __CFSocketPerformV0
    > -----------------------------------
    > 23 NSApplicationMain
    > 24 main
    > 25 Start
    >
    > and my callback was never designed to be called a second time before the first call had finished. Dispatching a block to handle my function 1 fixed the issue.
    > However, information about initWithHTML:dataUsingEncoding:documentAttributes: running an event loop could be worth mentioning in the docs…
    >
    > Jean