is there a way to extract text from pdfs?

  • is there any way to give your cocoa app the capability to extract text
    from already existing pdfs? strip out all the pdf related/embedded info
    and just get the human readable text out?

    i'd really like to be able to give my app the capability to do this and
    i haven't got a clue where to start.

    any ideas much appreciated. thanks.
    _______________________________________________
    cocoa-dev mailing list | <cocoa-dev...>
    Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/cocoa-dev
    Do not post admin requests to the list. They will be ignored.
  • Ben,

    Google does this.

    IIRC they based their tool on a commonly available GPL command line
    tool that does this.  If you use google you can probably find it.

    There are also similar tools for PostScript.

    --
    To purchase it is not like spending money
    but rather it is an investment in the future
    in a blow against the empire

  • On Thursday, March 13, 2003, at 02:09 PM, Ben Dougall wrote:

    > is there anyway to give your cocoa app the capability to extract text
    > from already existing pdfs? strip out all the pdf related/embedded
    > info and just get the human readable text out?
    >
    > i'd really like to be able to give my app the capability to do this
    > and i haven't got a clue where to start.

    Check out http://www.metaobject.com/

    There's a product there called "TextLightning", which converts .pdf to
    .rtf.  I'd suggest talking to Marcel Weiher about what you're trying to
    do.

    -jcr

    John C. Randolph    <jcr...>  (408) 974-8819
    Sr. Cocoa Software Engineer,
    Apple Worldwide Developer Relations
    http://developer.apple.com/cocoa/index.html
  • >> is there anyway to give your cocoa app the capability to extract text
    >> from already existing pdfs? strip out all the pdf related/embedded
    >> info and just get the human readable text out?
    >>
    >> i'd really like to be able to give my app the capability to do this
    >> and i haven't got a clue where to start.
    >
    > Check out http://www.metaobject.com/
    >
    > There's a product there called "TextLightning", which converts .pdf to
    > .rtf.  I'd suggest talking to Marcel Weiher about what you're trying
    > to do.

    thanks very much for the pointer. will do.
  • i was aware that google did this, but not that they had based it on a
    freely available tool. i'll try and find it. thanks.

    On Friday, March 14, 2003, at 05:02  pm, Karl Kraft wrote:

    > Ben,
    >
    > Google does this.
    >
    > IIRC they based their tool on a commonly available GPL command line
    > tool that does this.  If you use google you can probably find it.
    >
    > There are also similar tools for PostScript.
  • pdftotext is not too shabby:

    http://www.sanface.com/pdfprint/pdftotext.html

    included in:
    http://www.foolabs.com/xpdf/download.html

    Cheers,

    PA.
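
A minimal sketch of driving pdftotext from a program, assuming the xpdf
command line tool is installed and on the PATH. The function names
(pdftotext_command, extract_text) are made up for illustration. Passing "-"
as the output file asks pdftotext to write to standard output, "-layout"
keeps the physical page layout, and "-enc UTF-8" fixes the output encoding.
In a Cocoa app the same command could be launched with NSTask; Python is
used here only to keep the example short.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    # "-" as the text file name sends the extracted text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on the PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("manual.pdf") would return the text of a (hypothetical)
manual.pdf, or raise if the tool is missing or the conversion fails.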
  • On Thursday, March 13, 2003, at 11:09, Ben Dougall wrote:

    > is there anyway to give your cocoa app the capability to extract text
    > from already existing pdfs? strip out all the pdf related/embedded
    > info and just get the human readable text out?

    Install TextLightning (http://www.metaobject.com/). It installs as a
    filter service that automagically converts PDF to RTF for you.  In your
    code, you just have to use one of the "from RTF" methods in
    NSAttributedString.  When given a PDF file, the OS X services system
    will invoke the filter service and hand you the converted RTF.

    Incidentally, it's not really a matter of "stripping out" unneeded PDF
    info, it is more a task of reconstructing a text-flow from clues left
    in the PDF.

    Marcel  (creator of TextLightning)
    --
    Marcel Weiher                Metaobject Software Technologies
    <marcel...>        www.metaobject.com
    Metaprogramming for the Graphic Arts.  HOM, IDEAs, MetaAd etc.
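
To illustrate the point about reconstructing a text flow, here is a toy
sketch. It is not how TextLightning or pdftotext work internally. It assumes
the PDF has already been parsed into positioned fragments of the form
(x, y, text), with y growing downward, and that fragments whose y values lie
within a small tolerance belong to the same line. The names
reconstruct_lines and y_tolerance are invented for this example.

```python
from typing import List, Tuple

# (x, y, text): position of a text fragment on the page, y growing downward.
Fragment = Tuple[float, float, str]


def reconstruct_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Group positioned fragments into lines and order each line left to right."""
    lines = []  # each entry: (y of the first fragment in the line, [fragments])
    # Visit fragments top to bottom so that lines come out in reading order.
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        y = frag[1]
        if lines and abs(y - lines[-1][0]) <= y_tolerance:
            # Close enough vertically: treat it as part of the current line.
            lines[-1][1].append(frag)
        else:
            lines.append((y, [frag]))
    # Within a line, order fragments by x and join them with single spaces.
    return [
        " ".join(text for _, _, text in sorted(frags, key=lambda f: f[0]))
        for _, frags in lines
    ]
```

Real documents need much more than this: multiple columns, rotated text,
hyphenation and fragments that split a single word all break the simple
"same y means same line" assumption.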