is there a way to extract text from pdfs?
-
is there any way to give your cocoa app the capability to extract text
from already existing pdfs? strip out all the pdf related/embedded info
and just get the human readable text out?
i'd really like to be able to give my app the capability to do this and
i haven't got a clue where to start.
any ideas much appreciated. thanks.
_______________________________________________
cocoa-dev mailing list | <cocoa-dev...>
Help/Unsubscribe/Archives: http://www.lists.apple.com/mailman/listinfo/cocoa-dev
Do not post admin requests to the list. They will be ignored. -
Ben,
Google does this.
IIRC they based their tool on a commonly available GPL command line
tool that does this. If you use google you can probably find it.
There are also similar tools for PostScript.
--
To purchase it is not like spending money
but rather it is an investment in the future
in a blow against the empire
On Thursday, March 13, 2003, at 02:09 PM, Ben Dougall wrote:
> is there any way to give your cocoa app the capability to extract text
> from already existing pdfs? strip out all the pdf related/embedded
> info and just get the human readable text out?
>
> i'd really like to be able to give my app the capability to do this
> and i haven't got a clue where to start.
>
> any ideas much appreciated. thanks.
-
On Thursday, March 13, 2003, at 02:09 PM, Ben Dougall wrote:
> is there any way to give your cocoa app the capability to extract text
> from already existing pdfs? strip out all the pdf related/embedded
> info and just get the human readable text out?
>
> i'd really like to be able to give my app the capability to do this
> and i haven't got a clue where to start.
Check out http://www.metaobject.com/
There's a product there called "TextLightning", which converts .pdf to
.rtf. I'd suggest talking to Marcel Weiher about what you're trying to
do.
-jcr
John C. Randolph <jcr...> (408) 974-8819
Sr. Cocoa Software Engineer,
Apple Worldwide Developer Relations
http://developer.apple.com/cocoa/index.html
-
>> is there any way to give your cocoa app the capability to extract text
>> from already existing pdfs? strip out all the pdf related/embedded
>> info and just get the human readable text out?
>>
>> i'd really like to be able to give my app the capability to do this
>> and i haven't got a clue where to start.
>
> Check out http://www.metaobject.com/
>
> There's a product there called "TextLightning", which converts .pdf to
> .rtf. I'd suggest talking to Marcel Weiher about what you're trying
> to do.
thanks very much for the pointer. will do.
-
i was aware that google did this, but not that they had based it on a
freely available tool. i'll try and find it. thanks.
On Friday, March 14, 2003, at 05:02 pm, Karl Kraft wrote:
> Ben,
>
> Google does this.
>
> IIRC they based their tool on a commonly available GPL command line
> tool that does this. If you use google you can probably find it.
>
> There are also similar tools for PostScript.
-
pdftotext is not too shabby:
http://www.sanface.com/pdfprint/pdftotext.html
included in:
http://www.foolabs.com/xpdf/download.html
Cheers,
PA.
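A minimal sketch of driving pdftotext from another program, for anyone who wants to try the suggestion above. It is written in Python for brevity; a Cocoa app could launch the same command with NSTask. It assumes pdftotext is installed and on PATH. The helper names build_pdftotext_command and extract_text are made up for this example, and the flags used (-enc, -layout, and "-" as the output file to write to stdout) should be checked against the pdftotext version you actually ship.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Return the argument list for a pdftotext run that writes to stdout.

    Hypothetical helper for this example; not part of xpdf itself.
    """
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    # "-" as the output file sends the extracted text to stdout.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") (the file name is only an example) would return the plain text, or raise if pdftotext is missing or exits with an error.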
-
On Thursday, March 13, 2003, at 11:09, Ben Dougall wrote:
> is there any way to give your cocoa app the capability to extract text
> from already existing pdfs? strip out all the pdf related/embedded
> info and just get the human readable text out?
Install TextLightning (http://www.metaobject.com/). It installs as a
filter service that automagically converts PDF to RTF for you. In your
code, you just have to use one of the "from RTF" methods in
NSAttributedString. When given a PDF file, the OS X services system
will invoke the filter service and hand you the converted RTF.
Incidentally, it's not really a matter of "stripping out" unneeded PDF
info; it is more a task of reconstructing a text flow from clues left
in the PDF.
Marcel (creator of TextLightning)
--
Marcel Weiher Metaobject Software Technologies
<marcel...> www.metaobject.com
Metaprogramming for the Graphic Arts. HOM, IDEAs, MetaAd etc.
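To illustrate the point above about reconstructing a text flow, here is a toy sketch in Python. It is not how TextLightning or any other real tool works. It assumes the PDF content has already been decoded into (x, y, text) fragments, which is itself the hard part, and it only handles a single column. The function name reconstruct_lines and the y_tolerance parameter are invented for this example.

```python
def reconstruct_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text, top to bottom.

    Fragments whose y differs from the first fragment of the current line
    by at most y_tolerance are treated as sitting on the same baseline.
    """
    lines = []  # each entry: [baseline_y, [(x, text), ...]]
    # Default PDF user space has y increasing upward, so the top of the
    # page comes first when sorting by descending y.
    for x, y, text in sorted(fragments, key=lambda f: (-f[1], f[0])):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])
    # Within a line, order fragments left to right and join with spaces.
    return [" ".join(t for _, t in sorted(parts)) for _, parts in lines]


if __name__ == "__main__":
    sample = [
        (150, 700, "world"),
        (72, 700.5, "Hello"),
        (72, 686, "Second"),
        (120, 686, "line"),
    ]
    for line in reconstruct_lines(sample):
        print(line)
```

Real documents also need word-spacing heuristics, column detection, hyphenation handling and font-encoding lookups, which is why this is a reconstruction problem rather than a stripping one.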


