TerrenceRyan.com

I'm a 35 year old redhead geek from Philly.
I'm currently a Developer Evangelist for Adobe.
Also the author of Driving Technical Change

Pulling Content out of Word with ColdFusion 9

I had a 1 + 1 = 2 moment the other day. I was fooling around with ColdFusion's ability to turn Word docs into PDFs. At first glance it's pretty simple and straightforward:

<cfset src = expandPath("./cf9.docx") />
<cfset result = expandPath("./cf9.pdf") />
<cfdocument filename="#result#" srcfile="#src#" format="pdf" />

Word to PDF is nice to have, but as features go, it's a pretty small bullet point. Don't get me wrong, you get fidelity to the original, including fonts, layouts, and images. But it's still just converting a Word document to a PDF.

That is, until you remember that ColdFusion 9 can now pull content out of PDFs. So now you can do this:

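<!--- Convert the Word document to a PDF, replacing any earlier output --->
<cfset src = expandPath("./cf9.docx") />
<cfset result = expandPath("./cf9.pdf") />
<cfdocument filename="#result#" srcfile="#src#" format="pdf" overwrite="true" />

<!--- Extract the PDF's text into the variable cfref --->
<cfpdf action="extracttext"
       source="#result#"
       name="cfref" />

<!--- cfref holds XML as a string; parse it so the dump shows the structure --->
<cfdump var="#XMLParse(cfref)#" />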

This yields the content of the original Word document. Now that's cool.
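
If you want the text itself rather than a dump, here is a minimal sketch of one way to flatten the result. It assumes the extracted XML keeps each page's text in a Page element (check your own dump to confirm); the variable names doc, pages, page, and text are made up for illustration.

```cfm
<!--- Parse the XML string that cfpdf stored in cfref --->
<cfset doc = XMLParse(cfref) />

<!--- Assumption: each page's text sits in a <Page> element; verify against the dump --->
<cfset pages = XMLSearch(doc, "//Page") />

<!--- Concatenate the text of every page, one page per line break --->
<cfset text = "" />
<cfloop array="#pages#" index="page">
    <cfset text = text & page.XmlText & chr(10) />
</cfloop>

<!--- Escape the text before writing it into an HTML page --->
<cfoutput><pre>#HTMLEditFormat(text)#</pre></cfoutput>
```

If the element names in your dump differ, only the XPath expression passed to XMLSearch should need to change.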

14 responses so far ↓

  • 1 Mingo Hagen

    while cool, I'd like to have <cfdocument action="extracttext">.
  • 2 Ben Nadel

    Awesome! I didn't even know ColdFusion could convert word documents into PDFs! Bitchy!
  • 3 John Farrar

    Now that has some serious potential. Will have to start dumping and see what pragmatic use this can have! :)
  • 4 Ben Spencer

    I haven't downloaded CF9 yet, but this example uses .docx as the document format. Can the same be done for the old .doc format, which wasn't XML based?
  • 5 Terrence Ryan

    Ben: I haven't tested that myself, but I'm pretty sure we've said it works. ;)
  • 6 Ben Spencer

    Thanks Terrence, I read up on it in the end. Yep, it should do .doc quite nicely (with whatever quirks are associated with OpenOffice).

    Use #1 for me: Produce thumbnails of word docs for document management application.
  • 7 Juan Escalada

    Dear Terrence,

    A client of mine is asking if a document could be uploaded so that the document's footnotes would be stripped and, together with the associated paragraph, be emailed to different people (as in, Footnote 1 and its paragraph go to Adam for check-up and Footnote 2 goes to Joe)...
    I could convince the client to use a PDF if that made things any easier... But I'd appreciate your insight to know whether this would be at all possible...

    Thanks in advance, Juan Escalada.
  • 8 Don Blaire

    One of our departments downloads a Word doc that they want to get the content from. This blog was just what I was looking for. Thanks.
  • 9 Tad

    Terrence - thanks for that. I'm trying to use CF9 to pull summary information out of DOCX files. HWPF doesn't do it, wondering if CF9 can do it based on DOCX support. Do you know the best way of going about such? Leery to do it by converting to pdf first and then using <CFPDF> to extract, as the summary info (createTime, etc) may have been mangled.
  • 10 Virginia Neal

    The new extracttext action is nice to have and I am able to use it to simply pull the text of a pdf document.

    However, I have a real need to keep the general format of the document as well. This would simply be things like centering, indenting/tabs, line breaks, etc.

    I had hoped that using useStructure="true" along with honourspaces="true" would return the basic format of the document, but that does not seem to be the case.

    Do you know if it is possible to maintain the basic formatting of a PDF document?

    Thanks

    BTW - the PDF documents that I am working with began as Word and were converted to PDF using your suggested approach (thank you).
  • 11 Cheyenne Throckmorton

    @Virginia

    I was running into a similar problem in the last week. I used an alternative solution that worked in my use case that may or may not help. I still converted the word documents to pdfs then I used the thumbnail capability to create jpg images at 100% resolution and then displayed those images to the end user which worked well for our use case.

    While I am looking forward to diving more into the capabilities here and with DDX, I think the overall problem we both encountered is that DDX is about the actual content of the PDF devoid of styling, similar to how HTML "is supposed to be". With HTML you add in a CSS file to style, and I believe there is a similar type of document that combines with the DDX to create your fully stylized PDF.

    That's obviously a lot more work to dig into and I only had the time to do the thumbnail solution for now, but hope that helps.

    @Terrence thanks for the blog and great tips that even got me started on solving my use case.
  • 12 Nick

    Just want to confirm that an OpenOffice install on the server is needed to convert Word documents to PDFs using ColdFusion 9.

    Please correct me if I'm wrong, but I just ran Terrence's code and received an error that an OpenOffice install was required.
  • 13 Ed

    Yes, you need OpenOffice installed, then add the directory in the admin settings under the 'Documents' section. I tried it w/o the software and it did make a "pdf" file, but with garbage in it, as if you were viewing a Word doc in TextPad.

    "When you use cfdocument to convert a document file, the tag first checks for an OpenOffice installation. When the OpenOffice installation is found, the tag processes the rich text conversion through the OpenOffice libraries."

    One odd thing to note is that if the file extension is '.doc' it returned garbage (even with openoffice) as the pdf output but with the same file renamed to ".docx" it worked.
  • 14 vakantiehuis

    Nice site! There are still few good sites to be found on this subject.
    Glad about your post!
    Unfortunately I can't create a bookmark to www.terrenceryan.com in Firefox. :( Do you know why that is?

    Regards, Barbara
