
Pulling Content out of Word with ColdFusion 9

August 4, 2009 · 12 Comments

I had a 1 + 1 = 2 moment the other day. I was fooling around with ColdFusion's ability to turn Word docs into PDFs. At first glance it's pretty simple and straightforward:

<cfset src = expandPath("./cf9.docx") />
<cfset result = expandPath("./cf9.pdf") />
<cfdocument filename="#result#" srcfile="#src#" format="pdf" />

Word to PDF is nice to have, but as features go, it's a pretty small bullet point. Don't get me wrong, you get fidelity to the original, including fonts, layouts, and images. But it's still just converting a Word document to a PDF.

That is, until you remember that ColdFusion 9 can now pull content out of PDFs. So now you can do this:

<cfset src = expandPath("./cf9.docx") />
<cfset result = expandPath("./cf9.pdf") />
<cfdocument filename="#result#" srcfile="#src#" format="pdf" overwrite="true" />

<cfpdf action="extracttext"
       source="#result#"
       name="cfref" />

<cfdump var="#XMLParse(cfref)#" />

This will give you the content of the original Word document. Now that's cool.
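
If you would rather have plain text than XML, here is a minimal sketch of the same extraction. It assumes cf9.pdf already exists from the cfdocument call above, and that the type attribute of cfpdf's extracttext action accepts "string" as described in the ColdFusion 9 tag reference. The variable name plainText is just an illustrative choice, and I have not checked how well paragraph breaks survive:

```cfml
<!--- Sketch only: assumes ./cf9.pdf was already created by the cfdocument call above --->
<cfset result = expandPath("./cf9.pdf") />

<!--- type="string" asks for plain text rather than XML (assumption: per the CF9 cfpdf reference) --->
<cfpdf action="extracttext"
       source="#result#"
       type="string"
       name="plainText" />

<!--- plainText is a hypothetical variable name; escape it before writing it to the page --->
<cfoutput>#htmlEditFormat(plainText)#</cfoutput>
```

The XML version is probably the better starting point if you need to know which page a piece of text came from; the string version is handier for things like search indexing.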

Tags: ColdFusion

12 responses so far ↓

  • 1 Mingo Hagen // Aug 4, 2009 at 9:18 AM

    While cool, I'd like to have <cfdocument action="extracttext">.
  • 2 Ben Nadel // Aug 4, 2009 at 9:19 AM

    Awesome! I didn't even know ColdFusion could convert Word documents into PDFs! Bitchy!
  • 3 John Farrar // Aug 4, 2009 at 11:05 PM

    Now that has some serious potential. Will have to start dumping and see what pragmatic use this can have! :)
  • 4 Ben Spencer // Aug 11, 2009 at 1:57 AM

    I haven't downloaded CF9 yet, but this example uses .docx as the document format. Can the same be done for the old .doc format, which wasn't XML-based?
  • 5 Terrence Ryan // Aug 11, 2009 at 4:39 PM

    Ben: I haven't tested that myself, but I'm pretty sure we've said it works. ;)
  • 6 Ben Spencer // Aug 11, 2009 at 5:27 PM

    Thanks Terrence, I read up on it in the end. Yep, it should do .doc quite nicely (with whatever quirks are associated with OpenOffice).

    Use #1 for me: produce thumbnails of Word docs for a document management application.
  • 7 Juan Escalada // Sep 22, 2009 at 7:10 AM

    Dear Terrence,

    A client of mine is asking if a document could be uploaded so that the document's footnotes would be stripped and, together with the associated paragraph, be emailed to different people (as in Footnote 1 and its paragraph goes to Adam for check-up and Footnote 2 goes to Joe)...
    I could convince the client to use a PDF if that made things any easier... But I'd appreciate your insight to know whether this would be at all possible...

    Thanks in advance, Juan Escalada.
  • 8 Don Blaire // Apr 23, 2010 at 7:22 PM

    One of our departments downloads a Word doc that they want to get the content from. This blog was just what I was looking for. Thanks.
  • 9 Tad // May 17, 2010 at 4:02 PM

    Terrence - thanks for that. I'm trying to use CF9 to pull summary information out of DOCX files. HWPF doesn't do it, so I'm wondering if CF9 can do it based on its DOCX support. Do you know the best way of going about this? I'm leery of converting to PDF first and then using <CFPDF> to extract, as the summary info (createTime, etc.) may have been mangled.
  • 10 Virginia Neal // May 28, 2010 at 7:30 PM

    The new extracttext action is nice to have and I am able to use it to simply pull the text of a PDF document.

    However, I have a real need to keep the general format of the document as well. This would simply be things like centering, indenting/tabs, line breaks, etc.

    I had hoped that using useStructure="true" along with honourspaces="true" would return the basic format of the document, but that does not seem to be the case.

    Do you know if it is possible to maintain the basic formatting of a PDF document?

    Thanks

    BTW - the PDF documents that I am working with began as Word and were converted to PDF using your suggested approach (thank you).
  • 11 Cheyenne Throckmorton // Jun 14, 2010 at 4:14 PM

    @Virginia

    I was running into a similar problem in the last week. I used an alternative solution that worked in my use case and that may or may not help. I still converted the Word documents to PDFs, then used the thumbnail capability to create JPG images at 100% resolution, and displayed those images to the end user, which worked well for us.

    While I am looking forward to diving more into the capabilities here and with DDX, I think the overall problem we both encountered is that DDX is about the actual content of the PDF devoid of styling, similar to how HTML "is supposed to be". With HTML you add in a CSS file to style, and I believe there is a similar type of document that combines with the DDX to create your fully stylized PDF.

    That's obviously a lot more work to dig into and I only had the time to do the thumbnail solution for now, but I hope that helps.

    @Terrence thanks for the blog and great tips that even got me started on solving my use case.
  • 12 Nick // Aug 7, 2010 at 4:49 PM

    Just want to confirm that an OpenOffice install on the server is needed to convert Word documents to PDFs using ColdFusion 9.

    Please correct me if I'm wrong, but I just ran Terrence's code and received an error that an OpenOffice install was required.
