Wednesday, April 23, 2014

Parsing a PDF file in Excel

Most Linux distros come with a handy utility called pdftotext, but you can use it on a Windows machine as well. Using the browser of your choice, visit http://www.foolabs.com/xpdf/download.html and download the precompiled xpdfbin-win-3.03.zip for x86 Windows.

Different Windows versions and installs give you different default directories, so I'll tell you what I did.

1. The file you downloaded is a zip, so first unzip it. Then look in the subfolders. On my PC:

      C:\Users\Bruce\Documents\xpdfbin-win-3.03\xpdfbin-win-3.03\bin64

     If you are running a 32-bit OS, there is also a

      C:\Users\Bruce\Documents\xpdfbin-win-3.03\xpdfbin-win-3.03\bin32


2. Go to Start>Run and enter cmd. That puts me in C:\Users\Bruce. Then enter cd Documents. My DOS prompt now says

      C:\Users\Bruce\Documents

    That maps to the Documents folder on the Start menu. In an Explorer window, copy the pdftotext.exe file from the folder in step 1 to your Documents folder.

3. Put your PDF doc in the Documents folder. Now, from the DOS prompt, enter:

      pdftotext <filename>.pdf -layout

In the Explorer window, you should now see a file named <filename>.txt.

If that gives you the results you are looking for, then this Excel macro will probably make things easier. Just change the line:

        exe = "C:\Users\Bruce\Documents\pdftotext.exe"

to reflect where your exe ended up:
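What follows is a minimal sketch of a macro along those lines, not a definitive version. Only the exe line comes from the text above. The macro name ImportPdfAsText, the file picker, the explicit output file name, and the fixed-width import are all assumptions made for illustration.

```vba
Sub ImportPdfAsText()
    ' Hypothetical sketch: runs pdftotext on a chosen PDF and opens the result.
    Dim exe As String, txt As String, cmd As String
    Dim pdf As Variant, rc As Long

    ' Change this line to reflect where your exe ended up.
    exe = "C:\Users\Bruce\Documents\pdftotext.exe"

    ' Ask for the PDF; GetOpenFilename returns False if the user cancels.
    pdf = Application.GetOpenFilename("PDF files (*.pdf), *.pdf")
    If VarType(pdf) = vbBoolean Then Exit Sub

    ' Assumption: write the text file next to the PDF, same base name.
    txt = Left(pdf, InStrRev(pdf, ".") - 1) & ".txt"

    ' Quote every path so folders with spaces still work.
    cmd = """" & exe & """ -layout """ & pdf & """ """ & txt & """"

    ' Run hidden (0) and wait for pdftotext to finish (True).
    rc = CreateObject("WScript.Shell").Run(cmd, 0, True)
    If rc <> 0 Or Dir(txt) = "" Then
        MsgBox "pdftotext failed (exit code " & rc & ")"
        Exit Sub
    End If

    ' Open as fixed-width text so the -layout columns can land in
    ' separate cells; pass FieldInfo if the column breaks need tuning.
    Workbooks.OpenText Filename:=txt, DataType:=xlFixedWidth
End Sub
```

Passing the output file name explicitly means the macro does not rely on pdftotext's default naming, and waiting on the command ensures the text file is complete before Excel tries to open it.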

