We're going to look at how to extract text from PDFs. It's not foolproof, but it's simple and it works most of the time.

1. Get the tools

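Assuming that you're on Ubuntu Linux:

sudo apt-get install --yes \
  pdftk \
  poppler-utils \
  ghostscript

(poppler-utils provides pdftotext and pdfimages; ghostscript provides the gs command used in step 2.)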

Or if you're on OS X (macOS):

brew install \
  pdftk-java \
  poppler \
  ghostscript
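
Before moving on, it's worth checking that the commands actually landed on your PATH. Here's a small sketch; missing_tools is a helper name I made up for illustration, not something these packages ship.

```shell
#!/bin/sh
# Print the name of every listed command that is not on PATH.
# (missing_tools is a hypothetical helper, not part of any package.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The four commands this walkthrough calls.
missing=$(missing_tools pdftotext pdfimages pdftk gs)
if [ -n "$missing" ]; then
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Not installed yet:" $missing
else
  echo "All tools found."
fi
```

If anything is listed as missing, rerun the install command for your platform before continuing.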

2. You'll hear it both ways

Let's say your PDF is named INPUT.pdf (so fancy, I know).

You can try the most basic approach first and see what you get:

pdftotext INPUT.pdf
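
If the plain run scrambles columns, or you only care about a few pages, pdftotext has options for that. A minimal sketch, assuming the poppler build of pdftotext: -layout tries to keep the physical layout of the page, and -f / -l pick the first and last page. The extract_pages function and its output file naming are my own invention for this example.

```shell
#!/bin/sh
# Extract text from a page range, keeping the physical layout.
# Usage: extract_pages INPUT.pdf FIRST LAST
# (extract_pages and the "-pFIRST-LAST.txt" naming are hypothetical.)
extract_pages() {
  pdf=$1
  first=$2
  last=$3
  out="${pdf%.pdf}-p${first}-${last}.txt"
  # -layout: maintain the original layout as well as possible
  # -f / -l: first and last page to convert
  pdftotext -layout -f "$first" -l "$last" "$pdf" "$out" &&
    printf '%s\n' "$out"
}
```

For example, extract_pages INPUT.pdf 2 5 would write pages 2 through 5 to INPUT-p2-5.txt and print that file name.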

Sometimes pdftotext will fail with an error claiming that your PDF is "protected" or has a password, whether or not it really does.

If that happens to you, rewrite the PDF with Ghostscript and run pdftotext on the copy:

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -f INPUT.pdf
pdftotext OUTPUT.pdf

Just to experiment, you might try it both ways and see which method yields better data.

You'll end up with INPUT.txt and/or OUTPUT.txt next to the PDFs.
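
Trying it both ways can be scripted. This is a rough sketch, not the one true method: try_both and the .direct / .rewritten file names are made up for this example, and it compares the two results only by byte count, which is a crude stand-in for "better data".

```shell
#!/bin/sh
# Run pdftotext on the original and on a Ghostscript-rewritten copy,
# then report how many bytes of text each run produced.
# (try_both and the output file names are hypothetical.)
try_both() {
  pdf=$1
  base="${pdf%.pdf}"

  # Way 1: straight extraction.
  pdftotext "$pdf" "$base.direct.txt" ||
    echo "direct extraction failed" >&2

  # Way 2: rewrite with Ghostscript first, then extract from the copy.
  gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
     -sOutputFile="$base.rewritten.pdf" -f "$pdf" &&
    pdftotext "$base.rewritten.pdf" "$base.rewritten.txt"

  # Print "<bytes><TAB><file>" for each result that exists.
  for f in "$base.direct.txt" "$base.rewritten.txt"; do
    if [ -f "$f" ]; then
      printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    fi
  done
}
```

Run it as try_both INPUT.pdf, then use diff INPUT.direct.txt INPUT.rewritten.txt to eyeball where the two actually differ. A bigger file isn't automatically a better one.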

3. Split Pages, Extract Images

  • pdftk will let you split a PDF into individual pages or page ranges.
  • pdfimages will sometimes pull embedded images out. I've had mixed results.
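
Here's one way those two could look in practice. It's a sketch under a few assumptions: split_and_extract and the INPUT-pages directory layout are names I invented, burst is the pdftk operation that writes one PDF per page, and -png asks pdfimages (the poppler version) to save images as PNG files.

```shell
#!/bin/sh
# Split a PDF into one file per page and pull out embedded images.
# Usage: split_and_extract INPUT.pdf
# (split_and_extract and the "-pages" directory are hypothetical names.)
split_and_extract() {
  pdf=$1
  outdir="${pdf%.pdf}-pages"
  mkdir -p "$outdir/images"

  # burst writes one PDF per page; %04d becomes the zero-padded page number.
  pdftk "$pdf" burst output "$outdir/page_%04d.pdf"

  # Save every embedded image as PNG, using "img" as the file name prefix.
  pdfimages -png "$pdf" "$outdir/images/img"
}
```

After split_and_extract INPUT.pdf you should find the single-page PDFs in INPUT-pages/ and whatever images could be recovered in INPUT-pages/images/. If you only want a page range rather than every page, pdftk INPUT.pdf cat 1-3 output first-three.pdf does that.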

