Extract text from PDFs (even protected ones)

We're gonna be looking at how to extract text from PDFs. It's not foolproof, but it's super simple and it works most of the time.

1. Get the tools

Assuming that you're on Ubuntu Linux

sudo apt-get install --yes \
  pdftk \
  poppler-utils \
  ghostscript

Or if you're on OS X

brew install \
  pdftk \
  poppler \
  gs
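
Before going further, it may be worth checking that everything actually landed on your PATH. This is just a small sketch; the `missing_tools` helper is a name made up for this post, not part of any of the packages above.

```shell
# Hypothetical helper: print every named tool that is NOT found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The four commands this post relies on; no output means all were found.
missing_tools pdftotext pdfimages pdftk gs
```

If it prints a name, that tool still needs to be installed before the matching step will work.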

Let's say your PDF is named INPUT.pdf (so fancy, I know)

You can try the most basic approach first and see what you get:

pdftotext INPUT.pdf
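
If the plain output comes out jumbled (tables and multi-column pages are the usual suspects), the `-layout` flag of pdftotext is worth a try, since it tries to keep the text where it sat on the page. Here's a rough sketch; `extract_text` is a hypothetical wrapper named for this post, and the default output name simply mirrors what pdftotext does on its own.

```shell
# Hypothetical wrapper: extract text while preserving the page layout.
# $1 = input PDF, $2 = optional output file (defaults to the PDF name with .txt)
extract_text() {
  pdf="$1"
  txt="${2:-${pdf%.pdf}.txt}"
  pdftotext -layout "$pdf" "$txt"
}

# Usage:
#   extract_text INPUT.pdf              # writes INPUT.txt
#   extract_text INPUT.pdf layout.txt   # writes layout.txt
```

Whether `-layout` helps or hurts depends on the document, so compare it against the plain run.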

Whether or not your PDF is actually "protected" or has a password, sometimes pdftotext will error out claiming that it is.

If that happens to you, try this:

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -c .setpdfwrite -f INPUT.pdf
pdftotext OUTPUT.pdf

Just to experiment, you might try it both ways and see which method yields better data.

You'll end up with INPUT.txt and/or OUTPUT.txt, since pdftotext writes the text next to the PDF with a .txt extension.
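
To compare the two results without reading both files end to end, a crude first pass is to count words. This is only a sketch and it assumes that "more words" roughly means "better extraction", which won't always hold; `word_count` and `richer_text` are hypothetical helper names.

```shell
# Hypothetical helper: number of words in a file, or 0 if the file is missing.
word_count() {
  if [ -f "$1" ]; then
    wc -w < "$1" | tr -d '[:space:]'
  else
    echo 0
  fi
}

# Hypothetical helper: print whichever of two files has more words
# (the first one wins a tie).
richer_text() {
  a=$(word_count "$1")
  b=$(word_count "$2")
  if [ "$b" -gt "$a" ]; then echo "$2"; else echo "$1"; fi
}

# Usage, after running both methods:
#   richer_text INPUT.txt OUTPUT.txt
```

It's still worth skimming the winner by eye, since a file full of garbage characters can have a high word count too.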

pdftk will let you split a PDF into separate pages.
pdfimages will sometimes pull embedded images out. I've had mixed results with it.
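
As a starting point for both of those, here's a minimal sketch. The function names and the output file names are made up for illustration. The underlying commands are the `burst` and `cat` operations of pdftk and the `-list` and `-png` options of pdfimages (the pdfimages options assume a reasonably recent poppler-utils).

```shell
# Hypothetical wrapper: split every page into its own file
# (pg_0001.pdf, pg_0002.pdf, ... unless another pattern is given as $2).
split_pages() {
  pdftk "$1" burst output "${2:-pg_%04d.pdf}"
}

# Hypothetical wrapper: copy a page range such as 2-5 into a new PDF.
# $1 = input PDF, $2 = page range, $3 = output PDF
page_range() {
  pdftk "$1" cat "$2" output "$3"
}

# Hypothetical wrapper: show which images the PDF contains, without extracting.
list_images() {
  pdfimages -list "$1"
}

# Hypothetical wrapper: write the embedded images as PNGs into a directory,
# named img-000.png, img-001.png, ...
extract_images() {
  mkdir -p "$2"
  pdfimages -png "$1" "$2/img"
}

# Usage:
#   split_pages INPUT.pdf
#   page_range INPUT.pdf 2-5 pages-2-to-5.pdf
#   list_images INPUT.pdf
#   extract_images INPUT.pdf images
```

Running `list_images` first is a cheap way to find out whether there is anything worth extracting at all.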

By AJ ONeal
