Extract text from PDFs (even protected ones)
We're gonna look at how to extract text from PDFs. It's not foolproof, but it's super simple and it works most of the time.
1. Get the tools
Assuming that you're on Ubuntu Linux:
sudo apt-get install --yes \
  pdftk \
  poppler-utils \
  ghostscript
Or if you're on OS X (poppler is what provides pdftotext and pdfimages):
brew install \
  pdftk \
  poppler \
  gs
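Before moving on, it can save some head-scratching to confirm that everything actually landed on your PATH. Here's a minimal sketch; the function name missing_tools is my own invention, not part of any of these packages:

```shell
# Print each named tool that is NOT on the PATH, one per line.
# (missing_tools is a hypothetical helper name, used only for illustration.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Example: missing_tools pdftk pdftotext pdfimages gs
```

If it prints nothing, you should be good to go.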
2. You'll hear it both ways
Let's say your PDF is named INPUT.pdf (so fancy, I know).
You can try the most basic approach first and see what you get:
pdftotext INPUT.pdf
Whether or not your PDF really is "protected" or has a password, sometimes you'll get an error claiming that it is.
If that happens to you, rewrite the PDF with Ghostscript and extract the text from the copy:
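gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf -c .setpdfwrite -f INPUT.pdf
pdftotext OUTPUT.pdf
(Newer Ghostscript releases have dropped .setpdfwrite. If gs complains that it's undefined, try leaving out the -c .setpdfwrite part.)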
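If you do this a lot, the "try the easy way, then fall back" routine could be wrapped up something like this. Treat it as a sketch under a few assumptions: the function name extract_text and the .rewritten.pdf suffix are made up for illustration, it relies on pdftotext exiting non-zero when it refuses a file, and it leaves out -c .setpdfwrite on the assumption that your Ghostscript doesn't need it.

```shell
# Try plain pdftotext first; if that fails, rewrite the PDF with
# Ghostscript and run pdftotext on the rewritten copy instead.
# Prints the name of the text file that was produced.
# (extract_text and the ".rewritten.pdf" suffix are hypothetical names.)
extract_text() {
  input="$1"
  rewritten="${input%.pdf}.rewritten.pdf"

  # With no output argument, pdftotext writes NAME.txt next to NAME.pdf.
  if pdftotext "$input" 2>/dev/null; then
    echo "${input%.pdf}.txt"
    return 0
  fi

  # Fallback: have Ghostscript write a fresh copy of the PDF, then retry.
  gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
    -sOutputFile="$rewritten" -f "$input" || return 1
  pdftotext "$rewritten" || return 1
  echo "${rewritten%.pdf}.txt"
}

# Example: extract_text INPUT.pdf
```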
Just to experiment, you might try it both ways and see which method yields better data: INPUT.txt or OUTPUT.txt.
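One quick-and-dirty way to compare the two results is to count words. This is only a rough proxy (more words doesn't guarantee better text), and the function name richer_text is my own, but it can hint at which file deserves a closer look:

```shell
# Print the name of whichever text file contains more words.
# Ties go to the first file. Word count is only a rough stand-in for quality.
# (richer_text is a hypothetical helper name.)
richer_text() {
  # $(( ... )) strips the padding some versions of wc add to their output.
  a_words=$(( $(wc -w < "$1") ))
  b_words=$(( $(wc -w < "$2") ))
  if [ "$b_words" -gt "$a_words" ]; then
    echo "$2"
  else
    echo "$1"
  fi
}

# Example: richer_text INPUT.txt OUTPUT.txt
```

You'll still want to open the winner and eyeball it.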
3. Split Pages, Extract Images
pdftk will let you split a PDF into individual pages.
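For the page-splitting part, here's a sketch using pdftk's burst and cat operations. The function names and the output naming pattern are my own picks for illustration:

```shell
# Write every page to its own file: INPUT-page-001.pdf, INPUT-page-002.pdf, ...
# (burst also drops a doc_data.txt metadata file in the current directory.)
# split_pages and page_range are hypothetical helper names.
split_pages() {
  pdftk "$1" burst output "${1%.pdf}-page-%03d.pdf"
}

# Copy a page range (for example 2-5) from one PDF into a new PDF.
page_range() {
  pdftk "$1" cat "$2" output "$3"
}

# Examples:
#   split_pages INPUT.pdf
#   page_range INPUT.pdf 2-5 PAGES-2-5.pdf
```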
pdfimages will sometimes pull embedded images out. I've had a mixed experience.
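For pulling out images, a minimal sketch with pdfimages from poppler-utils might look like this. The function name, the default images directory, and the img prefix are assumptions of mine:

```shell
# Save the embedded images from a PDF as PNG files in a directory.
# Files come out as DIR/img-000.png, DIR/img-001.png, and so on.
# (extract_images, "images", and the "img" prefix are hypothetical choices.)
extract_images() {
  pdf="$1"
  outdir="${2:-images}"
  mkdir -p "$outdir"
  pdfimages -png "$pdf" "$outdir/img"
}

# Examples:
#   pdfimages -list INPUT.pdf      # just list what is embedded
#   extract_images INPUT.pdf out   # write PNGs into ./out
```

Running pdfimages -list first is a cheap way to find out whether there's anything worth extracting.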
By AJ ONeal