Extract text from PDFs (even protected ones)
Published 2015-2-13
We're gonna be looking at how to extract text from PDFs. It's not fool-proof, but it's super simple and it does work most of the time.
1. Get the tools
Assuming that you're on Ubuntu Linux
sudo apt-get install --yes \
pdftk \
poppler-utils \
ghostscript
Or if you're on OS X
brew install \
pdftk \
poppler \
gs
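Before moving on, it can help to confirm that the commands actually landed on your PATH. Here is a minimal sketch; `missing_tools` is a made-up helper name for illustration, not something any of these packages ship.

```shell
# missing_tools (hypothetical helper): print every named command that
# cannot be found on PATH, and return non-zero if any are missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            status=1
        fi
    done
    return "$status"
}
```

You would call it as `missing_tools pdftotext pdfimages pdftk gs` and install whatever it reports.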
2. You'll hear it both ways
Let's say your PDF is named INPUT.pdf (so fancy, I know)
You can try the most basic approach first and see what you get:
pdftotext INPUT.pdf
Whether or not your PDF is actually "protected" or has a password, pdftotext will sometimes fail with an error claiming that it is.
If that happens to you, try this:
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=OUTPUT.pdf INPUT.pdf
pdftotext OUTPUT.pdf
Just to experiment, you might try it both ways and see which method yields better data.
You'll see INPUT.txt and/or OUTPUT.txt
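If you'd rather not run the two approaches by hand, they can be chained: try pdftotext first and only fall back to the Ghostscript rewrite when it fails. This is a rough sketch, not a definitive script. `extract_text` and the `.rewritten.pdf` suffix are names I made up for illustration, and it assumes pdftotext exits non-zero when it refuses a file. It uses the same pdfwrite device as the command above.

```shell
# extract_text (hypothetical helper): write the text of "$1" to a .txt
# file next to it and print that file's name.
extract_text() {
    input="$1"
    base="${input%.pdf}"

    # First attempt: plain pdftotext with an explicit output file.
    if pdftotext "$input" "$base.txt"; then
        echo "$base.txt"
        return 0
    fi

    # Fallback: let Ghostscript re-write the PDF, then extract from the copy.
    rewritten="$base.rewritten.pdf"
    gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
        -sOutputFile="$rewritten" "$input" || return 1
    pdftotext "$rewritten" "$base.txt" || return 1
    echo "$base.txt"
}
```

Calling `extract_text INPUT.pdf` would leave you with INPUT.txt either way, so this version does not let you compare the two results side by side.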
3. Split Pages, Extract Images
pdftk will let you split pages. TODO: show example
pdfimages will sometimes pull embedded images out. I've had a mixed experience. TODO: show example
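Here is one way the two tools might be combined, as a sketch under a few assumptions: `split_and_extract`, the `out` directory, and the `page_` / `img` prefixes are names I picked for illustration. It relies on pdftk's `burst` operation (one PDF per page) and on the `-png` option of pdfimages.

```shell
# split_and_extract (hypothetical helper): split "$1" into single-page
# PDFs and dump its embedded images under the directory "$2" (default: out).
split_and_extract() {
    input="$1"
    outdir="${2:-out}"
    mkdir -p "$outdir/pages" "$outdir/images" || return 1

    # One PDF per page: page_001.pdf, page_002.pdf, ...
    pdftk "$input" burst output "$outdir/pages/page_%03d.pdf" || return 1

    # Embedded images as PNG files: img-000.png, img-001.png, ...
    pdfimages -png "$input" "$outdir/images/img" || return 1
}
```

Run it as `split_and_extract INPUT.pdf`. If you only want a range of pages in one file, `pdftk INPUT.pdf cat 1-3 output FIRST3.pdf` is the usual alternative to `burst`.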
By AJ ONeal