This tutorial explains how to extract all text from a PDF, including text inside images, using a combination of Ghostscript and a command-line OCR tool called tesseract-ocr. This is yet another guest post by StoneCut.
First we need to convert our PDF to a TIFF image so that we can run OCR on it. We need Ghostscript for that. It's probably already installed on your system, but just to be sure you can run:
sudo apt-get install ghostscript
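If you would rather check before installing, something like the following could do it. This is only a minimal sketch: the helper name have_tool is made up for illustration and is not part of Ghostscript.

```shell
# Hypothetical helper (name is an assumption): succeeds if the command is on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool gs; then
    # gs --version prints just the version number
    echo "Ghostscript found: version $(gs --version)"
else
    echo "Ghostscript not found; install it with: sudo apt-get install ghostscript"
fi
```

The same helper can be reused later to check for tesseract (have_tool tesseract).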
Once Ghostscript is installed, we can convert the actual PDF using the gs utility:
gs -dNOPAUSE -sDEVICE=tiffg4 -r600x600 -dBATCH -sPAPERSIZE=a4 -sOutputFile=Output_File_Name.tif Name_of_PDF.pdf
Replace "Name_of_PDF.pdf" and "Output_File_Name.tif" in the above command with your own file names. The tiffg4 device writes a black-and-white TIFF, and -r600x600 sets the resolution to 600 dpi.
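If you convert PDFs often, the gs command above could be wrapped in a small shell function that derives the TIFF name from the PDF name. The following is a rough sketch, not a polished script; the function names tiff_name_for and pdf_to_tiff are invented for this example, and the gs options are exactly the ones used above.

```shell
# Hypothetical helper: report.pdf -> report.tif
tiff_name_for() {
    printf '%s.tif\n' "${1%.pdf}"
}

# Hypothetical wrapper around the gs command from this tutorial.
pdf_to_tiff() {
    pdf=$1
    tif=$(tiff_name_for "$pdf")
    gs -dNOPAUSE -dBATCH -sDEVICE=tiffg4 -r600x600 -sPAPERSIZE=a4 \
       -sOutputFile="$tif" "$pdf"
}
```

With this loaded in your shell, running pdf_to_tiff Name_of_PDF.pdf should write Name_of_PDF.tif next to the original.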
This leaves us with one large TIFF file (mine was about 10x as big as the original PDF), which we will now run through OCR (Optical Character Recognition). We're going to use "tesseract-ocr" for that, but we need to install it first:
sudo apt-get install tesseract-ocr tesseract-ocr-eng
The package "tesseract-ocr-eng" provides English language recognition and is REQUIRED for tesseract-ocr to work, no matter what locale your system uses. Support for other languages is available in packages named after their three-letter language code, such as "tesseract-ocr-deu" for German.
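The package naming pattern described above can be sketched as a few shell helpers. The function names lang_package, install_lang and list_langs are made up for illustration. The --list-langs option is assumed to be available, which should be true for reasonably recent tesseract versions but may not be for very old ones.

```shell
# Hypothetical helper: map a three-letter language code to its package name.
lang_package() {
    printf 'tesseract-ocr-%s\n' "$1"
}

# Install support for one extra language, e.g.: install_lang deu
install_lang() {
    sudo apt-get install "$(lang_package "$1")"
}

# Show the languages tesseract can currently use
# (assumes a tesseract version that supports --list-langs).
list_langs() {
    tesseract --list-langs
}
```

For example, install_lang deu followed by list_langs should show "deu" in the list if the installation worked.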
Let's finally convert our big TIFF file to a TXT file including all the text, even that from images in the original PDF:
tesseract Output_File_Name.tif Name_of_TXT -l eng
Again, change "Output_File_Name.tif" to the filename you originally chose and substitute your desired TXT file name for "Name_of_TXT". Leave out the .txt extension, because tesseract adds it itself. If your PDF isn't in English, replace "eng" in "-l eng" with the code of the language package you installed earlier.
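As a rough illustration of the points above, the tesseract call could be wrapped so that the output name is derived from the TIFF name and the language defaults to English. The function name tiff_to_txt is an assumption made for this sketch.

```shell
# Hypothetical wrapper: OCR a TIFF into <basename>.txt, language defaults to eng.
tiff_to_txt() {
    tif=$1
    lang=${2:-eng}
    base=${tif%.tif}
    # tesseract appends ".txt" to the output base name on its own
    tesseract "$tif" "$base" -l "$lang"
}
```

Calling tiff_to_txt Output_File_Name.tif would then produce Output_File_Name.txt, and tiff_to_txt Output_File_Name.tif deu would do the same using the German language data. Newer tesseract versions also accept several languages joined with a plus sign, such as eng+deu, though that may not work on older installations.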
That's it. Check out the resultant TXT file.
Please note: the quality of extracted text from the images inside the PDF is highly dependent on the quality of the original PDF's images.
This is a guest post written by StoneCut (thank you very much!).