
Free OCR Using Terminal and Tesseract on OSX

Since I’m in the middle of my doctoral studies, I read A LOT of journal articles. Most of these articles are PDFs, and I use Skim to read and annotate them. However, every so often I can only obtain PDFs that are scanned images. These PDFs can’t be searched or annotated, and for my workflow this is a no-go. I know of programs that will automatically OCR (optical character recognition) documents, like DEVONthink Pro Office and PDFpen, but 1) I’m on a grad school budget and 2) I like the challenge of figuring out ways to configure and promote technology using open source resources.

So this past weekend I decided that I wanted to OCR some image-only PDFs into searchable PDFs that could also be annotated correctly. The following is the fruit of my journey so far. I don’t know if this journey is over, but I can tell you my OCR process works well enough for now. This is written for those who have never (or barely) used the Terminal app on OSX and are new to Tesseract and OCR.

A lot of credit goes to mchristy at the Early Modern OCR Project (http://emop.tamu.edu/Installing-Tesseract-Mac); you’ll notice that many, but not all, of the steps are the same as the ones he outlined for OSX 10.8.

DISCLAIMER: Attempting the process outlined below may cause problems with the operation of your computer or cause you to lose data. Consider yourself warned… you are attempting this at your own risk. I HIGHLY recommend backing up your system before you do anything like what’s described below.

I was able to build the Tesseract 3.03 release candidate from source (https://code.google.com/p/tesseract-ocr/wiki/TesseractSvnInstallation) on OSX 10.9.4. The build finished with warnings but no errors, and the resulting install works, with the caveats detailed below. Slight detour: if you know what MacPorts and Homebrew are, great, but I had trouble building Tesseract 3.03 when both were installed on my machine, so my recommendation is to use only Homebrew.

I have tested Tesseract with TIFF files (single and multiple pages) and it works well. It prints the following warning, in which the page number is always the last page of the file, but it doesn’t seem to cause a problem.

“Warning in pixReadMemTiff: tiff page 25 not found”

PNG files do not seem to work as inputs (it outputs two identically named files: one that can’t be opened and one that only has the first page of the input).

PDF files produce the following error, and I can’t remember whether Leptonica is supposed to be able to read PDF files or not. Another problem for another day.

Error in fopenReadStream: file not found
Error in pixRead: image file not found: %PDF-1.2
Image file %PDF-1.2 cannot be read!
Error during processing.

I can work on these if I find time, but since TIFF is working they aren’t a priority.
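
Since the error above shows the PDF header (%PDF-1.2) turning up where an image was expected, one cheap safeguard is to look at a file's first bytes before handing it to Tesseract. This is only a minimal sketch and not part of the process I actually ran; the function name is_pdf and the file name input.tiff are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the file starts with the
# PDF magic bytes "%PDF", fail otherwise (including when the file is missing).
is_pdf() {
    [ "$(head -c 4 "$1" 2>/dev/null)" = "%PDF" ]
}

# Example use: warn instead of passing a PDF straight to tesseract.
# "input.tiff" is a placeholder file name.
if is_pdf "input.tiff"; then
    echo "input.tiff is really a PDF; export it as a TIFF first"
fi
```

This only catches a mislabeled or unconverted PDF early; it does nothing to make PDF input work.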

OK… now that all of that is out of the way, here is the process that worked for me. For reference:

https://code.google.com/p/tesseract-ocr/wiki/TesseractSvnInstallation, which will be referred to as [1] below.

1. Open Terminal
2. Install, update, and verify Homebrew by entering the following in Terminal one at a time (i.e., hit return/enter after each):

ruby -e "$(curl -fsSL https://raw.github.com/Homebrew/homebrew/go/install)"
brew update
brew doctor

3. Make sure brew doctor comes back clean. If it reports problems, fix them before continuing.

4. Install the Tesseract dependencies listed at [1], again entering the commands one at a time. (Note: I did not need to install aclocal or autoheader from Homebrew, as they aren’t Homebrew formulas; aclocal comes with automake and autoheader comes with autoconf.)

brew install autoconf
brew install automake
brew install libtool
brew install leptonica --with-libtiff
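
Before moving on to the build, it can help to confirm that the tools it needs are actually on your PATH. The following is a rough sketch rather than anything from [1]; the function name missing_tools and the list of tools passed to it are my own assumptions about what this particular build needs.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every given command that cannot be
# found on PATH, one per line. Prints nothing if all of them are found.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Assumed list of tools used by the checkout and build steps in this post.
missing_tools autoconf automake svn make
```

If the last line prints nothing, everything in the list was found. Anything it does print is worth installing or fixing before you start the build.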

5. Assuming everything above installed without errors, run the following commands (still in Terminal, entering one at a time, and again based on the instructions in [1]):

svn checkout http://tesseract-ocr.googlecode.com/svn/trunk/ tesseract-ocr
cd tesseract-ocr
./autogen.sh
./configure
make
sudo make install
sudo make install-langs
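
A quick way to confirm the install landed is to ask Tesseract for its version and its installed languages. This is a small sketch assuming the tesseract binary ended up on your PATH; the wrapper function check_tesseract is a name I made up, while --version and --list-langs are regular Tesseract options.

```shell
#!/bin/sh
# Hypothetical wrapper: report the installed Tesseract version and languages,
# or say so (and return non-zero) if the binary is not on PATH.
check_tesseract() {
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH"
        return 1
    fi
    # Redirect stderr to stdout so the output is captured wherever it goes.
    tesseract --version 2>&1
    tesseract --list-langs 2>&1
}

check_tesseract || true
```

If the language list comes back empty, the install-langs step probably did not complete.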

6. Assuming you don’t get any failures or errors, you can then test whether or not your OCR works by doing the following.

A. If you don’t have a .TIFF file, open your PDF in Preview and export it as a *.TIFF (I use 300 pixels per inch).
B. In Terminal navigate to the directory that contains the .TIFF file from above.
C. Use the following command in Terminal. (Note: change inputfilename.tiff, outputfilename, and outputfiletype to your own input/output file names and the file type you want to output.) (Another note: Tesseract defaults its output to .TXT files if you don’t specify an output file type.)

tesseract inputfilename.tiff outputfilename outputfiletype

For example: “tesseract mytiff.tiff mysearchablepdf pdf” should make “mytiff.tiff” into a searchable pdf with the name “mysearchablepdf.pdf” and save it into whatever directory you issue the tesseract command from in Terminal.
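
If you have a whole folder of TIFFs, the same command can be wrapped in a loop. This is just a sketch of how I would approach it, not something I have run on a large batch; the function names output_base and ocr_all_tiffs are made up, and it assumes each searchable PDF should take its name from the matching TIFF.

```shell
#!/bin/sh
# Hypothetical helper: strip a trailing .tiff or .tif so the result can be
# used as Tesseract's output name (Tesseract adds the extension itself).
output_base() {
    name=$1
    name=${name%.tiff}
    name=${name%.tif}
    echo "$name"
}

# Hypothetical helper: OCR every TIFF in the current directory into a
# searchable PDF with the same base name.
ocr_all_tiffs() {
    for f in *.tiff *.tif; do
        [ -e "$f" ] || continue   # skip a pattern that matched no files
        tesseract "$f" "$(output_base "$f")" pdf
    done
}
```

To use it, navigate to the directory that holds your TIFF files, paste the two functions into Terminal, and then enter ocr_all_tiffs.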

Whew!! That was a lot. Hopefully this helps someone.