You can check it out here: http://sourceforge.net/projects/tesseract-ocr

Tesseract is a tool HP released a while ago for recognizing text within images.

I just compiled and ran it. It seems cool, but it only works with TIFF files, and so far I’ve only gotten it working with the sample file they include. On Ubuntu with the standard GCC toolchain it compiles without problems.

Just run ‘./configure’ and then ‘make’. ‘make install’ does not seem to be supported yet, and the executable is created in the ccmain directory.
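For the record, here is a minimal sketch of those two steps wrapped in a shell function. The function name `build_tesseract` is made up for illustration, and the binary name `ccmain/tesseract` is an assumption based on where the executable showed up for me.

```shell
#!/bin/sh
# Hypothetical helper: build Tesseract inside an unpacked source tree
# and print the path of the resulting binary.
build_tesseract() {
    src_dir=$1
    # Run both steps in a subshell so the caller's working directory is untouched.
    ( cd "$src_dir" && ./configure && make ) || return 1
    # Assumption: the binary is named 'tesseract' and is left in ccmain/,
    # since 'make install' does not seem to be supported yet.
    if [ -x "$src_dir/ccmain/tesseract" ]; then
        printf '%s\n' "$src_dir/ccmain/tesseract"
    else
        return 1
    fi
}
```

Calling `build_tesseract /path/to/source` should print the path of the binary if the build succeeds, and return a non-zero status otherwise.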

I tried taking some PNGs with plain text on a white background and using the ‘convert’ tool to convert them to TIFF. A TIFF file is created, but the OCR cannot read it. Oh well.
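For what it’s worth, here is a rough sketch of a conversion that might give the OCR a better chance, assuming ImageMagick’s ‘convert’ is installed. The 50% threshold, the white border, and the uncompressed output are guesses at what Tesseract wants, not something I have confirmed, and `png_to_ocr_tiff` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: turn a PNG into a black-and-white, uncompressed TIFF
# with a white margin, and print the name of the file it wrote.
png_to_ocr_tiff() {
    in=$1
    out=${in%.png}.tif
    # Grayscale, then hard-threshold to pure black and white (50% is a guess),
    # pad with a 10-pixel white border, and write without compression.
    convert "$in" \
        -colorspace Gray \
        -threshold 50% \
        -bordercolor white \
        -border 10 \
        -compress none \
        "$out" || return 1
    printf '%s\n' "$out"
}
```

Running `png_to_ocr_tiff page.png` would write `page.tif` next to the input; the threshold will probably need tuning per image.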

Anyone else using this? Getting interesting results?

4 Responses to “Tesseract: Google re-releases HP’s OCR tool”

  1. david dahl Says:

    I did the same thing… same results. Did you try making the TIFF a bitonal image with ImageMagick? I would love to get this tool integrated with Python.

  2. nemik Says:

    I eventually did get something, but the detection is still pretty bad. I used the ‘convert’ command to raise the contrast a lot to get something working, but it only recognizes a few characters.

    Maybe Google will improve the OCR for its book publishing thing, but something tells me they won’t release the improvements if they do.

  3. Steve Says:

    I got it to work pretty well for .pgm files for subtitle ripping from DVDs. Seemed to slightly outperform gocr. The files I am working with are strictly black and white, and I had to add a white border to the .tif file for it to function properly.

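    #!/bin/bash

    INPUT=$1
    TITLE=$2
    # Named SUBLANG rather than LANG so the locale variable is not overwritten.
    SUBLANG=$3
    COLOR=$4

    # Look up the subtitle stream ID for the requested language.
    SUBTITLE=$(mplayer -dvd-device "$INPUT" dvd://1 -vo null -ao null -frames 0 -v 2>&1 | grep sid | grep "$SUBLANG" | awk -F' ' '{print "0x" 20+$5}')

    echo "$SUBTITLE"

    # Extract the raw subtitle stream from the chosen title.
    tccat -i "$INPUT" -T "$TITLE" -L | tcextract -x ps1 -t vob -a "$SUBTITLE" > "subs-$SUBLANG"

    subtitle2pgm -o "$SUBLANG" -c "$COLOR" "$SUBLANG.srt"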

  4. Steve Says:

    I guess that comment was too long.

    Anyway, the tesseract part was:

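    # Assumes the .pgm files produced above are in the current directory.
    for f in *.pgm; do
        i=${f%.pgm}
        convert "$i.pgm" \
            -bordercolor white \
            -border 10 \
            temp.tif
        tesseract temp.tif "$i.pgm" batch
    done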
