
Tesseract: Google re-releases HP’s OCR tool

You can check it out here: http://sourceforge.net/projects/tesseract-ocr

Tesseract is a tool HP released a while ago for recognizing text within images.

I just compiled and ran it. It seems cool, but it only works with TIFF files, and so far I’ve only gotten it working with the sample file they include. On Ubuntu with the standard GCC toolchain, it compiles with no problems.

Just run ‘./configure’, then ‘make’. ‘make install’ does not seem to be supported yet, and the executable is created in the ccmain directory.

I tried taking some PNGs with plain text on a white background and using the ‘convert’ tool to convert them to TIFF. A TIFF is created, but the OCR cannot read it. Oh well.
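For anyone who wants to experiment with the conversion step, here is a rough sketch of a variation that might be worth trying: forcing the image to pure black and white and writing the TIFF without compression. The function names tif_name and to_bilevel_tif are made up for illustration, the 50% threshold is a guess, and the idea that Tesseract wants a bilevel, uncompressed TIFF is an assumption on my part, not something I have confirmed. The ‘convert’ options themselves are standard ImageMagick ones.

```shell
#!/bin/bash

# Hypothetical helper: derive the .tif output name from an input file name
# by replacing everything after the last dot.
tif_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Hypothetical helper: convert one image to a black-and-white TIFF.
# Assumption: a bilevel, uncompressed TIFF is easier for Tesseract to read.
to_bilevel_tif() {
    in=$1
    out=$(tif_name "$in")
    # Grayscale first, then cut to pure black/white at 50% brightness
    # (the threshold value is a guess and may need tuning per image).
    convert "$in" -colorspace Gray -threshold 50% \
        -type Bilevel -compress none "$out"
}
```

With ImageMagick installed, calling ‘to_bilevel_tif sample.png’ should write sample.tif next to the input, which can then be handed to the tesseract executable.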

Anyone else using this? Getting interesting results?

4 Comments

  1. david dahl wrote:

    I did the same thing… same results. Did you try making the TIFF a bitonal image with ImageMagick? I would love to get this tool integrated with Python.

    Friday, September 15, 2006 at 6:58 am
  2. nemik wrote:

    I eventually did get something but the detection still sucks pretty bad. I used the ‘convert’ command to adjust the contrast a lot to get something working but it only recognizes a few characters.

    Maybe Google will fix it for nicer OCR for its book publishing thing, but something tells me they won’t release it if they do.

    Friday, September 15, 2006 at 4:42 pm
  3. Steve wrote:

    I got it to work pretty well for .pgm files for subtitle ripping from DVDs. Seemed to slightly outperform gocr. The files I am working with are strictly black and white, and I had to add a white border to the .tif file for it to function properly.

    #!/bin/bash

    INPUT=$1
    TITLE=$2
    LANG=$3
    COLOR=$4

    SUBTITLE=`mplayer -dvd-device $INPUT dvd://1 -vo null -ao null -frames 0 -v 2>&1 | grep sid | grep $LANG | awk -F' ' '{print "0x" 20+$5}'`

    echo $SUBTITLE

    tccat -i $INPUT -T $TITLE -L | tcextract -x ps1 -t vob -a $SUBTITLE > subs-$3

    subtitle2pgm -o $3 -c $4 $LANG.srt

    Sunday, September 24, 2006 at 1:50 pm
  4. Steve wrote:

    I guess that comment was too long.

    Anyway, the tesseract part was:

    convert $i.pgm \
    -bordercolor white \
    -border 10 \
    temp.tif
    tesseract temp.tif $i.pgm batch

    Sunday, September 24, 2006 at 1:52 pm
