Tesseract: Google re-releases HP’s OCR tool
September 5th, 2006
You can check it out here: http://sourceforge.net/projects/tesseract-ocr
Tesseract is a tool HP released a while ago for recognizing text within images.
I just compiled and ran it. It seems cool, but it only works with TIFF files, and so far I’ve only gotten it working with the sample file they include. On Ubuntu with the standard GCC toolchain, it compiles with no problem.
Just run ‘./configure’ and then ‘make’. ‘make install’ does not seem to be supported yet, and the executable gets created in the ccmain directory.
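As a rough sketch, those build steps could be wrapped in a small shell function. The source directory argument, the function name, and the executable name `tesseract` inside ccmain are assumptions for illustration; only ‘./configure’, ‘make’, and the ccmain location come from the steps above.

```shell
#!/bin/sh
# Sketch: build Tesseract in place and print where the binary should end up.
# Assumption: the executable is named "tesseract" inside ccmain/.
build_tesseract() {
    src_dir=$1
    (
        cd "$src_dir" || exit 1
        ./configure >/dev/null || exit 1
        make >/dev/null || exit 1
        # There is no 'make install' step, so the binary stays in the build tree.
        [ -x ccmain/tesseract ] || exit 1
        printf '%s\n' "$src_dir/ccmain/tesseract"
    )
}
```

Calling `build_tesseract ./tesseract-src` (a hypothetical path) would print the location of the freshly built binary, or fail if any step does.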
I tried taking some PNGs with plain text on a white background and using the ‘convert’ tool to convert them to TIFF. A TIFF is created, but the OCR cannot read it. Oh well.
Anyone else using this? Getting interesting results?
September 15th, 2006 at 6:58 am
I did the same thing… same results. Did you try making the TIFF a bitonal image with ImageMagick? I would love to get this tool integrated with Python.
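For reference, a bitonal conversion with ImageMagick might look something like the sketch below. The function name, the 50% threshold, and the choice to write an uncompressed TIFF are all assumptions here, not settings confirmed to work with Tesseract.

```shell
#!/bin/sh
# Sketch: convert an image to a black-and-white, uncompressed TIFF.
# Assumptions: the 50% threshold and "-compress none" are guesses,
# not values known to make Tesseract read the file.
to_bitonal_tiff() {
    # $1 = input image (e.g. a PNG), $2 = output TIFF
    convert "$1" -colorspace Gray -threshold 50% -compress none "$2"
}
```

Usage would be along the lines of `to_bitonal_tiff page.png page.tif`, where both file names are placeholders.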
September 15th, 2006 at 4:42 pm
I eventually did get something, but the detection still sucks pretty bad. I used the ‘convert’ command to boost the contrast a lot to get something working, but it only recognizes a few characters.
Maybe Google will fix it for nicer OCR for its book publishing thing, but something tells me they won’t release it if they do.
September 24th, 2006 at 1:50 pm
I got it to work pretty well for .pgm files for subtitle ripping from DVDs. Seemed to slightly outperform gocr. The files I am working with are strictly black and white, and I had to add a white border to the .tif file for it to function properly.
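#!/bin/bash
# Arguments: DVD device, title number, subtitle language, color setting
INPUT=$1
TITLE=$2
LANG=$3
COLOR=$4
# Find the subtitle stream ID for the requested language in mplayer's verbose output
SUBTITLE=$(mplayer -dvd-device "$INPUT" dvd://1 -vo null -ao null -frames 0 -v 2>&1 | grep sid | grep "$LANG" | awk -F' ' '{print "0x" 20+$5}')
echo "$SUBTITLE"
# Extract the raw subtitle stream from the chosen title
tccat -i "$INPUT" -T "$TITLE" -L | tcextract -x ps1 -t vob -a "$SUBTITLE" > "subs-$3"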
subtitle2pgm -o $3 -c $4 $LANG.srt
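A note on the stream ID arithmetic: the awk expression in the script glues "0x" onto 20 plus the subtitle index, which only yields the intended hex value for indexes 0 through 9 (index 10 would come out as 0x30 rather than 0x2a). A small sketch that does the addition in hex instead, assuming DVD subtitle streams are numbered upward from 0x20; the function name is made up for illustration.

```shell
#!/bin/sh
# Sketch: map a subtitle index (sid) to its stream ID.
# Assumption: subtitle stream IDs start at 0x20 and count up by one.
sub_stream_id() {
    # $1 = decimal subtitle index as reported by mplayer
    printf '0x%02x\n' $((0x20 + $1))
}
```

With this, `sub_stream_id 3` gives 0x23 and `sub_stream_id 10` gives 0x2a.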
September 24th, 2006 at 1:52 pm
I guess that comment was too long.
Anyway, the tesseract part was:
convert "$i.pgm" \
    -bordercolor white \
    -border 10 \
    temp.tif
tesseract temp.tif "$i.pgm" batch
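The snippet above uses $i without showing the loop around it. A minimal sketch of what that loop could look like, assuming the .pgm files all sit in one directory and that tesseract appends .txt to the output name it is given; the function name and the directory argument are hypothetical.

```shell
#!/bin/sh
# Sketch: add a white border to every .pgm file in a directory and OCR it.
# Assumption: tesseract writes its result to "<output name>.txt".
ocr_pgm_dir() {
    dir=$1
    tmp_tif="$dir/temp.tif"
    for f in "$dir"/*.pgm; do
        [ -e "$f" ] || continue   # the glob matched nothing
        # Pad the image with a 10-pixel white border, writing a TIFF
        convert "$f" -bordercolor white -border 10 "$tmp_tif"
        # Same argument order as above: input TIFF, output name, batch
        tesseract "$tmp_tif" "$f" batch
    done
    rm -f "$tmp_tif"
}
```

Running `ocr_pgm_dir ./subs` (a placeholder directory) should leave one text file next to each .pgm file.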