Thursday, June 7, 2012

Tesseract training cheatsheet


As I wrote last time, tesseract-ocr is an open source OCR library originally developed by HP Labs and today maintained by Google. You can train Tesseract to support more languages and fonts. I trained it to read my handwriting and reached over 90% accuracy, though this still means there is an error roughly once in every 10 characters. The Tesseract training page explains how you can train it yourself. This post shares some of the conclusions and pitfalls I found in my experiment.

  • The Tesseract training page is your friend. Follow the instructions!

  • The instructions work. Even if they are full of details and you are sure you (or the authors) got something wrong, remember that if you follow them carefully they will work for you.

  • One can't overestimate the importance of a high quality picture for the OCR process. Use a good camera and make sure there is enough light. If you control the written text, good paper and a thick pen also help.

  • Scanner apps for mobile (CamScanner is my favorite) are critical, though they do not replace the need for a quality picture.

  • If you write on blank paper, all text should be well aligned to virtual lines. Also, no letter should stick out of its line (for example, watch out not to write the letter 'p' too high above the other letters).

  • If you control the written text, consider developing your own "font" so that ambiguous letters are clearly differentiated. For example, I decided to put an underline below all numbers and also under the letter n, which can be mistaken for h or r.

  • I used jTessBoxEditor for the box files. Its advantage over the other editors is that it supports multipage TIFF files, which can be a good process to follow.

  • When you auto-generate the box file, it may miss some letters entirely: as opposed to getting a letter wrong, it does not detect that there is a letter at all. In my experience there is no point in manually adding a box for such a letter, since it will never be identified. If too many letters are missed, you need to improve the quality of the photo or the ink, and also make sure these letters are aligned on the same line as the other letters.

  • If text lines start in the middle of a row, or if they are not nicely aligned one under the other, there is a good chance Tesseract will get them wrong.

  • Sometimes Tesseract got the last line wrong, but when I added a dummy line below it, the real last line was recognized well.

  • When trying to recognize multi-line text I got better results than when trying a single line.

  • When trying a single line I got better results when the image was not too large (so if the camera creates big pictures, it is better to resize them).

  • When you create a box file, make sure to use some existing language dictionary (if there is one) to bootstrap the identification. It does not matter which language you use, since Tesseract only uses it to generate the box file and it will not affect the final dictionary.

  • ImageMagick can be used to combine several images into a multipage TIFF file:

    convert.exe img1.bmp img2.jpg -adjoin res.tiff
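
The bootstrapping bullet above can be sketched as a small Python helper that only builds the box-generation command line; it does not run Tesseract. The `tesseract <image> <output base> -l <lang> batch.nochop makebox` form is assumed from the 3.0x training instructions, and the file name `hw.myfont.exp0.tif` and the `eng` bootstrap language are hypothetical, so check the training page for your version before relying on it.

```python
from pathlib import Path


def makebox_command(image, bootstrap_lang="eng", exe="tesseract"):
    """Build (but do not run) the command that auto-generates a box file.

    Assumed 3.0x training form:
        tesseract <image> <output base> -l <lang> batch.nochop makebox
    """
    image = Path(image)
    # Tesseract appends ".box" to the output base, so drop the image suffix.
    base = image.with_suffix("")
    return [exe, str(image), str(base), "-l", bootstrap_lang,
            "batch.nochop", "makebox"]


if __name__ == "__main__":
    # Hypothetical file name following the lang.font.expN convention.
    print(" ".join(makebox_command("hw.myfont.exp0.tif")))
```

Returning a list instead of a string means the result can be handed to `subprocess.run` without any shell quoting.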

    Common Errors

    Error: Illegal short name for a feature!
    signal_termination_handler:Error:Signal_termination_handler called:Code 2000

    I got this error after the .box file got corrupted for some reason. I opened it and did a "binary search": I deleted a different part of it each time and tried to build again, until I found the bad line. Typically the bad line exists because Tesseract identified some very tiny dots as letters.
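
Before bisecting by hand, a quick scan for suspiciously small boxes may find the offending line faster. This is a minimal sketch, assuming the plain 3.0x box format of one `symbol left bottom right top page` entry per line. The function name and the `min_size` threshold of 3 pixels are hypothetical choices, not anything Tesseract defines.

```python
def find_suspect_boxes(lines, min_size=3):
    """Return (line_number, text) for malformed or very small boxes."""
    suspects = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\r\n")
        if not text.strip():
            continue  # ignore blank lines
        parts = text.split()
        try:
            # Fields 1..4 are assumed to be left, bottom, right, top.
            left, bottom, right, top = (int(p) for p in parts[1:5])
        except ValueError:
            # Missing or non-numeric coordinates: report the line as-is.
            suspects.append((number, text))
            continue
        if right - left < min_size or top - bottom < min_size:
            suspects.append((number, text))
    return suspects


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], encoding="utf-8") as handle:
        for number, text in find_suspect_boxes(handle):
            print(f"line {number}: {text}")
```

Punctuation can be legitimately small, so treat each hit as a candidate to inspect rather than a confirmed error.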

    Writing Merged Microfeat ...Warning: no protos/configs for { in

    Class->NumConfigs == this->fontset_table_.get(Class->font_set_id).size:Error:Assert failed:in file ..\classify\intproto.cpp, line 1312

    As has been reported elsewhere, Tesseract 3.01 supports only one image per font. It actually crashes when you try to use another image (exp2). If you need multiple images, you may want to use a multipage TIFF file. This way you can always push more images to an existing font without losing the previous coordinates. Generating a box file for the new TIFF will overwrite the existing one (which you have probably fixed manually), so I built a utility that backs up the previous box file and copies all values for the previous TIFF pages into the newly generated box file.
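
The backup-and-copy step could look roughly like the sketch below. It is not the utility mentioned above, only an illustration under two assumptions: each box line ends with a zero-based page number (`symbol left bottom right top page`), and new images are appended as new pages of the TIFF. The function names and the `.bak` suffix are hypothetical.

```python
import shutil
from pathlib import Path


def page_of(line):
    """Page index stored in the last field of a box-file line."""
    return int(line.split()[-1])


def merge_box_lines(old_lines, new_lines):
    """Keep hand-corrected lines for old pages; take new lines for new pages."""
    old_lines = [line for line in old_lines if line.strip()]
    new_lines = [line for line in new_lines if line.strip()]
    old_pages = {page_of(line) for line in old_lines}
    fresh = [line for line in new_lines if page_of(line) not in old_pages]
    return old_lines + fresh


def backup_box(box_path):
    """Copy the corrected box file aside before it gets regenerated."""
    box_path = Path(box_path)
    backup = box_path.with_name(box_path.name + ".bak")
    shutil.copy2(box_path, backup)
    return backup
```

A typical order would be: call `backup_box`, regenerate the box file, then write `merge_box_lines(backup_lines, regenerated_lines)` back over it.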

    read_params_file: Can't open batch.nochop

    The Windows executable package does not include the configs. To perform training you will need to copy the 'tessdata' folder from the source distribution to the same directory as tesseract.exe (the source has two folders under tessdata that we need: configs and tessconfigs).
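
Copying the two folders can also be scripted. A minimal sketch, assuming only the folder names given above; the function name and both directory arguments are hypothetical and need to point at your own source tree and install location.

```python
import shutil
from pathlib import Path

CONFIG_DIRS = ("configs", "tessconfigs")  # the two folders named above


def copy_training_configs(source_tessdata, target_tessdata):
    """Copy the config folders from the source tree next to tesseract.exe."""
    copied = []
    for name in CONFIG_DIRS:
        src = Path(source_tessdata) / name
        dst = Path(target_tessdata) / name
        if src.is_dir():
            # dirs_exist_ok needs Python 3.8+; existing files are overwritten.
            shutil.copytree(src, dst, dirs_exist_ok=True)
            copied.append(name)
    return copied
```

The return value lists what was actually copied, so a missing folder in the source tree is easy to notice.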

    tessdata_manager.SeekToStart(TESSDATA_INTTEMP):Error:Assert failed:in file adaptmatch.cpp, line 512 Segmentation fault

    You did not follow the documentation. Before unifying the files into traineddata you need to do this:
    "All you need to do now is collect together all (normproto, Microfeat, inttemp, pffmtable) the files and rename them with a lang. prefix..."

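The renaming step in the quoted instructions can be scripted. A minimal sketch, assuming the four training outputs named in the quote sit in one directory; the function name and the `hw` language code in the usage note are hypothetical, and the script only renames files (it does not run the final combine step).

```python
from pathlib import Path

# File names taken from the quoted instructions.
TRAINING_FILES = ("normproto", "Microfeat", "inttemp", "pffmtable")


def add_lang_prefix(directory, lang):
    """Rename e.g. 'inttemp' to '<lang>.inttemp'; return names still missing."""
    directory = Path(directory)
    missing = []
    for name in TRAINING_FILES:
        source = directory / name
        target = directory / f"{lang}.{name}"
        if source.exists():
            source.replace(target)  # overwrites an older <lang>.<name>
        elif not target.exists():
            missing.append(name)
    return missing
```

For example, `add_lang_prefix(".", "hw")` returns an empty list once all four files carry the `hw.` prefix; anything it returns is a file that an earlier training step did not produce.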


    blavatsky3 said...

    I get this error using Windows XP and Tesseract-OCR 3.01....
    How do I fix it or correct it ?

    Writing Merged Microfeat ...Warning: no protos/configs for { in

    Class->NumConfigs == this->fontset_table_.get(Class->font_set_id).size:Error:Assert failed:in file ..\classify\intproto.cpp, line 1312

    Yaron Naveh (MVP) said...

    I think you may be trying to use two fonts for the same language, which is not supported.