Limit characters in portable tesseract

3

I'm currently using the portable version of tesseract, integrated with Java, to recognize some characters, but I'm running into problems such as the following:

Some fields contain only a date, such as: 01/02/2013

The recognized text comes out as something like this: 0Il0S/S013

The output does not follow any pattern. Does anyone know whether it is possible to restrict recognition to a fixed set of characters, such as 0-9 and / ?

Note: I know this exists for C, but I have not yet found how to do it with the portable version.

    
asked by anonymous 27.04.2015 / 21:15

1 answer

1

I have only used tesseract on Linux, either directly from the command line or from scripts that invoke the command line to do the work ...

1) Create a configuration file named mydata containing the valid characters:

tessedit_char_whitelist 0123456789/-

2) Then invoke tesseract as:

tesseract f.png zzz mydata

This produces zzz.txt containing only digits, '/' and '-'.
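Since the question is about Java, the two steps above can also be driven from Java by launching the tesseract executable as an external process. The following is only a minimal sketch under some assumptions: the class name TesseractWhitelist and its helper methods are hypothetical names chosen for illustration, the executable is assumed to be reachable as "tesseract" (for the portable version, pass the full path to the executable instead), and it has not been tested against the portable build.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; none of these names come from tesseract itself.
public class TesseractWhitelist {

    // Step 1: write a config file containing a single whitelist setting.
    static Path writeWhitelistConfig(Path configFile, String allowedChars) throws IOException {
        String line = "tessedit_char_whitelist " + allowedChars + System.lineSeparator();
        return Files.write(configFile, line.getBytes(StandardCharsets.UTF_8));
    }

    // Step 2: build the command line "tesseract <image> <outputBase> <configFile>".
    static List<String> buildCommand(String tesseractExe, Path image, String outputBase, Path configFile) {
        return Arrays.asList(tesseractExe, image.toString(), outputBase, configFile.toString());
    }

    // Runs tesseract and returns the text it wrote to <outputBase>.txt.
    static String recognize(String tesseractExe, Path image, String outputBase, Path configFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tesseractExe, image, outputBase, configFile))
                .inheritIO() // show tesseract's own messages on the console
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        byte[] bytes = Files.readAllBytes(Paths.get(outputBase + ".txt"));
        return new String(bytes, StandardCharsets.UTF_8).trim();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java TesseractWhitelist <image>");
            return;
        }
        // Assumption: "tesseract" is on the PATH; otherwise use the full path to the executable.
        Path config = writeWhitelistConfig(Paths.get("mydata"), "0123456789/-");
        System.out.println(recognize("tesseract", Paths.get(args[0]), "zzz", config));
    }
}
```

Running it as "java TesseractWhitelist f.png" would write mydata, call tesseract with the same arguments as in step 2, and print the contents of zzz.txt. Going through an external process avoids depending on any particular Java binding, at the cost of one process launch per image.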

For good results it is worth investing in the quality (resolution) of the initial image ...

If the scope is wider than digits, it will probably be useful to specify the language as well (the -l option).

The Java interface, the C API, etc. presumably offer a way to define such "whitelists" as well.

There is also the possibility of retraining tesseract (I doubt it is justified in this case).

    
27.04.2015 / 23:44