How to capture the CMC7 code with the Tesseract API?

1

For context: I'm reading characters from images using Tess4J, the Java API for Tesseract. More specifically, the images are bank checks, from which I need to capture the CMC7 code. The problem is that the API cannot recognize the font used for the code. I did a lot of research and implemented the reading code, but without success. Details follow:

Image to read:

Code:

import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

public static void main(String[] args) {
    try {
        File imageFile = new File("D:/teste.png");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        instance.setLanguage("mcr"); // expects mcr.traineddata in the tessdata folder
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }
}

The language data file used is mcr.traineddata.

After reading the image above, I get the following output: d8d0225255dd5582251558825 8515812888828888888858811118580112691188655888 212858801168185865810165125812086510

What should I do?

    
asked by anonymous 02.06.2017 / 16:39

1 answer

-1

Some OCR engines do the hard work of processing images, which makes extracting their characters a relatively simple task. The best known is Tesseract, but it is not written in Java. For this reason, we will use Tess4J, a JNA wrapper that lets us call the engine's native methods from Java.

Link to download Tess4J:

link

  • Downloading Tess4J: go to the Tess4J project page and download the latest version.

  • Configuring the libraries: unzip the files below into the lib folder of your project:

    win32-x86/  win32-x86-64/  commons-io-2.4.jar  ghost4j-0.5.1.jar  jai_imageio.jar  jna-4.1.0.jar  junit-4.10.jar  log4j-1.2.17.jar  tess4j.jar

  • Also unpack the tessdata folder at the root of your project:

    tessdata/

  • Writing the image-reading code: as an example, I'll use a scanned page I found through Google Images.

    package br.com.danilotl.ocr;
    
    import java.io.File;
    import net.sourceforge.tess4j.*;
    
    public class ReadImage {
    
        public static void main(String[] args){ 
    
            File imageFile = new File("page.jpg");
            Tesseract instance = Tesseract.getInstance();
            instance.setLanguage("eng");
    
            try {
                String result = instance.doOCR(imageFile);
                System.out.println(result);
            } catch (TesseractException e) {
                System.err.println(e.getMessage());
            }
        }
    }
    

    Let's look at the main points of the above code:

    import java.io.File;
    import net.sourceforge.tess4j.*;
    

    Here we import java.io.File, which represents the image file, and the Tess4J classes, which we need in order to call the methods of its API.

    File imageFile = new File("page.jpg");
    

    Here we create a File object, passing the path of the image to its constructor. In this case, the page.jpg file is at the root of the project.
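A relative path such as "page.jpg" is resolved against the JVM's working directory (the user.dir system property), which is usually, but not always, the project root when the program is launched from an IDE. The sketch below prints where the file is actually looked up before any OCR is attempted. The class ImagePathCheck and its helpers resolve and describe are hypothetical names introduced for this illustration; they are not part of Tess4J.

```java
import java.io.File;

class ImagePathCheck {

    // Hypothetical helper: turns a possibly relative path into the absolute
    // path that new File(path) refers to (resolved against user.dir).
    static File resolve(String path) {
        return new File(path).getAbsoluteFile();
    }

    // Hypothetical helper: reports whether the file looks usable as OCR input.
    static String describe(File file) {
        if (!file.isFile()) {
            return "not found: " + file;
        }
        if (!file.canRead()) {
            return "not readable: " + file;
        }
        return "ok: " + file;
    }

    public static void main(String[] args) {
        System.out.println("working directory: " + System.getProperty("user.dir"));
        System.out.println(describe(resolve("page.jpg")));
    }
}
```

If the message starts with "not found", the image is not where the program expects it, and doOCR would fail for a reason unrelated to recognition quality.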

    Tesseract instance = Tesseract.getInstance();
    instance.setLanguage("eng");
    

    Here we get an instance of the Tesseract class and then set the language in which the text of our image is written, in this case English. (Since Tess4J 2.0, getInstance() is deprecated in favor of the default constructor, new Tesseract(), as used in the question.) If you need to read other languages (such as Portuguese, which has accented characters), download the corresponding language file from the Downloads section of the Tesseract page, unzip it into the tessdata folder, and set that language in your code.
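Because setLanguage only names a language, a missing data file tends to surface later as a failure during OCR. A minimal sketch, assuming the data files follow the <language>.traineddata naming seen in the question (mcr.traineddata) and live in a tessdata folder, checks for the file up front. TessdataCheck, trainedDataFile and isLanguageInstalled are hypothetical names for this illustration and are not part of Tess4J.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class TessdataCheck {

    // Hypothetical helper: expected location of <language>.traineddata
    // inside the given tessdata directory.
    static Path trainedDataFile(Path tessdataDir, String language) {
        return tessdataDir.resolve(language + ".traineddata");
    }

    // Hypothetical helper: true only if the data file exists as a regular file.
    static boolean isLanguageInstalled(Path tessdataDir, String language) {
        return Files.isRegularFile(trainedDataFile(tessdataDir, language));
    }

    public static void main(String[] args) {
        Path tessdata = Paths.get("tessdata"); // assumed to be at the project root
        for (String language : new String[] {"eng", "mcr"}) {
            String status = isLanguageInstalled(tessdata, language)
                    ? "found"
                    : "missing: " + trainedDataFile(tessdata, language);
            System.out.println(language + " -> " + status);
        }
    }
}
```

Running this from the project root before calling setLanguage("mcr") or setLanguage("eng") shows whether the corresponding file is in place.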

    try {
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (TesseractException e) {
        System.err.println(e.getMessage());
    }
    

    Finally, we read the image with the doOCR() method, passing the image file as an argument, and then print the result to the console. Comparing the output with the original image, the reading is very accurate and contains very few errors.

    This information comes from the link below:
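When the text of interest is a numeric code, as in the question, the string returned by doOCR can be post-processed with plain Java. The sketch below only splits a result string into runs of ASCII digits, so misread characters become easy to spot. It does not improve recognition itself and makes no assumption about the CMC7 layout. OcrDigits and digitRuns are hypothetical names for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OcrDigits {

    // One or more consecutive ASCII digits.
    private static final Pattern DIGIT_RUN = Pattern.compile("[0-9]+");

    // Hypothetical helper: returns every run of digits in the OCR text,
    // in order of appearance; any other character acts as a separator.
    static List<String> digitRuns(String ocrText) {
        List<String> runs = new ArrayList<>();
        Matcher matcher = DIGIT_RUN.matcher(ocrText);
        while (matcher.find()) {
            runs.add(matcher.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Beginning of the output quoted in the question.
        String result = "d8d0225255dd5582251558825 8515812888828888888858811118580112691188655888";
        for (String run : digitRuns(result)) {
            System.out.println(run + " (" + run.length() + " digits)");
        }
    }
}
```

Letters such as the "d" characters in the quoted output break the digit runs, which makes it visible where the engine failed to map a glyph to a digit.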

    link

        
    16.06.2017 / 15:53