How to capture the CMC7 code with the Tesseract API?

Question

For context: I'm reading characters from images with Tess4J, the Java API for Tesseract. More specifically, the images are bank checks, and I need to capture the CMC7 code. The problem is that the API does not recognize the font used for the code. I did a lot of research and implemented the code that reads the image, but without success. Details follow:

Image to read:

Code:

import java.io.File;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;

public static void main(String[] args) {
    try {
        File imageFile = new File("D:/teste.png");
        ITesseract instance = new Tesseract(); // JNA Interface Mapping
        instance.setLanguage("mcr"); // uses mcr.traineddata
        String result = instance.doOCR(imageFile);
        System.out.println(result);
    } catch (Exception e) {
        System.err.println(e.getMessage());
    }
}

The trained data file used is mcr.traineddata.

After reading the image above, I get the following output: d8d0225255dd5582251558825 8515812888828888888858811118580112691188655888 212858801168185865810165125812086510

So what do I do?

java ocr tesseract

asked by anonymous 02.06.2017 / 16:39

1 answer

score -1 · Answer 1

Some OCR engines do the hard work of processing images, which makes extracting their characters a relatively simple task. The best known is Tesseract, but it is not written in Java. For this reason, we will use a JNA wrapper called Tess4J, which lets us call the engine's native methods from Java.

Link to download Tess4J:

link

Downloading Tess4J

Go to the Tess4J project page and download the latest version.

Configuring the libraries

Unzip the files below into the lib folder of your project:

win32-x86/
win32-x86-64/
commons-io-2.4.jar
ghost4j-0.5.1.jar
jai_imageio.jar
jna-4.1.0.jar
junit-4.10.jar
log4j-1-2-17.jar
tess4j.jar

Also unzip the tessdata folder into the root of your project:

tessdata/

Writing the image reading code

As an example, I'll use a scanned page I found through Google Images.

package br.com.danilotl.ocr;

import java.io.File;
import net.sourceforge.tess4j.*;

public class ReadImage {

    public static void main(String[] args){ 

        File imageFile = new File("page.jpg");
        Tesseract instance = Tesseract.getInstance();
        instance.setLanguage("eng");

        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}

Let's look at the main points of the above code:

import java.io.File;
import net.sourceforge.tess4j.*;

Here we import the java.io.File class, which represents the image file, and the Tess4J classes, which we need in order to call the methods of its API.

File imageFile = new File("page.jpg");

Here we create a File object, passing the path of the image to its constructor. In this case, the page.jpg file is at the root of the project.

Tesseract instance = Tesseract.getInstance();
instance.setLanguage("eng");

Here we get an instance of the Tesseract class and then set the language of the text in the image, which is English in this case. To read other languages (such as Portuguese, which has accented characters), download the file for that language from the Downloads section of the Tesseract page, unzip it into the tessdata folder, and set the corresponding language in your code. Note that from Tess4J 2.0 onward, getInstance() is deprecated in favor of the default constructor, new Tesseract(), which is what the code in the question uses.
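Since setLanguage only works when the matching data file is present, it can help to check the tessdata folder before calling doOCR. The sketch below uses only the Java standard library and relies on the naming convention of a language code followed by .traineddata (for example eng.traineddata, or mcr.traineddata from the question). The class TessdataCheck and its methods are hypothetical helpers for illustration, not part of Tess4J.

```java
import java.io.File;

// Hypothetical helper (not part of Tess4J): verifies that the data file
// for a language code exists before OCR is attempted.
class TessdataCheck {

    // Language data files are named "<language>.traineddata".
    static File trainedDataFile(File tessdataDir, String language) {
        return new File(tessdataDir, language + ".traineddata");
    }

    // True only if the data file exists and is a regular file.
    static boolean isLanguageAvailable(File tessdataDir, String language) {
        return trainedDataFile(tessdataDir, language).isFile();
    }

    public static void main(String[] args) {
        // Assumes the tessdata folder sits at the project root, as described above.
        File tessdata = new File("tessdata");
        for (String language : new String[] {"eng", "mcr"}) {
            File file = trainedDataFile(tessdata, language);
            System.out.println(file.getPath() + " found: "
                    + isLanguageAvailable(tessdata, language));
        }
    }
}
```

If the file for the chosen language is missing, copy it into tessdata before passing that language code to setLanguage.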

try {
    String result = instance.doOCR(imageFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}

Finally, we read the image with the doOCR() method, passing the image file as an argument, and print the result to the console. For this sample page, the reading is very accurate and contains very few errors.
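When the reading is less accurate, one common thing to try is cleaning the image before OCR. The sketch below converts an image to pure black and white using only the Java standard library. The class ImageCleanup, the output name page-bw.png, and the threshold of 128 are illustrative assumptions, and nothing here guarantees a better result for a CMC7 line.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper (not part of Tess4J): simple fixed-threshold binarization.
class ImageCleanup {

    // Returns a black-and-white copy: pixels whose average brightness is
    // below the threshold become black, all others become white.
    static BufferedImage binarize(BufferedImage source, int threshold) {
        BufferedImage result = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < source.getHeight(); y++) {
            for (int x = 0; x < source.getWidth(); x++) {
                int rgb = source.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;
                int gray = (red + green + blue) / 3;
                result.setRGB(x, y, gray < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        File input = new File("page.jpg");
        BufferedImage image = input.isFile() ? ImageIO.read(input) : null;
        if (image == null) {
            System.err.println("Could not read image: " + input.getPath());
            return;
        }
        // 128 is an arbitrary starting threshold; adjust it for the scan at hand.
        ImageIO.write(binarize(image, 128), "png", new File("page-bw.png"));
    }
}
```

The cleaned image can be saved and read as before, or passed straight to doOCR, which in Tess4J also accepts a BufferedImage.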

This information is contained in the link below:

link