Several OCR engines already handle the hard work of processing images, which makes extracting the characters they contain a relatively simple task. The best known is Tesseract, but it is not written in Java. For this reason, we will use Tess4J, a JNA wrapper that lets us call the engine's native methods from Java.
Link to download Tess4J:
link
Downloading Tess4J
Go to the Tess4J project page and download the latest version.
Configuring the libraries
Extract the following folders and files into the lib folder of your project:
win32-x86/
win32-x86-64/
commons-io-2.4.jar
ghost4j-0.5.1.jar
jai_imageio.jar
jna-4.1.0.jar
junit-4.10.jar
log4j-1.2.17.jar
tess4j.jar
Also extract the tessdata folder into the root of your project:
tessdata/
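Before moving on, it can help to confirm that the files ended up where the steps above expect them. The following is only a minimal sketch: ProjectLayoutCheck is a hypothetical helper (not part of Tess4J), it checks just three of the entries listed above, and it assumes the program is started from the project root.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tess4J: reports which expected entries are missing.
final class ProjectLayoutCheck {

    // A few entries taken from the lists above, relative to the project root.
    static final String[] EXPECTED = {
        "lib/tess4j.jar",
        "lib/jna-4.1.0.jar",
        "tessdata"
    };

    // Returns the expected entries that do not exist under the given root.
    static List<String> missingEntries(Path projectRoot) {
        List<String> missing = new ArrayList<>();
        for (String entry : EXPECTED) {
            if (!Files.exists(projectRoot.resolve(entry))) {
                missing.add(entry);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // "." assumes the working directory is the project root.
        List<String> missing = missingEntries(Paths.get("."));
        if (missing.isEmpty()) {
            System.out.println("Project layout looks complete.");
        } else {
            System.out.println("Missing entries: " + missing);
        }
    }
}
```

If an entry is reported as missing, the most likely cause is that the archive was extracted one folder level too deep.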
Writing the image-reading code
As an example, I'll use a scanned page I found through Google Images.
package br.com.danilotl.ocr;

import java.io.File;
import net.sourceforge.tess4j.*;

public class ReadImage {

    public static void main(String[] args) {
        File imageFile = new File("page.jpg");
        Tesseract instance = Tesseract.getInstance();
        instance.setLanguage("eng");
        try {
            String result = instance.doOCR(imageFile);
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }
}
Let's look at the main points of the above code:
import java.io.File;
import net.sourceforge.tess4j.*;
Here we import java.io.File, which represents the image file, and the Tess4J classes, which we need in order to call the methods of its API.
File imageFile = new File("page.jpg");
Here we create a File object, passing the path of the image to its constructor. In this case, the page.jpg file is at the root of the project.
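A relative path such as "page.jpg" is resolved against the directory the JVM was started from, which is the project root only if the program is launched from there (as most IDEs do by default). The sketch below uses only java.io.File to show where the path actually points; ImagePathCheck and isReadableFile are hypothetical names chosen for illustration.

```java
import java.io.File;

// Hypothetical helper: shows where a relative image path resolves to.
final class ImagePathCheck {

    // True when the path points to an existing regular file that can be read.
    static boolean isReadableFile(File file) {
        return file.isFile() && file.canRead();
    }

    public static void main(String[] args) {
        File imageFile = new File("page.jpg");
        // Relative paths are resolved against the "user.dir" system property.
        System.out.println("Working directory: " + System.getProperty("user.dir"));
        System.out.println("Resolved path:     " + imageFile.getAbsolutePath());
        System.out.println("Readable file:     " + isReadableFile(imageFile));
    }
}
```

Running a check like this first makes a misplaced image easy to tell apart from a problem in the OCR call itself.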
Tesseract instance = Tesseract.getInstance();
instance.setLanguage("eng");
Here we get an instance of the Tesseract class and then set the language in which the text of our image is written, in this case English. If you need to read other languages (such as Portuguese, which has accented characters), download the corresponding language file from the Downloads section of the Tesseract page, extract it into the tessdata folder, and set the matching language code in your code (for example, "por" for Portuguese).
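Tesseract language data is stored as one file per language, named after the language code with the .traineddata suffix (eng.traineddata, por.traineddata, and so on). As a rough sketch, the hypothetical LanguageData helper below lists the codes found in a tessdata folder, so we can see which values are safe to pass to setLanguage. It assumes the folder is named tessdata and sits in the working directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Tess4J: inspects the tessdata folder.
final class LanguageData {

    static final String SUFFIX = ".traineddata";

    // File name that should exist in tessdata for a given language code.
    static String fileNameFor(String languageCode) {
        return languageCode + SUFFIX;
    }

    // Language codes for which a *.traineddata file exists; empty if the folder is missing.
    static List<String> availableLanguages(Path tessdataDir) {
        List<String> codes = new ArrayList<>();
        if (!Files.isDirectory(tessdataDir)) {
            return codes;
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tessdataDir, "*" + SUFFIX)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                codes.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        } catch (IOException e) {
            System.err.println("Could not read " + tessdataDir + ": " + e.getMessage());
        }
        Collections.sort(codes);
        return codes;
    }

    public static void main(String[] args) {
        System.out.println("Expected file for Portuguese: " + fileNameFor("por"));
        System.out.println("Installed languages: " + availableLanguages(Paths.get("tessdata")));
    }
}
```

If "por" shows up in the list, replacing "eng" with "por" in the setLanguage call should be enough to read Portuguese text.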
try {
    String result = instance.doOCR(imageFile);
    System.out.println(result);
} catch (TesseractException e) {
    System.err.println(e.getMessage());
}
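Finally, we read the image with the doOCR() method, passing the image file as an argument, and then print the result to the console. If the engine fails to process the image, a TesseractException is thrown and we print its message to the error stream. Comparing the output with the original page, the reading is very accurate and contains very few errors.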
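To put a number on "very few errors", one common approach is to compare the OCR output with a hand-typed reference using the Levenshtein (edit) distance and divide by the reference length, which gives a character error rate. This is not something measured in this article; the sketch below is only an illustration, with hypothetical names (OcrAccuracy, characterErrorRate) and made-up sample strings.

```java
// Hypothetical helper: measures how far OCR output is from a reference text.
final class OcrAccuracy {

    // Levenshtein distance: minimum number of single-character
    // insertions, deletions and substitutions turning a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Edit distance divided by the reference length (0.0 means a perfect match).
    static double characterErrorRate(String reference, String ocrOutput) {
        if (reference.isEmpty()) {
            return ocrOutput.isEmpty() ? 0.0 : 1.0;
        }
        return (double) levenshtein(reference, ocrOutput) / reference.length();
    }

    public static void main(String[] args) {
        // Made-up strings; in practice ocrOutput would be the result of doOCR.
        String reference = "The quick brown fox";
        String ocrOutput = "The qu1ck brown f0x";
        System.out.println("Edit distance: " + levenshtein(reference, ocrOutput));
        System.out.println("Character error rate: " + characterErrorRate(reference, ocrOutput));
    }
}
```

A rate close to zero on a page you have typed by hand is a reasonable sign that the engine and language data suit your images.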
This information is contained in the link below:
link