Recently I started to develop a small executable jar that converts PDF to text files and it will work in a Windows environment.
Using TESS4J 3.3.1, I developed the following process:
A) The user can choose to insert a PDF or an image;
B) If it is a PDF, the system will convert to image using GHOST4J;
C) The image will be converted to text using TESS4J.
For most of the files tested the program worked correctly, but when I inserted an invoice file (in PDF) with a logo, the program (at point C) can not convert 10% / p>
import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import javax.imageio.ImageIO;
import org.ghost4j.document.PDFDocument;
import org.ghost4j.renderer.SimpleRenderer;
public class PDFToImage {
private static final SimpleDateFormat sdf = new SimpleDateFormat("ddMMyyyy_HHmmss");
public static List<File> convert(File filePDF) throws Exception{
PDFDocument document = new PDFDocument();
try {
document.load( new FileInputStream( filePDF ) );
} catch (IOException e) {
throw e;
}
SimpleRenderer renderer = new SimpleRenderer();
renderer.setResolution( 300 );
List<Image> renderedImageList = null;
try {
renderedImageList = renderer.render(document);
} catch (Exception e) {
throw e;
}
List<File> fileImageList = new ArrayList<File>();
try {
for( Image i : renderedImageList ){
File f = new File( "C:\Users\story\Desktop\ocr_test" + File.separator + filePDF.getName() + "_" + renderedImageList.indexOf( i ) + sdf.format( new Date() ) + ".png" );
ImageIO.write((RenderedImage) i, "png", f);
fileImageList.add( f );
}
} catch (Exception e) {
throw e;
}
return fileImageList;
}
}
Test file:
import java.io.File;
import java.util.List;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
public class Basic {
// Teste: A, B e C
// public static void main(String[] args) throws Exception {
// File pdfFile = new File("C:\Users\story\Desktop\ocr_test\source_pdf.pdf");
//
// List<File> imageList = PDFToImage.convert(pdfFile);
//
// ITesseract instance = new Tesseract();
// instance.setLanguage("eng");
// instance.setDatapath("C:\Users\story\Desktop\ocr_test\tessdata");
//
// for( File i : imageList ){
// try {
// String result = instance.doOCR( i );
// System.out.println(result);
// } catch (TesseractException e) {
// System.err.println(e.getMessage());
// }
// }
// }
// Teste: B e C
public static void main(String[] args) throws Exception {
ITesseract instance = new Tesseract();
instance.setLanguage("eng");
instance.setDatapath("C:\Users\story\Desktop\ocr_test\tessdata");
try {
String result = instance.doOCR( new File("C:\Users\story\Desktop\ocr_test\source_png_split.png") );
System.out.println(result);
} catch (TesseractException e) {
System.err.println(e.getMessage());
}
}
}
Image of problem PDF:
If I remove (in the same paint) this logo, the image is perfectly converted! In this case I have the doubts:
1) In TESS4J: is there a way to prevent this error?
2) In GHOST4J: is there any way I can not convert this image in the PDF to the final image?