Problems reading an image PDF with TESS4J


I recently started developing a small executable JAR that converts PDF files to text files; it will run in a Windows environment.

Using TESS4J 3.3.1, I developed the following process:

A) The user supplies either a PDF or an image;

B) If it is a PDF, the system converts it to an image using GHOST4J;

C) The image is converted to text using TESS4J.
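For step A, one way to decide between the PDF branch and the image branch is to look at the first bytes of the file instead of trusting its extension. The following is only a minimal sketch: the class `InputKind` and its method names are hypothetical and not part of the original program, and it checks for the `%PDF-` header only at the very start of the file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper, not part of the original program.
class InputKind {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the given leading bytes begin with the PDF header "%PDF-". */
    static boolean startsWithPdfHeader(byte[] head) {
        if (head == null || head.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, PDF_MAGIC.length), PDF_MAGIC);
    }

    /** Reads the first bytes of the file and checks them for the PDF header. */
    static boolean isPdf(Path file) throws IOException {
        byte[] head = new byte[PDF_MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        return total == head.length && startsWithPdfHeader(head);
    }
}
```

With something like this, the program could call `PDFToImage.convert(...)` only when `InputKind.isPdf(...)` returns true and pass any other file straight to step C.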

The program worked correctly for most of the files I tested, but when I supplied an invoice file (in PDF) with a logo, the program (at step C) could not convert even 10% of it.

import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import javax.imageio.ImageIO;

import org.ghost4j.document.PDFDocument;
import org.ghost4j.renderer.SimpleRenderer;

public class PDFToImage {

    private static final SimpleDateFormat sdf = new SimpleDateFormat("ddMMyyyy_HHmmss");

    public static List<File> convert(File filePDF) throws Exception{
        PDFDocument document = new PDFDocument();
        try {
            document.load( new FileInputStream( filePDF ) );
        } catch (IOException e) {
            throw e;
        }

        SimpleRenderer renderer = new SimpleRenderer();
        renderer.setResolution( 300 );

        List<Image> renderedImageList = null;
        try {
            renderedImageList = renderer.render(document);
        } catch (Exception e) {
            throw e;
        }

        List<File> fileImageList = new ArrayList<File>();
        try {
            for( Image i : renderedImageList ){
                File f = new File( "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + filePDF.getName() + "_" + renderedImageList.indexOf( i ) + sdf.format( new Date() ) + ".png" );
                ImageIO.write((RenderedImage) i, "png", f);
                fileImageList.add( f );
            }
        } catch (Exception e) {
            throw e;
        }

        return fileImageList;
    }

}

Test file:

import java.io.File;
import java.util.List;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class Basic  {

    // Test: A, B and C
//  public static void main(String[] args) throws Exception {
//      File pdfFile = new File("C:\\Users\\story\\Desktop\\ocr_test\\source_pdf.pdf");
//
//      List<File> imageList = PDFToImage.convert(pdfFile);
//
//      ITesseract instance = new Tesseract();
//      instance.setLanguage("eng");
//      instance.setDatapath("C:\\Users\\story\\Desktop\\ocr_test\\tessdata");
//
//      for( File i : imageList ){
//          try {
//              String result = instance.doOCR( i );
//              System.out.println(result);
//          } catch (TesseractException e) {
//              System.err.println(e.getMessage());
//          }
//      }
//  }

    // Test: B and C
    public static void main(String[] args) throws Exception {
        ITesseract instance = new Tesseract();
        instance.setLanguage("eng");
        instance.setDatapath("C:\\Users\\story\\Desktop\\ocr_test\\tessdata");
        try {
            String result = instance.doOCR( new File("C:\\Users\\story\\Desktop\\ocr_test\\source_png_split.png") );
            System.out.println(result);
        } catch (TesseractException e) {
            System.err.println(e.getMessage());
        }
    }

}

Image of problem PDF:

If I remove this logo (simply using Paint), the image is converted perfectly! So I have two questions:

1) In TESS4J: is there a way to prevent this error?

2) In GHOST4J: is there a way to leave this embedded image out when the PDF is rendered to the final image?

    
asked by anonymous 11.04.2017 / 14:45

1 answer


I managed to solve the problem! After searching a bit more on Google, I modified the class PDFToImage.java and was able to solve the problem in two different ways:

package core;

import java.awt.Image;
import java.awt.image.RenderedImage;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.ResourceBundle;

import javax.imageio.ImageIO;

import org.ghost4j.document.PDFDocument;
import org.ghost4j.renderer.SimpleRenderer;

import util.Utils;

public class PDFToImage {

    private static final ResourceBundle properties = ResourceBundle.getBundle( "properties/configuration" );
    private static final SimpleDateFormat sdf = new SimpleDateFormat("ddMMyyyy_HHmmss");

    @SuppressWarnings("rawtypes")
    public static List<File> convert(File preFilePDF, Class clazz) throws Exception {
        // Start of added section
        File filePDF = preFilePDF;

        if( Boolean.parseBoolean( properties.getString("PDF_STAMP_IMAGE") ) ){
            filePDF = PDFStamper.convert( preFilePDF, clazz);
        }

        if( Boolean.parseBoolean( properties.getString("PDF_REMOVE_IMAGE") ) ){
            filePDF = PDFRemoveImage.convert( preFilePDF );
        }
        // End of added section

        PDFDocument document = new PDFDocument();
        try {
            document.load( new FileInputStream( filePDF ) );
        } catch (IOException e) {
            throw e;
        }

        SimpleRenderer renderer = new SimpleRenderer();
        renderer.setResolution( 300 );

        List<Image> renderedImageList = null;
        try {
            renderedImageList = renderer.render(document);
        } catch (Exception e) {
            throw e;
        }

        if( !filePDF.canExecute()
                && !filePDF.canRead() ){
            throw new Exception("No permission on folder "+filePDF.getAbsolutePath());
        }

        List<File> fileImageList = new ArrayList<File>();
        try {
            for( Image i : renderedImageList ){
                File f = new File( "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + filePDF.getName() + "_" + renderedImageList.indexOf( i ) + sdf.format( new Date() ) + ".png" );
                ImageIO.write((RenderedImage) i, "png", f);
                fileImageList.add( f );
            }
        } catch (Exception e) {
            throw e;
        }

        return fileImageList;
    }

}

1) Removing all PDF images using PDFBox

At first this method looked like the final solution, since the output PDF was perfect. However, when the resulting PDF was converted to an image through GHOST4J, the file lost its settings and formatting: some important fields such as CPF/CNPJ were missing, and so were all special characters.

package core;

import java.io.File;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

public class PDFRemoveImage {

    private static final SimpleDateFormat sdf = new SimpleDateFormat("ddMMyyyy_HHmmss");

    @SuppressWarnings("rawtypes")
    public static File convert(File in) throws Exception {
        String out = "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + in.getName() + "_" + sdf.format( new Date() ) + ".pdf";

        PDDocument doc = PDDocument.load(in);

        List pages = doc.getDocumentCatalog().getAllPages();
        for( int i=0; i<pages.size(); i++ ) {
            PDPage page = (PDPage)pages.get( i );

            COSDictionary newDictionary = new COSDictionary(page.getCOSDictionary());

            PDFStreamParser parser = new PDFStreamParser(page.getContents());
            parser.parse();
            List tokens = parser.getTokens();
            List newTokens = new ArrayList();
            for(int j=0; j<tokens.size(); j++) {
                Object token = tokens.get( j );

                if( token instanceof PDFOperator ) {
                    PDFOperator op = (PDFOperator)token;
                    if( op.getOperation().equals( "Do") ) {
                        COSName name = (COSName)newTokens.remove( newTokens.size() -1 );
                        deleteObject(newDictionary, name);
                        System.out.println( name.getName() );
                        continue;
                    }
                }
                newTokens.add( token );
            }
            PDStream newContents = new PDStream( doc );
            ContentStreamWriter writer = new ContentStreamWriter( newContents.createOutputStream() );
            writer.writeTokens( newTokens );
            newContents.addCompression();

            page.setContents( newContents );

            PDResources newResources = new PDResources(newDictionary);
            page.setResources(newResources);
        }

        doc.save(out);
        doc.close();

        return new File( out );
    }

    private static boolean deleteObject(COSDictionary d, COSName name) {
        for(COSName key : d.keySet()) {
            if( name.equals(key) ) {
                d.removeItem(key);
                return true;
            }
            COSBase object = d.getDictionaryObject(key); 
            if(object instanceof COSDictionary) {
                if( deleteObject((COSDictionary)object, name) ) {
                    return true;
                }
            }
        }
        return false;
    }
}
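Note that the loop above drops every `Do` operator it finds. In PDF, `Do` paints any XObject, which includes form XObjects as well as images, so this removes more than just pictures. Whether that is what caused the lost characters in my case I cannot say for sure. As a rough sketch of a narrower filter, the version below removes a `name Do` pair only when the name is in a set of known image names. It is a simplified model: tokens are plain strings standing in for the PDFBox token objects, the class `DoFilterSketch` is hypothetical, and collecting the image names from the page resources is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified model: strings stand in for PDFBox's COSName / PDFOperator tokens.
class DoFilterSketch {

    /**
     * Copies the token list, dropping "<name> Do" pairs only when <name>
     * is one of the given image names. Every other token is kept as-is.
     */
    static List<String> dropImageInvocations(List<String> tokens, Set<String> imageNames) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.equals("Do") && !kept.isEmpty()) {
                String operand = kept.get(kept.size() - 1);
                if (imageNames.contains(operand)) {
                    kept.remove(kept.size() - 1); // discard the operand as well
                    continue;                     // and skip the Do operator itself
                }
            }
            kept.add(token);
        }
        return kept;
    }
}
```

For example, with the tokens `q /Im0 Do Q /Fm0 Do` and `/Im0` as the only image name, the `/Im0 Do` pair is dropped while `/Fm0 Do` stays in the content stream.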

2) Placing an image on top of the PDF using iText

After some time, I came up with this solution, which places an image on top of the problem image. I opted for a black square, and the rest of the program worked perfectly!

package core;

import java.io.File;
import java.io.FileOutputStream;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.ResourceBundle;

import com.itextpdf.text.Image;
import com.itextpdf.text.pdf.PdfContentByte;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.PdfStamper;

public class PDFStamper {

    private static final SimpleDateFormat sdf = new SimpleDateFormat("ddMMyyyy_HHmmss");
    private static final ResourceBundle properties = ResourceBundle.getBundle("properties.configuration");

    @SuppressWarnings("rawtypes")
    public static File convert(File in, Class clazz) throws Exception {
        File out = new File( "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + in.getName() + "_" + sdf.format( new Date() ) + ".pdf" );
        try {
            PdfReader pdfReader = new PdfReader( in.getAbsolutePath() );

            PdfStamper pdfStamper = new PdfStamper(pdfReader, new FileOutputStream(out));

            Image image = Image.getInstance( "C:\\Users\\story\\Desktop\\ocr_test" + File.separator + "replacer.png" );
            for(int i=1; i<= pdfReader.getNumberOfPages(); i++){
                PdfContentByte content = pdfStamper.getOverContent(i);
                if( properties.getString("PDF_STAMP_METHOD").equals("SIMPLE") ){
                    image.setAbsolutePosition(40f, 725f);
                } else if( properties.getString("PDF_STAMP_METHOD").equals("TEMPLATE") ){
                    image.setAbsolutePosition(0f, 0f);
                }
                content.addImage(image);
            }

            pdfStamper.close();

            return out;
        } catch (Exception e) {
            e.printStackTrace();
            throw e;
        }
    }
}

It's worth noting that I adopted the second option as the definitive one, but with a configuration setting that anyone using it should evaluate for their own case. I only had a problem with a single image at a fixed position, but if you need to read several files with different layouts, you can use templates by creating a replacer.png the size of your PDF page.
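A variation I have not used in the program, shown here only as a sketch, is to cover the same region directly on the rendered image before OCR instead of stamping the PDF. It uses plain `java.awt` rather than iText. The class `RegionMask` and its methods are hypothetical, it assumes the rendered page is available as a `BufferedImage`, and the page height and the size of the covered area must be supplied by the caller. The conversion relies on one PDF point being 1/72 inch and on PDF coordinates starting at the bottom-left corner, while image pixels start at the top-left.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of the original program.
class RegionMask {

    /** Converts a rectangle in PDF points (origin bottom-left) to pixels (origin top-left). */
    static Rectangle toPixels(float xPt, float yPt, float wPt, float hPt,
                              float pageHeightPt, int dpi) {
        float scale = dpi / 72f; // one PDF point is 1/72 inch
        int x = Math.round(xPt * scale);
        int y = Math.round((pageHeightPt - yPt - hPt) * scale); // flip the y axis
        int w = Math.round(wPt * scale);
        int h = Math.round(hPt * scale);
        return new Rectangle(x, y, w, h);
    }

    /** Paints the given pixel region in a solid colour, modifying the image in place. */
    static void mask(BufferedImage image, Rectangle region, Color fill) {
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(fill);
            g.fillRect(region.x, region.y, region.width, region.height);
        } finally {
            g.dispose();
        }
    }
}
```

For the position used in the SIMPLE method above, a call could look like `RegionMask.toPixels(40f, 725f, 100f, 60f, 842f, 300)`, where the 100 x 60 point size and the 842 point page height are placeholder values, not measurements from my invoice.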

Comments:

  • As an improvement, I would like to implement a way to classify the PDF files so that the right template can be applied to each one;
  • Or try to improve the use of PDFBox by removing only the images, without side effects on the PDF or the process;
  • I intend to make the complete code available on GitHub soon; when I do, I'll leave the link here.
12.04.2017 / 14:40