How to read tables from a PDF?

0

I am trying to read tables from a PDF file using iTextSharp. I found many answers that suggest using LocationTextExtractionStrategy, but the position of my table can vary across the PDF pages.

Does anyone have any idea how I can solve this problem?

    
asked by anonymous 28.08.2016 / 16:11

1 answer

-1

One way to read a table from a PDF is a two-step approach:

  • Extract the PDF contents as plain text;
  • Interpret the plain text with some tool.

Of course, this solution only works if the PDF is text-based, i.e. its content is not embedded as an image.

    With the pdfbox.jar library you can extract the text by passing "ExtractText" as the command, followed by the input and output paths. For example:

    java -jar pdfbox.jar ExtractText C:\CAMINHO\ARQUIVO\PDF\relatorio.pdf c:\Caminho\Arquivo\Saida\Saida.txt
    

    Each line in the PDF becomes a line in the text file, and so does each row of the table.

    With the output file in hand, you can parse it using Pattern (java.util.regex.Pattern), which is very useful for this purpose. This way you can read the tables from the PDF files.

    Sample PDF file:

      

    Text output:  
    Supervisor:  
    Lanofu Silva  
    Date  06/09/2018  
    Product ID Date Quantity  
    3 Allen Screw 6" 04/09/2018 300   
    9 Axle 127 Revest. 9/4/2018 500   
    15 Profile 3 12x15 09/05/2018 400   
    72 Metal case 15x15x5 09/02/2018 100   
    70 Helical Gear 1" Nylon 09/01/2048 100   
    45 Heline Drone 5H-12BR 02/09/2018 130

    An example Pattern to recognize each production entry in the table:

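    import java.nio.file.Files;
    import java.util.regex.Pattern;

    // Matches a table row: numeric ID, description, date with two-digit day and month, quantity
    Pattern padraoLinha = Pattern.compile("\\d+\\s.+\\s\\d{2}/\\d{2}/\\d{4}\\s\\d+");
    int countEntradaProducao = 0;
    // arquivoTexto is the java.io.File written by ExtractText
    for (String linha : Files.readAllLines(arquivoTexto.toPath()))
    {
        // trim() drops trailing spaces, which would otherwise make matches() fail
        if (padraoLinha.matcher(linha.trim()).matches())
        {
            countEntradaProducao++;
            // Do something with the row
        }
    }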
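
To go from counting rows to reading their values, the same kind of pattern can be given capturing groups. The following is only a minimal sketch, not part of the original answer: the class name TableRowParser, the methods parseRow and parseTable, and the column order (ID, product, date, quantity) are assumptions based on the sample output above. The date part is also relaxed to \d{1,2} so that dates written without leading zeros are accepted. Because every line is tested independently, the table is found wherever it appears in the extracted text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits extracted text lines into table fields.
final class TableRowParser {

    // Groups: 1 = ID, 2 = product description, 3 = date, 4 = quantity.
    // Assumption: \d{1,2} instead of \d{2}, so "9/4/2018" is accepted too.
    private static final Pattern ROW = Pattern.compile(
            "(\\d+)\\s+(.+)\\s+(\\d{1,2}/\\d{1,2}/\\d{4})\\s+(\\d+)");

    /** Returns {id, product, date, quantity}, or null if the line is not a table row. */
    static String[] parseRow(String line) {
        // trim() removes the trailing spaces left by the text extraction
        Matcher m = ROW.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim(), m.group(3), m.group(4) };
    }

    /** Keeps only the lines that look like table rows, in their original order. */
    static List<String[]> parseTable(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : lines) {
            String[] row = parseRow(line);
            if (row != null) {
                rows.add(row);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // A few lines taken from the sample output; header lines are skipped
        List<String> lines = List.of(
                "Supervisor:",
                "Date  06/09/2018",
                "9 Axle 127 Revest. 9/4/2018 500   ",
                "72 Metal case 15x15x5 09/02/2018 100   ");
        for (String[] row : parseTable(lines)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

In practice the list of lines would come from Files.readAllLines on the file written by ExtractText. Returning null for non-matching lines keeps the sketch short; an Optional would work just as well.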
    
        
    06.09.2018 / 17:10