How to read tables from a pdf?

Question

#1 by (-1 votes)

I am trying to read tables from a PDF file using iTextSharp. I found many answers that suggest using LocationTextExtractionStrategy, but the position of my table can vary across the PDF pages.

Does anyone have any idea how I can solve this problem?

c# itextsharp

asked by anonymous 28.08.2016 / 16:11

1 answer

score -1 · Answer 1

One way to read a table from a PDF is a two-step approach:

Extract the PDF content as plain text;

Parse the plain text with some tool.

Of course, this solution only works if the PDF is text-based, i.e. its content is not embedded as an image.

With the PDFBox library (pdfbox.jar) you can extract the text from the command line by passing the "ExtractText" command followed by the input PDF and the output text file. For example:

java -jar pdfbox.jar ExtractText C:\CAMINHO\ARQUIVO\PDF\relatorio.pdf c:\Caminho\Arquivo\Saida\Saida.txt

Each line in the PDF becomes one line in the text file, and so does each row of the table.

You can then parse the output file with Pattern (java.util.regex.Pattern), which is a very useful tool for this purpose. This way you can read the tables from the PDF files.

Sample PDF file:

Text output:
Supervisor:
LanofuSilva
Date 06/09/2018
Product ID Date Quantity
3 Allen Screw 6" 04/09/2018 300
9 Axle 127 Revest. 9/4/2018 500
15 Profile 3 12x15 09/05/2018 400
72 Metal case 15x15x5 09/02/2018 100
70 Helical Gear 1" Nylon 09/01/2048 100
45 Heline Drone 5H-12BR 02/09/2018 130

An example Pattern to recognize each production entry in the table:

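// Requires: import java.nio.file.Files; import java.util.regex.Pattern;
// arquivoTexto is the java.io.File written by ExtractText.
// Row layout: ID, description, date (two digits/two digits/four digits), quantity.
Pattern padraoLinha = Pattern.compile("\\d+\\s.+\\s\\d{2}/\\d{2}/\\d{4}\\s\\d+");
int countEntradaProducao = 0;
// readAllLines returns one String per line and may throw IOException.
for (String linha : Files.readAllLines(arquivoTexto.toPath()))
{
    if (padraoLinha.matcher(linha).matches())
    {
        countEntradaProducao++;
        // Do something with the entry
    }
}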
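
If you need the values of each row and not just a count, one possible extension is to add capture groups to the same pattern. The sketch below is only an illustration under some assumptions: the class name ProductionTableParser, the Entry holder and its field names are hypothetical, and the date is kept as plain text because the sample does not make the day/month order certain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or of the original answer.
class ProductionTableParser {

    // Hypothetical holder for one table row; field names are assumptions.
    public static final class Entry {
        public final int id;
        public final String product;
        public final String date;     // kept as text, e.g. "09/05/2018"
        public final int quantity;

        Entry(int id, String product, String date, int quantity) {
            this.id = id;
            this.product = product;
            this.date = date;
            this.quantity = quantity;
        }
    }

    // Same shape as the pattern in the answer, with four capture groups:
    // (1) id, (2) product description, (3) date, (4) quantity.
    private static final Pattern ROW =
            Pattern.compile("(\\d+)\\s(.+)\\s(\\d{2}/\\d{2}/\\d{4})\\s(\\d+)");

    // Returns one Entry per line that matches the row pattern;
    // every other line (titles, headers, blank lines) is skipped.
    public static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                entries.add(new Entry(
                        Integer.parseInt(m.group(1)),
                        m.group(2),
                        m.group(3),
                        Integer.parseInt(m.group(4))));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // A few lines taken from the sample text output above.
        List<String> lines = Arrays.asList(
                "Supervisor:",
                "Date 06/09/2018",
                "15 Profile 3 12x15 09/05/2018 400",
                "72 Metal case 15x15x5 09/02/2018 100");
        for (Entry e : parse(lines)) {
            System.out.println(e.id + " | " + e.product + " | " + e.date + " | " + e.quantity);
        }
    }
}
```

Because the rows are recognized by their content and not by their coordinates, this works wherever the table appears on the page. In a real program the list of lines would come from Files.readAllLines on the ExtractText output file.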