One way to read a table from a PDF is a two-step approach:
1. Extract the PDF's content as plain text.
2. Parse the plain text with some tool.
Of course, this approach only works if the PDF is textual, i.e. its content is not just one large image (such as a scanned page).
With the pdfbox.jar library you can extract the text by passing "ExtractText" as the first parameter, followed by the input PDF and the output text file. For example:
java -jar pdfbox.jar ExtractText C:\CAMINHO\ARQUIVO\PDF\relatorio.pdf c:\Caminho\Arquivo\Saida\Saida.txt
Each line in the PDF becomes a line in the text file, and so does each row of the table.
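As noted above, this only works for textual PDFs. One rough way to notice an image-only PDF is to check whether the extracted file contains any text at all. The sketch below is only a heuristic under my own assumptions: the class name ExtractedTextCheck and the helper hasText are hypothetical, the default file name Saida.txt is a placeholder for your real output path, and the file is assumed to be UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class ExtractedTextCheck {

    // Returns true if at least one line contains a non-whitespace character.
    static boolean hasText(List<String> lines) {
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder default; pass the real ExtractText output path as the first argument.
        Path saida = Paths.get(args.length > 0 ? args[0] : "Saida.txt");

        // Assumption: the output file is UTF-8 encoded.
        List<String> lines = Files.readAllLines(saida, StandardCharsets.UTF_8);

        if (hasText(lines)) {
            System.out.println("Text found: the file can be parsed line by line.");
        } else {
            System.out.println("No text found: the PDF is probably image-based.");
        }
    }
}
```

An empty result does not prove the PDF is scanned, and a non-empty one does not guarantee the table itself was extracted, so treat this only as a first sanity check.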
You can then parse the output file with java.util.regex.Pattern, which is a very useful tool for this purpose. This way you can read the data from the PDF files.
Sample PDF file:
Text output:
Supervisor:
Lanofu Silva
Date 06/09/2018
Product ID Date Quantity
3 Allen Screw 6" 04/09/2018 300
9 Axle 127 Revest. 9/4/2018 500
15 Profile 3 12x15 09/05/2018 400
72 Metal case 15x15x5 09/02/2018 100
70 Helical Gear 1" Nylon 09/01/2048 100
45 Heline Drone 5H-12BR 02/09/2018 130
An example Pattern that recognizes each production entry in the table:
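import java.nio.file.Files;
import java.util.regex.Pattern;

// Row layout: number, description, dd/dd/dddd date, quantity (backslashes doubled for Java strings)
Pattern padraoLinha = Pattern.compile("\\d+\\s.+\\s\\d{2}/\\d{2}/\\d{4}\\s\\d+");
int countEntradaProducao = 0;
// arquivoTexto is the java.io.File written by ExtractText; readAllLines may throw IOException
for (String linha : Files.readAllLines(arquivoTexto.toPath()))
{
    if (padraoLinha.matcher(linha).matches())
    {
        countEntradaProducao++;
        // Do something with the entry
    }
}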
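If you need the column values and not just a count, the same pattern can be given capturing groups. This is only a sketch: the class name ProductionEntryParser and the method parse are hypothetical, and I assume the columns are number, description, date and quantity, as the sample output suggests. It also uses \s+ instead of \s so that repeated spaces are tolerated, which is my own adjustment.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ProductionEntryParser {

    // Same shape as the pattern above, with a capturing group around each column.
    // Because matches() must consume the whole line, the date and quantity
    // are always the last two fields even though the description is matched lazily.
    static final Pattern PADRAO_LINHA =
        Pattern.compile("(\\d+)\\s+(.+?)\\s+(\\d{2}/\\d{2}/\\d{4})\\s+(\\d+)");

    // Returns {number, description, date, quantity}, or null if the line is not a table row.
    static String[] parse(String linha) {
        Matcher m = PADRAO_LINHA.matcher(linha.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        // Lines taken from the sample text output above.
        String[] linhas = {
            "Date 06/09/2018",
            "15 Profile 3 12x15 09/05/2018 400",
            "72 Metal case 15x15x5 09/02/2018 100"
        };
        for (String linha : linhas) {
            String[] campos = parse(linha);
            if (campos != null) {
                // Print the columns separated by " | "
                System.out.println(String.join(" | ", campos));
            }
        }
    }
}
```

The header line "Date 06/09/2018" is skipped because it does not start with a number, while the two table rows are split into their four columns. The date is kept as a string here, since the day/month order of the report is not certain from the text alone.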