PDF reading in R


I am working on a college assignment and would like to get the gate revenue and attendance for every Brazilian championship match of the last few years. The CBF publishes this information across a series of links; one example is the Borderô. For similar problems I use the tabulizer package, as in the code below:

library(tabulizer)
url <- 'https://conteudo.cbf.com.br/sumulas/2014/1421b.pdf'
d <- extract_tables(url, encoding = "UTF-8")

For tables generated directly as PDF this works perfectly, but for this type of PDF (which was probably printed, scanned, and then saved as PDF) it does not work: the code returns a list with 0 elements. Are there any ideas or packages I could use?

    
asked by anonymous 08.05.2018 / 02:00

1 answer


The table in the PDF is an image. The tabulizer package looks for text elements, and it returns an empty list precisely because there is no text in the file. You need a technique that recognizes text inside an image. I suggest looking into OCR (optical character recognition), a process that extracts text from an image.
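One way to confirm this diagnosis is to check whether the PDF has a text layer at all. The sketch below assumes the pdftools package is installed; `has_text_layer` and `pdf_has_text` are hypothetical helper names, not part of any package. If the file really is a scan, the check should come back FALSE.

```r
# Hypothetical helper: TRUE if any page contains non-whitespace text.
has_text_layer <- function(page_text) {
  any(nzchar(trimws(page_text)))
}

# Hypothetical wrapper: download the PDF and test it for a text layer.
# Assumes the pdftools package is installed and the URL is reachable.
pdf_has_text <- function(url) {
  pdf_file <- tempfile(fileext = ".pdf")
  download.file(url, pdf_file, mode = "wb", quiet = TRUE)
  # pdf_text() returns one character string per page
  has_text_layer(pdftools::pdf_text(pdf_file))
}

# Usage with the URL from the question:
# pdf_has_text('https://conteudo.cbf.com.br/sumulas/2014/1421b.pdf')
```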

In R, the tesseract package performs this operation. The link below is a tutorial for the package that shows how to extract text from an image.

link

This part of the tutorial shows how to extract text from a PDF:

link
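As a rough illustration of that approach, here is a minimal sketch that renders each PDF page to a PNG with pdftools and passes it to tesseract. It is not taken from the tutorial. `ocr_pdf` and `ocr_lines` are hypothetical helper names, the `"por"` language and 300 dpi are assumptions, and the Portuguese language data must be installed first (for example with `tesseract::tesseract_download("por")`).

```r
# Hypothetical helper: render each page of a PDF to PNG and run OCR on it.
# Assumes the pdftools and tesseract packages are installed, along with
# the tesseract language data named in `language`.
ocr_pdf <- function(pdf_file, language = "por", dpi = 300) {
  engine <- tesseract::tesseract(language)
  # pdf_convert() writes one image per page and returns the file names
  images <- pdftools::pdf_convert(pdf_file, format = "png", dpi = dpi)
  on.exit(unlink(images))  # remove the temporary images afterwards
  vapply(images,
         function(img) tesseract::ocr(img, engine = engine),
         character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: split OCR output into trimmed, non-empty lines.
ocr_lines <- function(page_text) {
  lines <- trimws(unlist(strsplit(page_text, "\n", fixed = TRUE)))
  lines[nzchar(lines)]
}

# Usage, after downloading the PDF to a local file `pdf_file`:
# pages <- ocr_pdf(pdf_file)
# ocr_lines(pages)
```

The result is plain text, not a table, so the revenue and attendance values still have to be picked out of the lines, and OCR quality on a scanned form may vary.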

    
08.05.2018 / 03:36