Reading a PDF in R

4

I need to convert the PDF data below into a data frame: link

While searching, I found the article "How to Read PDF Data in R". I had some trouble installing the package, but eventually got it working in RStudio. The result was not satisfactory, though: in columns with three or more blank rows, the values jump to another column.

    
asked by anonymous 19.03.2018 / 19:24

1 answer

4

Using the tabulizer package, I extracted the tables from the first page only, as a test:

library(tabulizer)
library(dplyr)
library(stringi)
url <- 'http://www2.alerj.rj.gov.br/leideacesso/spic/arquivo/folha-de-pagamento-2018-01.pdf'
d <- extract_tables(url, encoding = "UTF-8", pages = 1)

Then I converted the list into a data frame, converted every column to character (chr), used the first row as the variable names, and removed that row (it actually holds the names of the variables):

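d <- as.data.frame(d)
d <- d %>% 
  mutate_all(as.character)       # funs() is deprecated in recent dplyr
names(d) <- unlist(d[1, ])       # the first row holds the column names
d <- d[-1, ]                     # drop the header row from the data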

Next, you need to clean up the values: remove the thousands separator (.), replace the decimal comma (,) used in the PDF with a point, and convert the columns to numeric:

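# Cells that contain only a dash mean "no value"
d <- d %>% 
  mutate_all(~ gsub("^\\s*-\\s*$", NA, .))
# Drop the thousands separator, then turn the decimal comma into a point
d <- d %>% 
  mutate_at(vars(VENCIMENTO:`TOTAL LÍQUIDO`), ~ gsub("\\.", "", .)) %>% 
  mutate_at(vars(VENCIMENTO:`TOTAL LÍQUIDO`), ~ as.numeric(gsub(",", ".", .)))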
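
To check the cleaning logic without the PDF, here is a minimal base R sketch of the same conversion. The helper name parse_br_number and the sample values are made up for illustration and do not come from the file:

```r
# Hypothetical helper: convert strings such as "1.234,56" to the number 1234.56
parse_br_number <- function(x) {
  x <- trimws(x)
  x[x %in% c("-", "")] <- NA            # a lone dash or an empty cell means "no value"
  x <- gsub(".", "", x, fixed = TRUE)   # drop the thousands separators
  x <- sub(",", ".", x, fixed = TRUE)   # decimal comma -> decimal point
  as.numeric(x)
}

# Made-up sample values, not taken from the PDF
valores <- c("1.234,56", "-", "987,00", "12.345.678,90")
parse_br_number(valores)
```

If readr is installed, parse_number() with locale(decimal_mark = ",", grouping_mark = ".") should do a similar job.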

If you omit the pages argument from extract_tables(), it extracts every page of the PDF and returns the tables in a single list. To combine them into one table, I think do.call(rbind, d) should work.
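
A rough sketch of that combining step follows. It uses a toy list in place of the real extract_tables() output, and it assumes that every page has the same number of columns and repeats the header row. I have not verified either assumption against this PDF:

```r
# Toy stand-in for the list returned by extract_tables(): one character matrix per page
pages <- list(
  matrix(c("NOME", "VENCIMENTO",
           "ANA",  "1.000,00"), ncol = 2, byrow = TRUE),
  matrix(c("NOME", "VENCIMENTO",
           "JOSE", "2.500,50"), ncol = 2, byrow = TRUE)
)

all_rows <- do.call(rbind, pages)   # stack all pages into one matrix
header <- all_rows[1, ]             # assumed: the first row holds the column names

# Flag every row that is just a repeat of the header
is_header <- apply(all_rows, 1, function(r) all(r == header))

combined <- as.data.frame(all_rows[!is_header, , drop = FALSE],
                          stringsAsFactors = FALSE)
names(combined) <- header
combined
```

If some pages come back with a different number of columns, rbind will fail, and those pages need to be inspected one by one first.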

    
19.03.2018 / 21:49