Using the tabulizer package, I extracted the table from the first page only, as a test:
library(tabulizer)
library(dplyr)
library(stringi)
url <- 'http://www2.alerj.rj.gov.br/leideacesso/spic/arquivo/folha-de-pagamento-2018-01.pdf'
d <- extract_tables(url, encoding = "UTF-8", pages = 1)
Then I converted the list into a data frame, coerced every column to character, set the column names from the first row, and dropped that row (it holds the variable names, not data):
d <- as.data.frame(d)
d <- d %>%
mutate_all(funs(as.character(.)))
names(d) <- d[1,]
d <- d[-1,]
Next you need to clean the values: the PDF uses . as the thousands separator and , as the decimal separator, so both must be replaced before converting the columns to numeric:
d <- d %>%
  mutate_all(funs(gsub("-", NA, .)))
d <- d %>%
  mutate_at(vars(VENCIMENTO:`TOTAL LÍQUIDO`), funs(gsub("\\.", "", .))) %>%
  mutate_at(vars(VENCIMENTO:`TOTAL LÍQUIDO`), funs(as.numeric(gsub(",", ".", .))))
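As a quick standalone illustration of those two substitutions (base R only, sample value is mine):

```r
# "1.234,56" in the PDF means 1234.56
x <- "1.234,56"
x <- gsub("\\.", "", x)        # drop the thousands separator -> "1234,56"
as.numeric(gsub(",", ".", x))  # swap the decimal comma, then parse -> 1234.56
```

Note that the dot must be escaped as "\\." in the pattern, since gsub uses regular expressions by default and an unescaped . matches any character.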
If you remove the pages argument from extract_tables, it will extract every page of the PDF and return them in a single list. To join them into one table, do.call(rbind, d) should work.
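Putting the pieces together, a minimal sketch of the all-pages version (assuming every page of the PDF yields a matrix with the same columns; I have not run this against the full file):

```r
library(tabulizer)

url <- 'http://www2.alerj.rj.gov.br/leideacesso/spic/arquivo/folha-de-pagamento-2018-01.pdf'

# With no `pages` argument, extract_tables() returns a list with one matrix per page
tabs <- extract_tables(url, encoding = "UTF-8")

# Stack all page matrices row-wise into a single data frame
d <- as.data.frame(do.call(rbind, tabs), stringsAsFactors = FALSE)

# Then apply the same header and number clean-up steps shown above
```

If the header row is repeated on every page, you may also need to filter out those repeated rows after binding.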