I am scraping .pdf files and need their contents as organized text, because each line of text in a file spans 3 different columns.
For example in this file, you can see the 3 columns in question.
I can read the file as .txt with the following code:
library("rvest")
library("pdftools")
pdf_link <- "http://pesquisa.in.gov.br/imprensa/servlet/INPDFViewer?jornal=1&pagina=3&data=03/04/2017&captchafield=firistAccess"
# Start a session and access the .pdf
s <- html_session(pdf_link) %>%
  jump_to(pdf_link)
# Save the file as a .pdf and then read it
tmp <- tempfile(fileext = '.pdf')
writeBin(s$response$content,tmp)
doc <- pdf_text(tmp)
The problem is that, on each line of the extracted text, the 3 columns are separated only by spaces, and each line (containing all 3 columns) ends with \r\n.
What I want is to separate the columns so that the text makes sense.
The idea I had is:
- Split the rows on \r\n
- Split the columns based on the number of spaces (for example: treat a run of 5 consecutive spaces as a column boundary).
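To make the idea concrete, here is a minimal sketch of those two steps in base R. It is only an illustration under assumptions: `page` is a made-up string standing in for one element of `doc`, and the 5-space threshold is a guess that would probably need tuning per file.

```r
# Hypothetical sample standing in for one page returned by pdf_text()
gap  <- strrep(" ", 5)
page <- paste0("a1", gap, "b1", gap, "c1", "\r\n",
               "a2", gap, "b2", gap, "c2", "\r\n")

# Step 1: split the page into rows on \r\n (also accepts a bare \n)
rows <- strsplit(page, "\r?\n")[[1]]
rows <- rows[nzchar(trimws(rows))]   # drop empty rows

# Step 2: split each row wherever 5 or more consecutive spaces occur
cells <- strsplit(trimws(rows), " {5,}")

# Keep only rows that produced exactly 3 cells; the rest need inspection
ok <- lengths(cells) == 3
columns <- do.call(rbind, cells[ok])  # character matrix, one column per PDF column
```

This only works when every row has text in all 3 columns and the gaps are always at least 5 spaces wide; a row with an empty column or a narrow gap would be dropped by the `ok` filter rather than split incorrectly.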
I've never worked with strings and regex, so I'm having difficulties. I will also need to automate this for multiple files, which could produce a lot of errors because the number of spaces and the layout of the columns may vary.
If there is another solution that takes advantage of the specifics of the .pdf format, I would be very interested in that too.