PDF to text: arranging the columns


I am scraping .pdf files, and I need to turn them into organized text: each line of text in the file contains 3 different columns.

For example, in this file you can see the 3 columns in question. I can read the file as .txt with the following code:

library("rvest")
library("pdftools")

pdf_link <- "http://pesquisa.in.gov.br/imprensa/servlet/INPDFViewer?jornal=1&pagina=3&data=03/04/2017&captchafield=firistAccess"

# Start a session and access the .pdf
s <- html_session(pdf_link) %>%
  jump_to(pdf_link)

# Save the file as pdf and then read it
tmp <- tempfile(fileext = '.pdf')
writeBin(s$response$content, tmp)
doc <- pdf_text(tmp)

The problem is that, on each line of the text file, the 3 columns are separated by spaces, and each line (with the 3 columns) is separated by a \r\n .

What I want is to separate the columns so that the text makes sense.

The idea I had is:

  • Split the rows at each \r\n
  • Split the columns based on the number of spaces (for example: if there is a run of 5 consecutive spaces, treat it as a column boundary).
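In rough terms, the two steps would look something like this on a made-up string (the column contents and gap widths here are invented just for illustration):

```r
# Toy input: two lines, each with three columns separated by wide gaps
txt <- "col1a     col2a     col3a\r\ncol1b     col2b     col3b"

# Step 1: split into lines at \r\n
rows <- strsplit(txt, "\r\n")[[1]]

# Step 2: split each line at runs of 5 or more spaces
cols <- lapply(rows, function(l) strsplit(l, " {5,}")[[1]])

cols[[1]]  # the three columns of the first line
```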

I've never worked with strings and regex, so I'm having difficulties. I will also need to automate this for multiple files, which may cause many errors because of the number of spaces or the layout of the columns.

If there is any other solution based on the specifics of .pdf files, that would also be very interesting.

    
asked by anonymous 24.04.2017 / 21:30

2 answers


See if this helps:

doc1 <- unlist(stringr::str_split(doc, "\\s{5,}|\n"))
c1 <- paste0(doc1[seq(5, length(doc1), 3)], collapse = " ")
c2 <- paste0(doc1[seq(6, length(doc1), 3)], collapse = " ")
c3 <- paste0(doc1[seq(7, length(doc1), 3)], collapse = " ")
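To see what the seq() indexing is doing, here is a toy stand-in for doc1 (the element values are made up): the first 4 elements are page-header fragments that get skipped, and the remaining elements interleave the three columns row by row:

```r
# Made-up stand-in for doc1: 4 header fragments, then the three
# columns interleaved row by row
doc1 <- c("h1", "h2", "h3", "h4",
          "col1-row1", "col2-row1", "col3-row1",
          "col1-row2", "col2-row2", "col3-row2")

# Every 3rd element starting at position 5 belongs to column 1
c1 <- paste0(doc1[seq(5, length(doc1), 3)], collapse = " ")
c1  # "col1-row1 col1-row2"
```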

You can also try the tabulizer package. It apparently overcomes the limitation of columns of different sizes:

library(tabulizer)
tmp <- tempfile()

url <- "http://pesquisa.in.gov.br/imprensa/servlet/INPDFViewer?jornal=1&pagina=2&data=03/04/2017&captchafield=firistAccess"

httr::GET(url, httr::write_disk(tmp))

doc <- extract_text(tmp)
    
24.04.2017 / 23:32

José's answer works well for the page in question. But try that algorithm on page 2 or 10 and you will see that things get a little out of control.

This is because not all columns are the same size in the DOU (an assumption in José's answer). On page 2, for instance, the first column has fewer than 40 rows and the remaining text is divided equally between the two remaining columns; also, the number of elements of doc1 that must be "skipped" - doc1[1:4] - varies from page to page.

My approach to this problem so far has been:

  • Open the *.pdf of the DOU in Word and save it as *.txt (this can be automated in many ways, but I do not know of any from within R ).

  • Read the *.txt with readLines() . In the *.txt created by Word, the columns (be there one, two, or three) are "stacked", so you can work with the text more easily.

  • The advantage / disadvantage of this approach is that you rely on Microsoft's algorithm to do the conversion, which is much better than one that could be created quickly but is beyond our control.
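As a sketch of the second step, simulated with a temporary file (the file contents here are invented; the real input would come from Word's conversion):

```r
# Simulate the *.txt that Word would produce: the columns come out
# stacked, so readLines() returns one element per line of text
tmp_txt <- tempfile(fileext = ".txt")
writeLines(c("first line of column 1",
             "first line of column 2",
             "first line of column 3"), tmp_txt)

linhas <- readLines(tmp_txt)
length(linhas)  # one element per stacked line
```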

    25.04.2017 / 15:33