Scraping in Python - reading a PDF

6

I wrote a scraper in Python that takes the URL of any PDF, reads the file, and returns its text. With some PDFs, however, the output contains characters like this:

  

". \nO \xc3\xb3rg\xc3\xa3o tamb\xc3\xa9m divulga resultado \n\nGH\xc3\x80QLWLYR\x03GRV\x03FDQGLGDWRV\x03TXH\x03VH\x03GHFODUDUDP\x03FRP\x03GH\xc3\x80FLrQFLD\x03H\x03GRV\x03SHGLGRV\x03\n de atendimento especial deferidos. \nO concurso visa o provimento \nefetivo de 150 vagas para a classe \ninicial (Classe A) do cargo de delegado de Pol\xc3\xadcia Civil, cujas vagas ser\xc3\xa3o \n\nprovidas conforme a ordem de clasVL\xc3\x80FDomR\x03H\x03D\x03QHFHVVLGDGH\x03GR\x03VHUYLoR\x11\nA "

From what I can tell, this happens when the document contains accented characters, columns, or even dashes.

I also noticed that when the PDF contains an image, strange characters come back as well. Does anyone have a solution or an idea that could help?

    
asked by anonymous 19.09.2016 / 21:42

2 answers

3

Another alternative is to call str.encode with the Latin-1 encoding, which recovers the original bytes, and then bytes.decode to interpret those bytes as UTF-8. Here's an example:

print("\xc3\xb3".encode('latin1').decode('utf-8'))  # ó

In your case, do this:

print(texto.encode('latin1').decode('utf-8'))

Here, texto is the variable holding the string you want to encode and decode.

Result:

O órgão também divulga resultado

GHÀQLWLYRGRVFDQGLGDWRVTXHVHGHFODUDUDPFRPGHÀFLrQFLDHGRVSHGLGRV
de atendimento especial deferidos.
O concurso visa o provimento
efetivo de 150 vagas para a classe
inicial (Classe A) do cargo de delegado de Polícia Civil, cujas vagas serão

providas conforme a ordem de clasVLÀFDomRHDQHFHVVLGDGHGRVHUYLoR
A
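The encode/decode round trip above can be wrapped in a small helper so that strings which are not affected pass through untouched. This is only a minimal sketch: the name fix_mojibake is hypothetical, and it assumes the garbling comes from UTF-8 bytes that were read as Latin-1.

```python
def fix_mojibake(text: str) -> str:
    """Re-read a str whose characters are really UTF-8 bytes decoded as Latin-1.

    fix_mojibake is a hypothetical helper name, not part of any library.
    """
    try:
        # Latin-1 maps code points 0-255 one-to-one onto bytes, so this
        # recovers the original byte sequence before decoding it as UTF-8.
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the recovered
        # bytes are not valid UTF-8. In both cases return it unchanged.
        return text
```

Calling fix_mojibake(texto) then replaces the bare encode/decode chain, and a string that is already correct, such as "órgão", comes back as it went in.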
    
19.09.2016 / 22:52
2

Using Python and pdfminer:

import io

# process_pdf ships with pdfminer3k (a Python 3 port of pdfminer);
# as far as I know, pdfminer.six does not provide it.
from pdfminer.pdfinterp import PDFResourceManager, process_pdf
from pdfminer.converter import TextConverter


class LeitorPdf():
    def __init__(self, **kwargs):
        self.resource_manager = PDFResourceManager(caching=False)
        # The extracted text accumulates in this in-memory buffer.
        self.output_stream = io.StringIO()
        self.device = TextConverter(self.resource_manager, self.output_stream, laparams=None)

    def extrair_texto(self, file_name):
        # Open in binary mode; the with block closes the file after processing.
        with io.open(file_name, 'rb') as fp:
            # An empty page set and maxpages=0 mean "process every page".
            process_pdf(self.resource_manager, self.device, fp, set(), maxpages=0, password='', caching=False, check_extractable=True)
        return self.output_stream.getvalue()

The PDF needs to be saved to disk first, because extrair_texto takes a file path.

Usage:

texto = LeitorPdf().extrair_texto(nome_arquivo)
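Since the question starts from a URL and the reader above expects a file on disk, one possible way to bridge the two is to download the PDF to a temporary file first. The sketch below uses only the standard library; the function name baixar_pdf is an assumption made for illustration and is not part of pdfminer.

```python
import shutil
import tempfile
import urllib.request


def baixar_pdf(url: str) -> str:
    """Download the resource at url to a temporary .pdf file and return its path.

    baixar_pdf is a hypothetical helper. The caller is responsible for
    deleting the file once the text has been extracted.
    """
    with urllib.request.urlopen(url) as resposta:
        # delete=False keeps the file on disk after the with block closes it,
        # so its path can be handed to a PDF reader afterwards.
        with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as destino:
            # Stream the response to disk in chunks instead of loading it whole.
            shutil.copyfileobj(resposta, destino)
            return destino.name
```

With that in place, nome_arquivo = baixar_pdf(url) produces the path that extrair_texto expects, and os.remove(nome_arquivo) cleans up afterwards.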
    
19.09.2016 / 22:25