Convert PDF to Text with Python

4

I have a PDF file hosted on a website, and I would like to know how to extract the text from this PDF and store it in a variable.

I already know how to access the site that hosts the PDF. My difficulty is converting the PDF into text, or simply copying its text.

I'm using Python 3.x

    
asked by anonymous 10.09.2017 / 07:45

1 answer

2

A PDF can either store its text directly or contain only scanned page images. In the second case you need an OCR package to extract the text (keep in mind that OCR is not 100% accurate). There are several options in Python; a good one that works on Python 2.7 and 3.4 is textract, which can read embedded text and can also use OCR.

See an example:

import textract

# textract.process returns bytes, so decode them to get a str
text = textract.process("orcamento.pdf").decode("utf-8")
print(text)

Clicar para incluir o cabeçalho

EXEMPLO DE ORÇAMENTO: Exemplos de Itens Detalhados
OBSERVAÇÃO : Este é somente um exemplo. Nem todos os orçamentos terão todos os exemplos listados abaixo. Favor usar somente os itens que dizem
respeito ao seu projeto proposto.

I. SALÁRIOS
Diretor Executivo
Diretor de Projeto
Contador
Editor Sênior
Editor

Salário Anual
5000
4000
2000
750
500

Porcentagem
50%
100%
50%
20%
45%

I used this PDF as an example. The output above is only part of the result, copied for demonstration.
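If the PDF turns out to be a scan with no embedded text, the plain call may return nothing useful. The sketch below is one possible way to handle that, not the only one: it tries the default extraction first and retries with OCR only when the result is empty. It assumes textract's `method="tesseract"` option for PDFs and a working Tesseract installation; the function name `extract_pdf_text` and the `process` parameter are hypothetical, added here so the logic can be tried without textract installed.

```python
def extract_pdf_text(path, process=None):
    """Return the text of a PDF, retrying with OCR if no text is found.

    `process` defaults to textract.process; it is a parameter only so the
    fallback logic can be exercised with a stand-in function.
    """
    if process is None:
        import textract  # third-party package, imported lazily
        process = textract.process

    # First attempt: read the text embedded in the PDF.
    text = process(path).decode("utf-8")

    if not text.strip():
        # Nothing came back, so assume the pages are scanned images
        # and retry with OCR (assumes Tesseract is installed).
        text = process(path, method="tesseract").decode("utf-8")

    return text
```

Calling `extract_pdf_text("orcamento.pdf")` would then return a `str` in both cases, at the cost of a slower second pass for scanned files.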

Note:

  • In your case, you would have to download the PDF to a local directory first and then run the example above on the downloaded file.
  • To install on Python 3, see this link.
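The download step from the first note could look roughly like this. It is a minimal sketch using only the standard library for the download; the helper names `download_pdf` and `pdf_url_to_text`, the file name `downloaded.pdf`, and the use of a temporary directory are all assumptions made for illustration.

```python
import os
import tempfile
import urllib.request


def download_pdf(url, dest_dir=None):
    """Download the file at `url` into `dest_dir` and return its local path."""
    # Use a fresh temporary directory when no destination is given.
    dest_dir = dest_dir or tempfile.mkdtemp()
    local_path = os.path.join(dest_dir, "downloaded.pdf")
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        out.write(response.read())
    return local_path


def pdf_url_to_text(url):
    """Download a PDF and return its text as a str."""
    import textract  # third-party package, imported lazily

    local_path = download_pdf(url)
    # textract.process returns bytes, so decode them to get a str.
    return textract.process(local_path).decode("utf-8")
```

With these helpers, `text = pdf_url_to_text(pdf_url)` puts the extracted text in a variable, where `pdf_url` stands for the address of the PDF on the site.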
10.09.2017 / 17:30