How to read a PDF

1

I'm writing a script that takes a PDF and rewrites it as text.

from StringIO import StringIO
from slate import PDF
from subprocess import Popen, PIPE, call
import uuid

# get the existing PDF
url = "/tmp/arquivo.pdf"
with open(url, "r") as arq:
    out = arq.read()

# new file to hold the parsed PDF
newfile = "/tmp/teste/" + str(uuid.uuid4()) + ".txt"

with open(newfile, "wb") as arq:
    arq.write(out)

But the output is this:

  

'

'

'

n <

This was not the result I expected. Someone pointed me to call from subprocess (without explaining it) and to the Java library PDFBox, and gave me this code:

call(["java", "-jar", "/tmp/teste/pdfbox-app-2.0.3.jar", "ExtractText", out, newfile])

I tried to use it but could not: it fails right at "java". I tried calling "python" instead and that works, but that's not what I need.

I searched but could not find an example of calling Java this way. Is this how it is used?

I want readable text, with the PDF content output in the right order (respecting columns, lines, etc.). How do I convert a PDF into text?

    
asked by anonymous 27.10.2016 / 22:59

1 answer

4

Reading a PDF is a much more complicated process than it sounds. If you just want to extract the text, the slate library you are importing is what does that; your attempt simply never calls slate.

Another point is that a PDF file must be opened for reading in binary mode, by passing "rb" as the mode to open. Otherwise it is opened as text by default, and the automatic translation applied in text mode (for example, newline conversion) destroys the structure of the data read.
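To illustrate the point about binary mode, here is a minimal sketch. The helper looks_like_pdf and the stand-in file are hypothetical names introduced only for this example; they are not part of slate. It writes a few bytes shaped like the start of a PDF, reads them back with "rb", and checks that they arrive untouched.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Return True if the file at `path` starts with the PDF signature."""
    # 'rb' hands back the raw bytes exactly as they are stored on disk
    with open(path, "rb") as pdf_file:
        header = pdf_file.read(5)
    return header == b"%PDF-"


# Hypothetical stand-in content so the sketch runs without a real PDF
sample_bytes = b"%PDF-1.4\r\n%\xe2\xe3\xcf\xd3\r\n"

fd, sample_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as sample:
    sample.write(sample_bytes)

# Read the whole file back in binary mode
with open(sample_path, "rb") as pdf_file:
    raw = pdf_file.read()

is_pdf = looks_like_pdf(sample_path)
unchanged = raw == sample_bytes
os.remove(sample_path)

print(is_pdf, unchanged)
```

Opened with "r" instead, the same bytes may be altered or fail to decode, depending on the platform and Python version.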

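from slate import PDF
from tempfile import mktemp
...

# name of the text file that will receive the extracted content
output_name = mktemp() + ".txt"

# open the PDF in binary mode ('rb') and the output in text mode ('wt')
with open(url, 'rb') as pdf_file, open(output_name, 'wt') as output:
    doc = PDF(pdf_file)  # slate parses the PDF into a list of page texts
    for page in doc:
        output.write(page + '\n')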

(An example of how to use slate is at: link)
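As for the PDFBox call in the question, the following is only a sketch under some assumptions: that the error at "java" means no Java executable is on the PATH, that Python 3 is available (for shutil.which), and that the output path /tmp/teste/saida.txt is a made-up name. Note that the call in the question passes out, the contents of the file, where ExtractText expects the path of the input PDF. The helpers build_extract_command and extract_text are hypothetical names for this illustration.

```python
import shutil
import subprocess


def build_extract_command(jar_path, pdf_path, txt_path):
    """Build the argument list for the PDFBox ExtractText tool."""
    # Every argument is its own list item; the file arguments are paths,
    # not file contents
    return ["java", "-jar", jar_path, "ExtractText", pdf_path, txt_path]


def extract_text(jar_path, pdf_path, txt_path):
    """Run PDFBox, but only if a `java` executable can be found."""
    if shutil.which("java") is None:
        raise RuntimeError("java executable not found on PATH")
    # check_call raises CalledProcessError on a non-zero exit status
    subprocess.check_call(build_extract_command(jar_path, pdf_path, txt_path))


# Paths taken from the question, plus a hypothetical output file name
command = build_extract_command(
    "/tmp/teste/pdfbox-app-2.0.3.jar",
    "/tmp/arquivo.pdf",
    "/tmp/teste/saida.txt",
)
print(command)
```

Calling extract_text with the same three paths would run the tool; it is left out here so the sketch does not depend on Java being installed.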

    
27.10.2016 / 23:55