I'm writing a script that reads a PDF and writes its contents out as plain text.
from StringIO import StringIO
from slate import PDF
from subprocess import Popen, PIPE, call
import uuid
# read the existing PDF
url = "/tmp/arquivo.pdf"
with open(url, "rb") as arq:
    out = arq.read()
# new file that should hold the text parsed from the PDF
newfile = "/tmp/teste/" + str(uuid.uuid4()) + ".txt"
with open(newfile, "wb") as arq:
    arq.write(out)
But the output is this:
'
'
'n <
That is not the result I expected. Someone pointed me to subprocess's call (without explaining it) and to the Java tool PDFBox, and gave me this code:
call(["java", "-jar", "/tmp/teste/pdfbox-app-2.0.3.jar", "ExtractText", out, newfile])
I tried to use it but could not: it fails right at "java". Calling "python" the same way works, but that is not what I need.
I searched but could not find an example of calling Java like this. Is this how it is supposed to be used?
I want readable text, with the PDF's content in the right order (respecting columns, lines, etc.). How do I convert a PDF to text?
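For reference, here is a minimal sketch of how I understand the PDFBox call should probably be structured. It assumes that ExtractText takes the input and output file paths (not the PDF's bytes, which is what `out` holds in my script) and that the failure at "java" may simply mean the executable is not on the PATH. The helper names `find_on_path`, `build_extract_command` and `pdf_to_text` are made up for illustration. The `-sort` option is listed in the PDFBox command-line help as sorting the text before writing it, but whether it fixes column order for a given PDF is something I have not verified.

```python
import os
from subprocess import call


def find_on_path(program):
    """Return the full path of `program` if it is an executable on PATH, else None.

    Hypothetical helper; POSIX-style lookup only (no ".exe" handling).
    """
    for folder in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(folder, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def build_extract_command(jar_path, pdf_path, txt_path, sort=False, java="java"):
    """Build the argument list for PDFBox's ExtractText tool.

    Assumption: the tool is invoked as
        java -jar <jar> ExtractText [options] <input.pdf> <output.txt>
    and every argument is a file path or option, never the PDF's contents.
    """
    command = [java, "-jar", jar_path, "ExtractText"]
    if sort:
        # Assumed from the PDFBox CLI help: sort text before writing it.
        command.append("-sort")
    command.extend([pdf_path, txt_path])
    return command


def pdf_to_text(jar_path, pdf_path, txt_path, sort=False):
    """Run ExtractText and return the exit code of the java process."""
    if find_on_path("java") is None:
        raise OSError("java was not found on PATH; install a JRE or pass its full path")
    return call(build_extract_command(jar_path, pdf_path, txt_path, sort=sort))
```

With the paths from my script, this would be called as `pdf_to_text("/tmp/teste/pdfbox-app-2.0.3.jar", url, newfile)`, passing `url` rather than `out`. A return value of 0 would mean the Java process exited normally.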