I have a set of more than 1,000 PDFs from which I need to extract metadata. The problem is that the PDFs use different encodings.
The first example worked when I decoded with utf8. The second example gave an error. It's Python 3 code:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
First example, it worked:
def decode_str(string):
    return string.decode("utf8")
fp = open('EMC 1-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
dados_recuperados = doc.info[0]
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))
print (dados_recuperados)
Result -> {'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'Ivanete de Araujo Costa', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114321-07'00'", 'ModDate': b"D:20170314114321-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}
Second example, gave error:
def decode_str(string):
    return string.decode("utf8")
fp = open('EMC 4-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
dados_recuperados = doc.info[0]
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-21-4826dcab6968> in <module>()
6 doc = PDFDocument(parser)
7 dados_recuperados = doc.info[0]
----> 8 author = decode_str(dados_recuperados.get("Author"))
9 subject = decode_str(dados_recuperados.get("Subject"))
10 creation_date = decode_str(dados_recuperados.get("CreationDate"))
<ipython-input-21-4826dcab6968> in decode_str(string)
1 def decode_str(string):
----> 2 return string.decode("utf8")
3
4 fp = open('EMC 4-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
5 parser = PDFParser(fp)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte
print (dados_recuperados)
Result -> {'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'\xfe\xff\x00M\x00O\x00D\x00.\x00C\x00O\x00N\x00L\x00E\x00.\x00S\x00T\x00 \x002\x001\x003\x000\x00/\x002\x000\x001\x007\x00 \x00-\x00 \x00P\x00_\x006\x007\x003\x006\x00 \x00-\x00 \x00D\x00a\x00v\x00i\x00 \x00R\x00i\x00b\x00e\x00i\x00r\x00o\x00 \x00d\x00e\x00 \x00O\x00l\x00i\x00v\x00e\x00i\x00r\x00a\x00 \x00J\x00\xfa\x00n\x00i\x00o\x00r', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114344-07'00'", 'ModDate': b"D:20170314114344-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}
Is there a way to convert the values in dados_recuperados = doc.info[0] to a single standard encoding? Or is there a way to test each byte string before decoding it, to know which codec to use?
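
For illustration, here is a minimal sketch of the kind of per-value test I have in mind. It relies on the PDF convention that text strings starting with the bytes FE FF are UTF-16BE, which matches the values that fail above. The helper names decode_pdf_string and decode_info are hypothetical, and the latin-1 fallback is only an assumed approximation of PDFDocEncoding, so it may be wrong for some characters:

```python
import codecs


def decode_pdf_string(raw, fallback="latin-1"):
    """Decode one PDF metadata byte string, choosing the codec from its prefix.

    Hypothetical helper: `fallback` is an assumption standing in for
    PDFDocEncoding, which the standard library does not provide.
    """
    if raw is None:
        # Missing keys (e.g. no "Author") come back as None from dict.get.
        return None
    if raw.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be", errors="replace")
    if raw.startswith(codecs.BOM_UTF8):  # b'\xef\xbb\xbf'
        return raw[len(codecs.BOM_UTF8):].decode("utf-8", errors="replace")
    return raw.decode(fallback, errors="replace")


def decode_info(info):
    """Return a copy of an info dict with every bytes value decoded to str."""
    return {
        key: decode_pdf_string(value) if isinstance(value, bytes) else value
        for key, value in info.items()
    }


# Sample values copied from the output above, standing in for doc.info[0].
sample = {
    "Title": b"\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O",
    "Subject": b"EMD ADI - Emenda Aditiva",
}
decoded = decode_info(sample)
print(decoded)
```

With something like this, decode_info(doc.info[0]) would replace the four separate decode_str calls, and a missing key would yield None instead of raising an AttributeError.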