How to isolate metadata with pdfminer?

I wrote this code in Python 3 to read the metadata of a PDF:

>>> from pdfminer.pdfparser import PDFParser
>>> from pdfminer.pdfdocument import PDFDocument
>>> fp = open('EMC 1-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
>>> parser = PDFParser(fp)
>>> doc = PDFDocument(parser)
>>> print(doc.info)

And as a result it generates:

[{'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'Ivanete de Araujo Costa', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114321-07'00'", 'ModDate': b"D:20170314114321-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}]

Does anyone know how to store each of these values in its own variable? For example, for the output above I would like to get:

a = "Ivanete de Araujo Costa" (Author field)
b = "EMD ADI - Emenda Aditiva" (Subject field)
c = "D:20170314114321-07'00'" (CreationDate field)
d = "D:20170314114321-07'00'" (ModDate field)
    
asked by anonymous 15.09.2017 / 13:24

1 answer

Since you can already read the PDF's metadata, the rest is easy. What doc.info returns is a list containing a dictionary, in which each key maps to a value. A value is read from a dictionary with the following syntax:

dictionary.get("key", "default_value")
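
To make that syntax concrete, here is a small sketch using a hand-made dictionary. The name info and its contents are only for illustration (the values are copied from the output in the question), so it runs without pdfminer:

```python
# Hypothetical dictionary mirroring part of the output shown in the question.
info = {
    "Author": b"Ivanete de Araujo Costa",
    "Subject": b"EMD ADI - Emenda Aditiva",
}

# A key that exists returns its stored value.
author = info.get("Author", b"")

# A key that is missing returns the default instead of raising KeyError.
keywords = info.get("Keywords", b"")

print(author)
print(keywords)
```

Passing a default is optional; without it, get returns None for a missing key.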

Now let's apply that to your case.

Since you already have the dictionary, you can split its values into variables:

# Previous code (the snippet from the question that creates doc)

# Get the dictionary itself, since doc.info is a list that contains it.
dados_recuperados = doc.info[0]

# Helper that returns a str, since each value comes back as bytes.
# Note: UTF-8 works for the four fields below; values that start with
# b'\xfe\xff' (such as Title) are UTF-16 and would fail here.
def decode_str(string):
    return string.decode("utf8")

# Finally, fetch each key and pass its value to the conversion function.
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))

# Do whatever you need with the variables.

See here for the script running on Ideone.
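
One caveat: in the output from the question, Title, Creator and Producer start with b'\xfe\xff', the UTF-16 big-endian byte order mark, so decoding them as UTF-8 raises UnicodeDecodeError. Below is a minimal sketch of a more tolerant decoder. The decode_pdf_text name and the Latin-1 fallback are assumptions made for illustration (PDF strings without the mark use PDFDocEncoding, which Latin-1 only approximates), and the dictionary is copied by hand from the question so that the snippet runs without pdfminer:

```python
def decode_pdf_text(value, default=""):
    """Hypothetical helper: turn a PDF metadata value (bytes) into str."""
    if value is None:
        # The key was missing from the metadata dictionary.
        return default
    if value.startswith(b"\xfe\xff"):
        # UTF-16 big-endian text; skip the 2-byte byte order mark.
        return value[2:].decode("utf-16-be")
    # Assumption: treat anything else as Latin-1, an approximation
    # of PDFDocEncoding that is exact for plain ASCII values.
    return value.decode("latin-1")


# Values copied from the output in the question.
info = {
    "Title": b"\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O",
    "Author": b"Ivanete de Araujo Costa",
    "CreationDate": b"D:20170314114321-07'00'",
}

title = decode_pdf_text(info.get("Title"))
author = decode_pdf_text(info.get("Author"))
creation_date = decode_pdf_text(info.get("CreationDate"))
keywords = decode_pdf_text(info.get("Keywords"))  # missing key, gives ""

print(title, author, creation_date, sep="\n")
```

With the real document, info would be doc.info[0] instead of the hand-written dictionary.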

    
15.09.2017 / 13:43