How do I handle several encodings in PDF metadata?

I have a set of more than 1,000 PDFs from which I need to extract the metadata. The problem is that the PDFs use different encodings. The first example worked when I decoded with utf8. The second example raised an error. This is Python 3 code:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

First example, which worked:

def decode_str(string):
    return string.decode("utf8")

fp = open('EMC 1-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
dados_recuperados = doc.info[0]
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))

print(dados_recuperados)

Result -> {'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'Ivanete de Araujo Costa', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114321-07'00'", 'ModDate': b"D:20170314114321-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}

Second example, which raised an error:

def decode_str(string):
    return string.decode("utf8")

fp = open('EMC 4-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
dados_recuperados = doc.info[0]
author = decode_str(dados_recuperados.get("Author"))
subject = decode_str(dados_recuperados.get("Subject"))
creation_date = decode_str(dados_recuperados.get("CreationDate"))
mod_date = decode_str(dados_recuperados.get("ModDate"))

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-21-4826dcab6968> in <module>()
      6 doc = PDFDocument(parser)
      7 dados_recuperados = doc.info[0]
----> 8 author = decode_str(dados_recuperados.get("Author"))
      9 subject = decode_str(dados_recuperados.get("Subject"))
     10 creation_date = decode_str(dados_recuperados.get("CreationDate"))

<ipython-input-21-4826dcab6968> in decode_str(string)
      1 def decode_str(string):
----> 2     return string.decode("utf8")
      3 
      4 fp = open('EMC 4-2017 PL678716 =- PL 6787-2016.pdf', 'rb')
      5 parser = PDFParser(fp)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 0: invalid start byte

print(dados_recuperados)

Result -> {'Title': b'\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O', 'Author': b'\xfe\xff\x00M\x00O\x00D\x00.\x00C\x00O\x00N\x00L\x00E\x00.\x00S\x00T\x00 \x002\x001\x003\x000\x00/\x002\x000\x001\x007\x00 \x00-\x00 \x00P\x00_\x006\x007\x003\x006\x00 \x00-\x00 \x00D\x00a\x00v\x00i\x00 \x00R\x00i\x00b\x00e\x00i\x00r\x00o\x00 \x00d\x00e\x00 \x00O\x00l\x00i\x00v\x00e\x00i\x00r\x00a\x00 \x00J\x00\xfa\x00n\x00i\x00o\x00r', 'Subject': b'EMD ADI - Emenda Aditiva', 'Creator': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000', 'CreationDate': b"D:20170314114344-07'00'", 'ModDate': b"D:20170314114344-07'00'", 'Producer': b'\xfe\xff\x00M\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00\xae\x00 \x00W\x00o\x00r\x00d\x00 \x002\x000\x001\x000'}

Is there a way to convert everything in dados_recuperados = doc.info[0] to one standard encoding? Or to test each string before decoding it, to find out which encoding to use?

    
asked by anonymous 15.09.2017 / 20:30

1 answer

What you really want is to read strings in several encodings and convert them all to the encoding your Python script uses. You are probably using something compatible with latin1, so I recommend setting the default explicitly in your script before anything else, because on another machine the default in the terminal or cmd may be completely different.

You can set whichever default you want. Suppose you only want to use utf-8; add this at the top of your .py file (note that this declaration tells Python how to read the source file itself; it does not change how bytes read from a PDF are decoded):

# -*- coding: utf-8 -*-

If you want to use only latin1, add this instead:

# -*- coding: latin1 -*-

Coming back to the point: as I said, you probably want to convert any encoding to the current system encoding. The script looks like this.

Add this to the top of your script:

import sys
import cchardet
  

If you do not have the cchardet module, install it first (it is distributed on PyPI under the name cchardet).

Then create this function:

def str_decode(data):
    # Default codec of the current Python runtime (the "default" codec)
    defaultcodec = sys.getdefaultencoding().lower()

    # cchardet may return None when it cannot guess; fall back to the default
    codec = cchardet.detect(data)['encoding'] or defaultcodec

    # In Python 3 the bytes must always be decoded to get a str,
    # even when the detected codec matches the default one
    return data.decode(codec)

The calling code should then look like this:

dados_recuperados = doc.info[0]
author = str_decode(dados_recuperados.get("Author"))
subject = str_decode(dados_recuperados.get("Subject"))
creation_date = str_decode(dados_recuperados.get("CreationDate"))
mod_date = str_decode(dados_recuperados.get("ModDate"))
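One thing to watch with the calls above: if a key is missing from a PDF, .get returns None and the decode call fails with AttributeError. The following is only a minimal sketch of one way to guard against that; decode_fields and its parameters are hypothetical names, the decode argument stands for whatever decoding function you use (for example str_decode), and a plain dict stands in for doc.info[0].

```python
def decode_fields(info, keys, decode):
    """Return {key: decoded text or None} for the requested metadata keys."""
    result = {}
    for key in keys:
        raw = info.get(key)
        # Missing keys stay None instead of raising AttributeError
        result[key] = decode(raw) if raw is not None else None
    return result


# A plain dict standing in for doc.info[0] (values copied from the question)
sample = {
    "Author": b"Ivanete de Araujo Costa",
    "Subject": b"EMD ADI - Emenda Aditiva",
}
fields = decode_fields(
    sample,
    ["Author", "Subject", "ModDate"],
    lambda raw: raw.decode("utf-8"),  # placeholder decoder for the demo
)
print(fields)
```

With more than 1,000 files, collecting the results in a dict like this also makes it easier to write them out later in one go.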

Note that in str_decode the value is obtained by calling .decode(codec) on the bytes. That should work fine, but there is no guarantee that a PDF document uses a single codec, or that the detected codec is always correct. Some documents may still be problematic, although that depends on your data.
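As an alternative to guessing, note that every value that fails in the question starts with b'\xfe\xff', which is the UTF-16 big-endian byte order mark. PDF text strings that begin with those two bytes are UTF-16BE, so you can test for the prefix before decoding. This is a minimal sketch under assumptions, not a definitive implementation: decode_pdf_string is a hypothetical name, and the Latin-1 fallback is only an approximation of the PDFDocEncoding that PDFs use for strings without the mark.

```python
import codecs


def decode_pdf_string(raw):
    """Decode a PDF metadata value (bytes) to str; hypothetical helper."""
    if raw is None:
        return None
    if raw.startswith(codecs.BOM_UTF16_BE):
        # Values starting with FE FF are UTF-16 big-endian; drop the mark
        return raw[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: fall back to Latin-1, which accepts any byte value
        return raw.decode("latin-1")


# Values copied from the question's output
title = decode_pdf_string(b"\xfe\xff\x00C\x00O\x00M\x00I\x00S\x00S\x00\xc3\x00O")
author = decode_pdf_string(b"Ivanete de Araujo Costa")
print(title, "|", author)
```

This needs no third-party module and does not depend on statistical detection, which can be unreliable on very short strings.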

If you have settled on a fixed codec, you can change that line of the function to:

# Sets the codec to use as the "default" codec
defaultcodec = xxxxxxx

Here xxxxxxx would be the codec you want as the default.

    
15.09.2017 / 21:48