Problem accessing a file with an accented name (combining character)

11

I'm trying to list a folder (files and subfolders) in Python [2.7, on Windows XP], and am having problems with files whose names contain accents. I know that os.listdir behaves differently depending on whether its argument is a byte string (str) or a unicode string. My problem is that I have file names encoded in different ways:

>>> import os
>>> os.listdir('teste')
['a\xb4rvore.jpg']
>>> os.listdir(u'teste')
[u'a\u0301rvore.jpg']
>>> os.listdir('teste2')
['\xe1rvore.txt']
>>> os.listdir(u'teste2')
[u'\xe1rvore.txt']

In Windows Explorer, both files look normal: árvore.jpg and árvore.txt. But while the second is listed normally, the first raises an error no matter how I access it:

def imprimir(pasta):
    print pasta
    for x in os.listdir(pasta):
        sub = os.path.join(pasta, x)
        if os.path.isfile(sub):
            print sub
        else:
            imprimir(sub)

>>> imprimir('teste2')
teste2
teste2\ßrvore.txt
>>> imprimir(u'teste2')
teste2
teste2\árvore.txt

>>> imprimir('teste')
teste
teste\a┤rvore.jpg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "teste.py", line 11, in imprimir
    imprimir(sub)
  File "teste.py", line 6, in imprimir
    for x in os.listdir(pasta):
WindowsError: [Error 3] O sistema nÒo pode encontrar o caminho especificado: 'teste\a\xb4rvore.jpg/*.*'
>>> imprimir(u'teste')
teste
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "teste.py", line 9, in imprimir
    print sub
  File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0301' in position 7: character maps to <undefined>

How do I access this other file? I do not think its name is corrupted, because a\u0301 is a valid way to produce á. However, I do not know how to access it, and I have a volume with several files in that format. I can avoid producing similar files in the future, but I still need to process the existing ones, and converting them by hand is impractical.

asked by anonymous 31.12.2013 / 01:49

3 answers

4

Based on the answer from @Luiz Vieira, and on this question on the English-language Stack Overflow, I was able to find a solution. The problem was not in accessing the file itself, but only in printing its name on the screen. The code below, for example, works normally:

    if os.path.isfile(sub):
        with open(sub, 'rb') as f:
            with open(sub + u'.saida', 'wb') as s:
                s.write(f.read()) # Creates an exact copy of the original file
    ...
imprimir(u'teste') # Caution: only the unicode version works; the other one gives the same error

However, my IDLE is using the cp850 encoding, which apparently cannot print combining characters correctly. The way out, therefore, is to normalize the file name so that the character pair is represented by a single precomposed character (\u00e1):

import os
import unicodedata

def imprimir(pasta):
    print unicodedata.normalize('NFC', pasta)
    for x in os.listdir(pasta):
        sub = os.path.join(pasta, x)
        if os.path.isfile(sub):
            print unicodedata.normalize('NFC', sub)
        else:
            imprimir(sub)

>>> imprimir(u'teste')
teste
teste\árvore.jpg
>>> imprimir(u'teste2')
teste2
teste2\árvore.txt
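To make the normalization step concrete, here is a minimal sketch of what NFC and NFD do to the two spellings of á. The variable names are mine and purely illustrative; only unicodedata.normalize comes from the standard library.

```python
# -*- coding: utf-8 -*-
import unicodedata

# 'a' followed by U+0301 COMBINING ACUTE ACCENT, as listed in 'teste'
decomposed = u'a\u0301rvore.jpg'
# The same visible name, written with the single code point U+00E1
precomposed = u'\xe1rvore.jpg'

# NFC composes base letter + combining mark into one code point when it can
nfc = unicodedata.normalize('NFC', decomposed)
# NFD does the opposite: it splits precomposed characters apart
nfd = unicodedata.normalize('NFD', precomposed)

same_after_nfc = (nfc == precomposed)
same_after_nfd = (nfd == decomposed)
# The decomposed form is one code point longer than the precomposed one
lengths = (len(decomposed), len(precomposed))
```

Note that in the function above the normalized string is used only for display. The name actually stored on disk is still the decomposed one, so I keep passing the original, unnormalized path to os.listdir and open.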
    
31.12.2013 / 19:13
5

To me this looks like just an encoding problem when printing (calling print) the file and folder names.

Try using UTF-8 encoding by changing your print calls as follows (note the addition of .encode('utf8') at the end of the lines that call print):

def imprimir(pasta):
    print pasta.encode('utf8')
    for x in os.listdir(pasta):
        sub = os.path.join(pasta, x)
        if os.path.isfile(sub):
            print sub.encode('utf8')
        else:
            imprimir(sub)
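As a rough sketch of what .encode('utf8') hands to print, the snippet below encodes both spellings of the name and then decodes the bytes as cp850, which is only my assumption of what a cp850 console would do with them. The variable names are illustrative.

```python
# -*- coding: utf-8 -*-
decomposed = u'a\u0301rvore.jpg'   # 'a' + combining acute accent
precomposed = u'\xe1rvore.jpg'     # single code point U+00E1

# UTF-8 can represent both forms, so encoding never raises here
utf8_decomposed = decomposed.encode('utf8')
utf8_precomposed = precomposed.encode('utf8')

# Assumption: a cp850 console interprets the raw bytes with its own code page
shown_decomposed = utf8_decomposed.decode('cp850')
shown_precomposed = utf8_precomposed.decode('cp850')

# True when the console would show the name as it was written
renders_correctly = (shown_decomposed == decomposed,
                     shown_precomposed == precomposed)
```

So this avoids the UnicodeEncodeError, but whether the accents display properly depends on the encoding the console itself uses.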

EDIT: After rereading your question, I think I understand another part of the problem. You are using IDLE to test interactively, but IDLE uses a different encoding (in my test here I created a UTF-8 encoded .py file, so I did not have the same problem). To check the encoding IDLE uses, do the following:

>>> import sys
>>> sys.stdout.encoding
'cp1252'

So, to display the file names correctly, you must use this same encoding (or change the default encoding in IDLE, which I honestly do not know how to do). I ran the test here, and with the cp1252 encoding the names are displayed correctly:

>>> def imprimir(pasta):
    print pasta.encode('cp1252')
    for x in os.listdir(pasta):
        sub = os.path.join(pasta, x)
        if os.path.isfile(sub):
            print sub.encode('cp1252')
        else:
            imprimir(sub)

>>> imprimir(u'teste')
teste
teste\árvore.jpg
>>> imprimir(u'teste2')
teste2
teste2\árvore.txt
>>> 
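If the output encoding is not known in advance, one possible approach is to compose the name first and then let the codec replace whatever it still cannot represent. This is only a sketch: printable_name is a hypothetical helper of mine, not something from the standard library.

```python
# -*- coding: utf-8 -*-
import unicodedata

def printable_name(name, encoding):
    """Hypothetical helper: return `name` as text that `encoding` can represent."""
    # Compose 'a' + U+0301 into U+00E1 where a precomposed character exists
    composed = unicodedata.normalize('NFC', name)
    # Round-trip through the target encoding; unencodable characters become '?'
    return composed.encode(encoding, 'replace').decode(encoding)

# Encoding the decomposed form directly fails in cp850
try:
    u'a\u0301'.encode('cp850')
    raw_encodable = True
except UnicodeEncodeError:
    raw_encodable = False

safe_cp850 = printable_name(u'teste\\a\u0301rvore.jpg', 'cp850')
safe_cp1252 = printable_name(u'teste\\a\u0301rvore.jpg', 'cp1252')
# 'x' has no precomposed acute form, so the accent is replaced rather than raising
fallback = printable_name(u'x\u0301', 'cp850')
```

Inside imprimir this could be used as print(printable_name(sub, sys.stdout.encoding or 'ascii')), still passing the original sub to the os functions.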
    
31.12.2013 / 15:29
0

I had a similar problem with accented characters in Python. Maybe this will solve it:

import sys
reload(sys)  # Python 2 only: site.py removes setdefaultencoding at startup; reload restores it
sys.setdefaultencoding('utf-8') # or Latin1 or cp1252
    
29.01.2014 / 22:59