I'm trying to list a folder (files and subfolders) in Python 2.7 on Windows XP, and I'm having problems with files whose names contain accents. I know that os.listdir
behaves differently depending on whether its argument is a byte string (str) or a unicode string. My problem is that I have file names encoded in different ways:
>>> import os
>>> os.listdir('teste')
['a\xb4rvore.jpg']
>>> os.listdir(u'teste')
[u'a\u0301rvore.jpg']
>>> os.listdir('teste2')
['\xe1rvore.txt']
>>> os.listdir(u'teste2')
[u'\xe1rvore.txt']
In Windows Explorer, both files look normal: árvore.jpg and árvore.txt. But while the second is listed normally, the first one raises an error no matter how I access it:
def imprimir(pasta):
    print pasta
    for x in os.listdir(pasta):
        sub = os.path.join(pasta, x)
        if os.path.isfile(sub):
            print sub
        else:
            imprimir(sub)
>>> imprimir('teste2')
teste2
teste2\ßrvore.txt
>>> imprimir(u'teste2')
teste2
teste2\árvore.txt
>>> imprimir('teste')
teste
teste\a┤rvore.jpg
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "teste.py", line 11, in imprimir
    imprimir(sub)
  File "teste.py", line 6, in imprimir
    for x in os.listdir(pasta):
WindowsError: [Error 3] O sistema nÒo pode encontrar o caminho especificado: 'teste\a\xb4rvore.jpg/*.*'
>>> imprimir(u'teste')
teste
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "teste.py", line 9, in imprimir
    print sub
  File "C:\Python27\lib\encodings\cp850.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u0301' in position 7: character maps to <undefined>
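This second traceback comes from print rather than from os.listdir: cp850 apparently has no mapping for U+0301. Below is a minimal sketch that reproduces the encoding step without touching the filesystem and escapes whatever the console cannot show. The cp850 encoding is taken from the traceback above, and safe_repr is a hypothetical helper name I made up for illustration:

```python
# -*- coding: utf-8 -*-

def safe_repr(path, encoding='cp850'):
    """Return a version of a unicode path that `encoding` can represent.

    Characters the codec cannot encode are replaced by backslash escapes
    (for example the combining accent U+0301 becomes the text \\u0301).
    The default 'cp850' is an assumption based on the traceback.
    """
    return path.encode(encoding, 'backslashreplace').decode(encoding)


# Name as returned by os.listdir(u'teste'): 'a' followed by U+0301.
decomposed_name = u'a\u0301rvore.jpg'
# Name as returned by os.listdir(u'teste2'): precomposed U+00E1.
composed_name = u'\xe1rvore.txt'

# The strict encode is what print attempts, and it fails for U+0301.
try:
    decomposed_name.encode('cp850')
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# With escaping, the same name becomes printable on that console.
printable = safe_repr(decomposed_name)
```

The precomposed á exists in cp850, so safe_repr leaves composed_name unchanged and only the combining accent is escaped.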
How do I access this other file? I don't think its name is corrupted, because a\u0301 (the letter a followed by U+0301, the combining acute accent) is a valid way to produce á. However, I don't know how to access it, and I have a volume with several files named this way. I can avoid producing similar files in the future, but I still need to process the existing ones, and converting them by hand is impractical.
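To back up the claim that the name is valid, here is a small check using the standard unicodedata module. It suggests the two names differ only in Unicode normalization form (decomposed versus precomposed). The helper names nfc and names_not_in_nfc are hypothetical, made up for this sketch, and the sample names are the ones from the listings above:

```python
# -*- coding: utf-8 -*-
import unicodedata


def nfc(name):
    """Return the precomposed (NFC) form of a unicode file name."""
    return unicodedata.normalize('NFC', name)


def names_not_in_nfc(names):
    """Return the names that would change if normalized to NFC."""
    return [n for n in names if nfc(n) != n]


decomposed = u'a\u0301rvore.jpg'   # as listed in 'teste'
composed = u'\xe1rvore.txt'        # as listed in 'teste2'

# 'a' + U+0301 composes to the single code point U+00E1.
normalized = nfc(decomposed)

# Only the decomposed name is reported as needing normalization.
affected = names_not_in_nfc([decomposed, composed])
```

Something like names_not_in_nfc(os.listdir(u'teste')) could then be used to find the affected files on the volume, assuming the folder is always listed with a unicode argument.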