Opening a set of texts in Python to apply functions (len, set, collocations, etc.): UnicodeDecodeError

>>> import nltk    
>>> from nltk.corpus import PlaintextCorpusReader  
>>> meucorpus='C:\Users\dudu\Desktop\Artigos sem acentos'   
>>> meustextos=PlaintextCorpusReader(meucorpus,'.*')  
>>> meustextos.words()  

Traceback (most recent call last):  
  File "<pyshell#4>", line 1, in <module>  
    meustextos.words()  
  File "C:\Python27\lib\site-packages\nltk\compat.py", line 498, in wrapper  
    return method(self).encode('ascii', 'backslashreplace')  
  File "C:\Python27\lib\site-packages\nltk\util.py", line 664, in __repr__  
    for elt in self:  
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\util.py", line 394, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "C:\Python27\lib\site-packages\nltk\corpus\reader\plaintext.py", line 117, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))  
  File "C:\Python27\lib\site-packages\nltk\data.py", line 1102, in readline  
    new_chars = self._read(readsize)  
  File "C:\Python27\lib\site-packages\nltk\data.py", line 1329, in _read  
    chars, bytes_decoded = self._incr_decode(bytes)  
  File "C:\Python27\lib\site-packages\nltk\data.py", line 1359, in _incr_decode
    return self.decode(bytes, 'strict')  
  File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode  
    return codecs.utf_8_decode(input, errors, True)  
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
    
asked by anonymous 09.08.2015 / 04:40

1 answer


As your error message says, you have a file that is not valid UTF-8 text. That is, even though your directory is named "no accents" ("sem acentos"), there are accented characters in there. Since the accents are not encoded in the universal UTF-8, there is a good chance the encoding is latin1, which is the one used by Windows in Brazil, in the GUI.
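Before picking an encoding, it may help to see which files actually fail. The sketch below is not part of NLTK; `find_non_utf8_files` is a hypothetical helper that lists every file in a directory that does not decode as strict UTF-8, together with its first bytes. Note that the pattern `'.*'` matches every file in the folder, so a non-text file could also be the culprit, and a file that starts with the bytes `ff fe` carries a UTF-16 byte order mark rather than latin1 text.

```python
import os


def find_non_utf8_files(directory):
    """Return {filename: first 4 bytes} for files that are not valid UTF-8.

    Hypothetical helper for diagnosis only; it reads each file fully
    into memory, which is fine for a folder of articles.
    """
    bad = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, "rb") as handle:
            raw = handle.read()
        try:
            raw.decode("utf-8")  # strict decoding, like the corpus reader
        except UnicodeDecodeError:
            bad[name] = raw[:4]
    return bad
```

Calling `find_non_utf8_files(meucorpus)` with the folder from the question would return the offending file names, which narrows the problem down before changing how the whole corpus is read.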

The call to PlaintextCorpusReader supports an optional argument with the encoding of the text files. Change it to:

>>> meustextos = PlaintextCorpusReader(meucorpus, '.*', encoding='latin1')

and the UnicodeDecodeError will disappear. However, the absence of a visible error does not mean the text was read correctly: unless all of your text files are in latin1, some of them can be read incorrectly. If you have UTF-8 files mixed into the directory, for example, their accented characters will be read as junk, and you will have problems with your data.
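To make the "junk" concrete, here is a minimal sketch using only built-in string methods. The sample word is just an illustration. It shows that decoding UTF-8 bytes as latin1 never raises, because latin1 assigns a character to every byte value, while the reverse mix-up fails loudly as in the question.

```python
# The Portuguese word "acao" written with c-cedilla and a-tilde,
# spelled with escapes so the source file stays pure ASCII.
word = u"a\u00e7\u00e3o"

utf8_bytes = word.encode("utf-8")      # each accented letter takes 2 bytes
latin1_bytes = word.encode("latin1")   # each accented letter takes 1 byte

# latin1 maps every possible byte to a character, so this never raises...
misread = utf8_bytes.decode("latin1")
# ...but each accented letter comes back as two unrelated characters.
print(repr(misread))

# The opposite mix-up is the one that fails with UnicodeDecodeError.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)
```

The silent case is the dangerous one for corpus statistics: the misread word is longer than the original and no longer matches the correctly read copies of the same word.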

If this occurs, you will have to standardize the articles so that they all use a single text encoding.
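One possible way to do that standardization is sketched below, under stated assumptions: every file is UTF-8, UTF-16 with a byte order mark, or latin1, and nothing else (UTF-32 and other code pages are not handled). The names `guess_decode` and `convert_dir_to_utf8` are hypothetical helpers, not NLTK functions. The latin1 fallback accepts any bytes, so the returned report should be reviewed by hand.

```python
import codecs
import io
import os


def guess_decode(raw):
    """Decode bytes, returning (text, encoding_used).

    Assumption: the data is UTF-8, BOM-marked UTF-16, or latin1.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):].decode("utf-8"), "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return raw.decode("utf-16"), "utf-16"
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Last resort: latin1 never fails, so it may hide a wrong guess.
        return raw.decode("latin1"), "latin1"


def convert_dir_to_utf8(src_dir, dst_dir):
    """Write a UTF-8 copy of every file in src_dir into dst_dir.

    Returns {filename: encoding that was guessed} for manual review.
    """
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    report = {}
    for name in sorted(os.listdir(src_dir)):
        src_path = os.path.join(src_dir, name)
        if not os.path.isfile(src_path):
            continue
        with open(src_path, "rb") as handle:
            text, encoding = guess_decode(handle.read())
        # newline="" keeps the original line endings untouched.
        out_path = os.path.join(dst_dir, name)
        with io.open(out_path, "w", encoding="utf-8", newline="") as handle:
            handle.write(text)
        report[name] = encoding
    return report
```

After converting into a separate folder, the corpus can be opened with `PlaintextCorpusReader(new_folder, '.*')`, since the traceback in the question shows the reader decoding as UTF-8 when no `encoding` is given. Writing to a new folder rather than overwriting keeps the originals available in case a guess turns out to be wrong.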

    
13.08.2015 / 05:08