Problem applying a stopword filter to words with accents


Good morning. I'm trying to develop a simple algorithm in Python to remove stop words from a text, but I'm having problems with words that have accents.

The code is as follows:

import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unicodedata import normalize
import sys

reload(sys)
sys.setdefaultencoding('utf8')

stop_words = set(stopwords.words('portuguese'))
file1 = open("C:\Users\Desktop\Teste.txt")
print("Arquivo lido!")
line = file1.read()
palavras = line.split()
# Convert each word to lowercase
palavras = [palavra.lower() for palavra in palavras]
print("Rodando!")
for r in palavras:
    if r not in stop_words:
        appendFile = open('textofiltrado.txt','a')
        appendFile.writelines(" "+r)
        appendFile.close()

print("Concluido!")

When running the code with the following test file:

E É Á A O Ó U Ú

I get this output:

 É Á Ó Ú

That is, it does not recognize words that have accents. Using setdefaultencoding with UTF-8 did not work. Does anyone know of a solution to this problem?
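For illustration, a minimal sketch of the behaviour (assuming Python 2, where file1.read() returns the file contents as UTF-8 byte strings):

# -*- coding: utf-8 -*-
palavra = 'É'                           # one token from line.split(): a byte string
print(palavra.lower())                  # prints 'É': str.lower() skips non-ASCII bytes
print(palavra.lower() in set([u'é']))   # False, with set([u'é']) as a stand-in for stop_words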

    
asked by anonymous 27.06.2018 / 16:21

1 answer


Use

palavra.decode('utf-8').lower()

In Python 2, the words split from file1.read() are byte strings: str.lower() leaves accented characters unchanged, and a byte string never compares equal to the unicode strings that stopwords.words('portuguese') returns. Decoding to unicode first fixes both problems.
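Applied to the question's loop, a minimal sketch (my assumptions: the input file is UTF-8, the output file is opened once instead of per word, and each word is encoded back to UTF-8 on write so that sys.setdefaultencoding is not needed):

palavras = [palavra.decode('utf-8').lower() for palavra in palavras]

appendFile = open('textofiltrado.txt', 'a')
for r in palavras:
    if r not in stop_words:
        # r is unicode here; encode it back to UTF-8 bytes for writing
        appendFile.write(" " + r.encode('utf-8'))
appendFile.close()

In Python 3 none of this is necessary: open the file with open(path, encoding='utf-8') and every word is already a unicode str.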


    
answered 27.06.2018 / 17:31