Good morning. I'm trying to develop a simple algorithm in Python for removing stop words from text, but I'm having problems with words that have accents.
The code is as follows:
import io
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from unicodedata import normalize
import sys

reload(sys)
sys.setdefaultencoding('utf8')

stop_words = set(stopwords.words('portuguese'))

file1 = open("C:\Users\Desktop\Teste.txt")
print("File read!")
line = file1.read()
palavras = line.split()

# Convert the words to lowercase
palavras = [palavra.lower() for palavra in palavras]
print("Running!")

for r in palavras:
    if r not in stop_words:
        appendFile = open('textofiltrado.txt', 'a')
        appendFile.writelines(" " + r)
        appendFile.close()

print("Done!")
When running the code with the following test file:
E É Á A O Ó U Ú
I get this output:
É Á Ó Ú
That is, it does not recognize words that have an accent. Using setdefaultencoding with UTF-8 did not work. Does anyone know of a solution to this problem?
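For context, here is a minimal sketch of one likely fix: the accented words fail to match because the file is read as raw bytes, so they never compare equal to the (decoded) stop words. Reading the file with an explicit UTF-8 encoding and normalizing each word to NFC form should make the comparison work. The sketch below assumes Python 3 and uses a small hand-written stop list in place of NLTK's Portuguese corpus (with NLTK you would use `set(stopwords.words('portuguese'))` instead); the commented-out file path is the question's example.

```python
import io
import unicodedata

# Small hand-written stop list standing in for NLTK's Portuguese corpus.
stop_words = {"e", "é", "a", "á", "o", "ó", "u", "ú"}

def filter_stop_words(text, stop_words):
    """Return the words of `text` that are not in `stop_words`.

    Normalizing to NFC ensures a precomposed 'é' and a decomposed
    'e' + combining accent compare as equal.
    """
    words = [unicodedata.normalize("NFC", w.lower()) for w in text.split()]
    return [w for w in words if w not in stop_words]

# The key change: decode the bytes on disk up front with an explicit
# encoding, so every comparison happens between decoded strings.
# with io.open(r"C:\Users\Desktop\Teste.txt", encoding="utf-8") as f:
#     text = f.read()

text = "E É Á A O Ó U Ú casa"
print(" ".join(filter_stop_words(text, stop_words)))  # prints: casa
```

On Python 2, `io.open` with `encoding="utf-8"` gives the same behavior (yielding `unicode` objects), which is a cleaner route than the `reload(sys)` / `setdefaultencoding` hack.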