I need to read a .csv file and write its contents to another .csv file without stopwords, using Python

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
import csv


texto = open('arquivo_sujo.csv', 'r').read()

with open('arquivo_limpo.csv', 'w') as csvfile:
    palavras = word_tokenize(texto.lower())

    stopwords = set(stopwords.words('portuguese') + list(punctuation))
    palavras_sem_stopwords = [palavra for palavra in palavras if palavra not in stopwords]

    escrita = csv.writer(csvfile, delimiter=' ')
    escrita.writerows(palavras_sem_stopwords)

Changing writerow to writerows solved the original problem. But how do I get the new file in the same format as the input? Each line now contains a single word instead of the whole sentence.

    
asked by anonymous 16.06.2018 / 00:28

1 answer


I believe the main problem is that you read and tokenize the entire contents of the file at once, whereas you want to process it sentence by sentence. To fix this, read each line of the input file separately:

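from string import punctuation

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Use a separate name so the imported `stopwords` object is not overwritten
stop_words = set(stopwords.words('portuguese') + list(punctuation))

with open('arquivo_sujo.csv') as stream_input, open('arquivo_limpo.csv', 'w') as stream_output:
    for phrase in stream_input:
        words = word_tokenize(phrase.lower())
        without_stopwords = [word for word in words if word not in stop_words]
        stream_output.write(' '.join(without_stopwords) + '\n')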

This way, each line of the input file is processed separately and written to the output file without stopwords. Since the output format is simple, I do not see the need for the csv module; join already handles it well.
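If you want to try the line-by-line idea without touching real files or downloading the NLTK data, here is a minimal sketch that runs on in-memory streams. Note the assumptions: STOP_WORDS is a tiny hypothetical list, tokenize is a simplified regex stand-in for word_tokenize, and clean_lines is just a name I picked for illustration. None of these come from NLTK, and the real tokenizer may split some text differently.

```python
import io
import re

# Hypothetical stand-in for set(stopwords.words('portuguese') + list(punctuation));
# the real NLTK list is much longer.
STOP_WORDS = {"a", "o", "de", "e", "que", "é", ".", ",", "!", "?"}


def tokenize(line):
    # Simplified stand-in for nltk.tokenize.word_tokenize:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line.lower())


def clean_lines(stream_input, stream_output, stop_words=STOP_WORDS):
    # Handle one input line at a time so each sentence stays on its own line.
    for line in stream_input:
        kept = [tok for tok in tokenize(line) if tok not in stop_words]
        stream_output.write(' '.join(kept) + '\n')


# In-memory files stand in for arquivo_sujo.csv and arquivo_limpo.csv.
source = io.StringIO("O gato é preto.\nA casa de Maria é grande!\n")
target = io.StringIO()
clean_lines(source, target)
print(target.getvalue())
```

The output keeps one line per input line, which is the format asked for. If you still prefer the csv module, keep in mind that writerow takes one list of fields and writes a single row, while writerows takes a list of rows, so passing it a flat list of words writes one row per word.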

    
16.06.2018 / 00:55