How to read a .csv file and write it to another .csv file without stopwords using Python

Question

I need to read a .csv file and write its contents to another .csv file with the stopwords removed, using Python.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from string import punctuation
import csv

# Read the whole input file into a single string
with open('arquivo_sujo.csv', 'r') as arquivo:
    texto = arquivo.read()

with open('arquivo_limpo.csv', 'w') as csvfile:
    # Tokenize the entire text at once
    palavras = word_tokenize(texto.lower())

    # Portuguese stopwords plus punctuation marks
    stop_words = set(stopwords.words('portuguese') + list(punctuation))
    palavras_sem_stopwords = [palavra for palavra in palavras if palavra not in stop_words]

    escrita = csv.writer(csvfile, delimiter=' ')
    escrita.writerows(palavras_sem_stopwords)

Changing writerow to writerows solved the initial problem. But how do I get the new file in the same format as the original? Each line of the output contains a single word instead of the whole sentence.

python csv nltk

asked by anonymous 16.06.2018 / 00:28

1 answer

score 0 · Accepted Answer

I believe the main problem is that you are reading and tokenizing the entire contents of the file at once, when what you want is to process it sentence by sentence. To fix this, read each line of the input file separately:

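from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Use a different name so the imported stopwords module is not shadowed
stop_words = set(stopwords.words('portuguese') + list(punctuation))

with open('arquivo_sujo.csv') as stream_input, open('arquivo_limpo.csv', 'w') as stream_output:
    for phrase in stream_input:
        # Tokenize one line at a time so each sentence stays on its own line
        words = word_tokenize(phrase.lower())
        without_stopwords = [word for word in words if word not in stop_words]
        stream_output.write(' '.join(without_stopwords) + '\n')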

This way, each line of the input file is processed separately and written to the output file without stopwords. Since the output format is simple, I do not see a need for the csv module here; join already handles it well.
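If the input turns out to have several columns rather than one sentence per line, the csv module may be worth bringing back so that the columns survive the cleaning. The following is only a rough sketch under a few assumptions: the tiny STOP_WORDS set and the regex tokenizer are hypothetical stand-ins for the NLTK stopword list and word_tokenize (so the example runs without downloading any corpus), and the names clean_cell and clean_csv are made up for illustration.

```python
import csv
import io
import re

# Hypothetical stand-in for NLTK's Portuguese stopword list, kept tiny so the
# sketch runs without downloading any corpus.
STOP_WORDS = {"a", "o", "e", "de", "do", "da", "que", "em", "um", "uma"}


def clean_cell(text, stop_words=STOP_WORDS):
    """Lowercase a cell, drop punctuation and stopwords, rejoin with spaces."""
    # \w+ keeps runs of letters and digits (accented ones too) and skips punctuation.
    words = re.findall(r"\w+", text.lower())
    return " ".join(word for word in words if word not in stop_words)


def clean_csv(stream_input, stream_output):
    """Copy a CSV row by row, cleaning every cell but keeping the columns."""
    writer = csv.writer(stream_output, lineterminator="\n")
    for row in csv.reader(stream_input):
        writer.writerow([clean_cell(cell) for cell in row])


if __name__ == "__main__":
    # In-memory streams stand in for arquivo_sujo.csv and arquivo_limpo.csv.
    source = io.StringIO('1,"O gato, que dorme em casa"\n2,Um dia de sol\n')
    target = io.StringIO()
    clean_csv(source, target)
    print(target.getvalue(), end="")
```

With real files, the two streams would come from open('arquivo_sujo.csv', newline='') and open('arquivo_limpo.csv', 'w', newline=''), as the csv documentation recommends. For a file with one sentence per line, the join approach above remains the simpler choice.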