Create a frequency table based on another column in Python


Good afternoon, guys.

I have a data set in a .csv file containing two columns, tweets and classification, where 'tweets' holds a tweet returned by a Twitter search and 'classification' is either 'positive' or 'negative'.

I want to build a word-by-word frequency table, where each row contains a distinct word and the classification of that word in the sentence.

Does numpy or nltk have a function that does this?

I'm trying to write two loops, one to go through the rows and another to go through each row word by word, but I'm not sure which data structure to use for the frequency table or what the algorithm would look like.

So far I have this:

import nltk
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation
from nltk.probability import FreqDist
import numpy as np

# read the file
dataset = pd.read_csv('tweets.csv')

# get the Portuguese stop words and drop the word 'não' to avoid contradictions
stopwords = set(stopwords.words('portuguese') + list(punctuation))
stopwords = {x for x in stopwords if x != 'não'}

# separate the tweets from the classes
tweets = dataset['Text'].values
classes = dataset['Classificacao'].values

for tweet in tweets:
    for palavra in tweet:
        print(palavra)

The idea was for the loop to go word by word, but it is printing letter by letter, and I don't understand why.

I know it's not what I want yet, but it's a start.

Any help would be welcome, thanks.

    
asked by anonymous 09.06.2018 / 18:59

2 answers


The table you want to compute is called a histogram.

Here is code that computes a histogram from a .csv file:

import csv
import string
from collections import Counter

palavras = []

with open('tweets.csv' ) as arqcsv:
    leitor = csv.reader( arqcsv, delimiter=';')
    for linha in leitor:
        palavras += [ palavra.strip( string.punctuation ) for palavra in linha[0].lower().split() ]

cnt = Counter( palavras )

for palavra, frequencia in sorted(cnt.items(), key=lambda i: i[1], reverse=True):
    print( '{} : {}'.format(palavra,frequencia) )
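
As a side note, the sorting step can probably be shortened: `Counter.most_common()` returns the (word, count) pairs already ordered from the highest count to the lowest. A minimal sketch, using a made-up word list in place of the words read from the file:

```python
from collections import Counter

# Toy word list standing in for the words read from the CSV (made-up data).
palavras = ['nec', 'sed', 'nec', 'amet', 'sed', 'nec']

cnt = Counter(palavras)

# most_common() yields (word, count) pairs sorted by count, highest first.
tabela = cnt.most_common()

for palavra, frequencia in tabela:
    print('{} : {}'.format(palavra, frequencia))
```

Passing a number, as in `cnt.most_common(10)`, limits the result to that many rows.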

Test file (tweets.csv):

Lorem ipsum dolor sit amet, consectetur adipiscing elit.;Positivo
Pellentesque scelerisque odio rutrum nunc facilisis convallis.;Positivo
Maecenas luctus luctus purus interdum venenatis.;Positivo
Nulla elementum id purus nec interdum.;Positivo
Sed malesuada nec est id convallis.;Positivo
Vivamus non facilisis mauris.;Negativo
Nullam lacinia massa libero, in vulputate nisi faucibus et.;Negativo
Mauris maximus justo vel suscipit consequat.;Negativo
Morbi sit amet neque rutrum, semper ante aliquam, egestas enim.;Positivo
Integer eget mauris faucibus, efficitur odio nec, accumsan justo.;Positivo
Sed tristique felis risus, quis tristique dolor tempor ut.;Positivo
Etiam vel magna augue.;Negativo
Quisque blandit, elit nec sollicitudin rhoncus, lectus congue lacus.;Positivo
Donec sit amet enim vel leo gravida malesuada vitae sed tortor.;Positivo
Morbi in maximus ex, vitae pharetra tellus.;Negativo
Orci varius natoque penatibus et magnis dis parturient montes.;Positivo
Nascetur ridiculus mus. Etiam at felis pharetra, porta risus sed.;Negativo

Output:

nec : 4
sed : 4
mauris : 3
vel : 3
sit : 3
amet : 3
risus : 2
interdum : 2
justo : 2
purus : 2
in : 2
dolor : 2
et : 2
etiam : 2
id : 2
felis : 2
facilisis : 2
pharetra : 2
rutrum : 2
elit : 2
tristique : 2
vitae : 2
malesuada : 2
maximus : 2
faucibus : 2
morbi : 2
enim : 2
odio : 2
convallis : 2
luctus : 2
ipsum : 1
leo : 1
efficitur : 1
augue : 1
vivamus : 1
orci : 1
maecenas : 1
ut : 1
donec : 1
semper : 1
nunc : 1
ante : 1
ex : 1
tellus : 1
egestas : 1
massa : 1
aliquam : 1
gravida : 1
porta : 1
magna : 1
pellentesque : 1
nulla : 1
quisque : 1
parturient : 1
mus : 1
rhoncus : 1
scelerisque : 1
consectetur : 1
sollicitudin : 1
at : 1
suscipit : 1
non : 1
blandit : 1
est : 1
accumsan : 1
nisi : 1
adipiscing : 1
magnis : 1
varius : 1
natoque : 1
consequat : 1
ridiculus : 1
eget : 1
elementum : 1
montes : 1
integer : 1
libero : 1
lacinia : 1
neque : 1
tempor : 1
nullam : 1
dis : 1
vulputate : 1
lectus : 1
nascetur : 1
venenatis : 1
tortor : 1
quis : 1
penatibus : 1
lorem : 1
lacus : 1
congue : 1
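
About the letter-by-letter output in the question: each element of `tweets` is a string, and iterating over a string in Python yields one character at a time, so the inner loop needs a list of words to iterate over. A minimal sketch with a made-up tweet, using `str.split()` as in the code above (`word_tokenize`, which the question already imports, would be an alternative if punctuation should become separate tokens):

```python
import string

tweet = 'Lorem ipsum dolor sit amet.'  # made-up example text

# Iterating over the string itself walks through its characters.
letras = [c for c in tweet]

# Splitting on whitespace first gives one item per word.
palavras = [p.strip(string.punctuation) for p in tweet.lower().split()]

print(letras[:5])
print(palavras)
```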
    
10.06.2018 / 00:10

Complementing @Lacobus's answer: to get the classification of each word, you can count the positives and negatives separately, as follows:

import csv
import string
from collections import Counter

palavras = []
positivo = []
negativo = []

with open('tweets.csv' ) as arqcsv:
    leitor = csv.reader( arqcsv, delimiter=';')
    for linha in leitor:
        plinha = [palavra.strip( string.punctuation ) for palavra in linha[0].lower().split()]
        palavras += plinha
        if linha[1].lower() == 'positivo':
            positivo += plinha
        else:
            negativo += plinha

cntPalavras = Counter(palavras)
cntPositivo = Counter(positivo)
cntNegativo = Counter(negativo)


for palavra, frequencia in sorted(cntPalavras.items(), key=lambda i: i[1], reverse=True):
    pos = cntPositivo[palavra]
    neg = cntNegativo[palavra]
    print( '{} : [ f: {}, p: {}, n: {} ]'.format(palavra,frequencia, pos, neg) )

Using the same test CSV file produces the following output:

nec: [f: 4, p: 4, n: 0]
sed: [f: 4, p: 3, n: 1]
sit: [f: 3, p: 3, n: 0]
amet: [f: 3, p: 3, n: 0]
mauris: [f: 3, p: 1, n: 2]
vel: [f: 3, p: 1, n: 2]
dolor: [f: 2, p: 2, n: 0]
elit: [f: 2, p: 2, n: 0]
odio: [f: 2, p: 2, n: 0]
rutrum: [f: 2, p: 2, n: 0]
facilisis: [f: 2, p: 1, n: 1]
convallis: [f: 2, p: 2, n: 0]
luctus: [f: 2, p: 2, n: 0]
purus: [f: 2, p: 2, n: 0]
interdum: [f: 2, p: 2, n: 0]
id: [f: 2, p: 2, n: 0]
malesuada: [f: 2, p: 2, n: 0]
in: [f: 2, p: 0, n: 2]
faucibus: [f: 2, p: 1, n: 1]
et: [f: 2, p: 1, n: 1]
maximus: [f: 2, p: 0, n: 2]
justo: [f: 2, p: 1, n: 1]
morbi: [f: 2, p: 1, n: 1]
enim: [f: 2, p: 2, n: 0]
tristique: [f: 2, p: 2, n: 0]
felis: [f: 2, p: 1, n: 1]
risus: [f: 2, p: 1, n: 1]
etiam: [f: 2, p: 0, n: 2]
vitae: [f: 2, p: 1, n: 1]
pharetra: [f: 2, p: 0, n: 2]
lorem: [f: 1, p: 1, n: 0]
ipsum: [f: 1, p: 1, n: 0]
consectetur: [f: 1, p: 1, n: 0]
adipiscing: [f: 1, p: 1, n: 0]
pellentesque: [f: 1, p: 1, n: 0]
scelerisque: [f: 1, p: 1, n: 0]
nunc: [f: 1, p: 1, n: 0]
maecenas: [f: 1, p: 1, n: 0]
venenatis: [f: 1, p: 1, n: 0]
nulla: [f: 1, p: 1, n: 0]
elementum: [f: 1, p: 1, n: 0]
est: [f: 1, p: 1, n: 0]
vivamus: [f: 1, p: 0, n: 1]
non: [f: 1, p: 0, n: 1]
nullam: [f: 1, p: 0, n: 1]
lacinia: [f: 1, p: 0, n: 1]
massa: [f: 1, p: 0, n: 1]
libero: [f: 1, p: 0, n: 1]
vulputate: [f: 1, p: 0, n: 1]
nisi: [f: 1, p: 0, n: 1]
suscipit: [f: 1, p: 0, n: 1]
consequat: [f: 1, p: 0, n: 1]
neque: [f: 1, p: 1, n: 0]
semper: [f: 1, p: 1, n: 0]
ante: [f: 1, p: 1, n: 0]
aliquam: [f: 1, p: 1, n: 0]
egestas: [f: 1, p: 1, n: 0]
integer: [f: 1, p: 1, n: 0]
eget: [f: 1, p: 1, n: 0]
efficitur: [f: 1, p: 1, n: 0]
accumsan: [f: 1, p: 1, n: 0]
quis: [f: 1, p: 1, n: 0]
tempor: [f: 1, p: 1, n: 0]
ut: [f: 1, p: 1, n: 0]
magna: [f: 1, p: 0, n: 1]
augue: [f: 1, p: 0, n: 1]
quisque: [f: 1, p: 1, n: 0]
blandit: [f: 1, p: 1, n: 0]
sollicitudin: [f: 1, p: 1, n: 0]
rhoncus: [f: 1, p: 1, n: 0]
lectus: [f: 1, p: 1, n: 0]
congue: [f: 1, p: 1, n: 0]
lacus: [f: 1, p: 1, n: 0]
donec: [f: 1, p: 1, n: 0]
leo: [f: 1, p: 1, n: 0]
gravida: [f: 1, p: 1, n: 0]
tortor: [f: 1, p: 1, n: 0]
ex: [f: 1, p: 0, n: 1]
tellus: [f: 1, p: 0, n: 1]
orci: [f: 1, p: 1, n: 0]
varius: [f: 1, p: 1, n: 0]
natoque: [f: 1, p: 1, n: 0]
penatibus: [f: 1, p: 1, n: 0]
magnis: [f: 1, p: 1, n: 0]
dis: [f: 1, p: 1, n: 0]
parturient: [f: 1, p: 1, n: 0]
montes: [f: 1, p: 1, n: 0]
nascetur: [f: 1, p: 0, n: 1]
ridiculus: [f: 1, p: 0, n: 1]
mus: [f: 1, p: 0, n: 1]
at: [f: 1, p: 0, n: 1]
porta: [f: 1, p: 0, n: 1]
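
If more labels than 'positivo' and 'negativo' may show up, the separate lists could be replaced by one `Counter` per label. The sketch below is only one possible arrangement: it assumes the same `;`-separated layout as the test file, reads three of its rows from an in-memory sample instead of opening tweets.csv, and the `ignorar` set is a hypothetical stand-in for the stop-word set built in the question.

```python
import csv
import io
import string
from collections import Counter, defaultdict

# In-memory sample in the same "text;label" layout as tweets.csv.
amostra = io.StringIO(
    'Sed malesuada nec est id convallis.;Positivo\n'
    'Vivamus non facilisis mauris.;Negativo\n'
    'Etiam vel magna augue.;Negativo\n'
)

ignorar = {'id', 'est'}  # hypothetical stop words, for illustration only

# One Counter per label, created on first use.
contagem = defaultdict(Counter)

for linha in csv.reader(amostra, delimiter=';'):
    rotulo = linha[1].strip().lower()
    for palavra in linha[0].lower().split():
        palavra = palavra.strip(string.punctuation)
        if palavra and palavra not in ignorar:
            contagem[rotulo][palavra] += 1

# Totals over all labels, used only to order the rows.
total = sum(contagem.values(), Counter())

for palavra, frequencia in total.most_common():
    por_rotulo = {rotulo: c[palavra] for rotulo, c in contagem.items()}
    print('{} : {} {}'.format(palavra, frequencia, por_rotulo))
```

Looking up a word that never occurred under a label returns 0, so no label needs special handling when the rows are printed.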
    
27.07.2018 / 18:27