Sort a dictionary and add computed values to it in Python

2

I have a text file that is preprocessed to normalize capitalization, and that part works correctly. The result is:

olá meu nome é meu nome pois eu olá
é meu nome walt não disney
olá

Then I have this function, which counts how often each word occurs (and it does that correctly). It should then sort the dataFreq list and calculate the probability of a given word appearing in the text, that is: frequenciaPalavra / totalPalavras

def countWordExact(dataClean):

    count = {}
    dataFreq = []
    global total

    for word in dataClean.splitlines():
        for word in word.split(" "):
            if word in count:
                count[word] += 1
            else:
                count[word] = 1
            total += 1

    dataFreq.append(count)

    freq = []

    for indice in sorted(count, key=count.get):
        #print(count[indice])
        freq.append((count[indice])/total)
    #print(freq)

    return dataFreq

My question is: how do I sort the dictionary (and consequently the list) and store in it the values resulting from the frequency calculation described above? For example:

[{'olá': 0.12, 'meu': 0.12, 'nome': 0.132, 'é': 0.12321, 'pois': 0.56, 'eu': 0.65, 'walt': 0.7, 'não': 0.7, 'disney': 0.5}]

(the frequency values above are not the correct ones; they only illustrate the format)

    
asked by anonymous 15.12.2018 / 17:35

1 answer

2

All the logic for counting word frequencies is already implemented natively in Python, in collections.Counter. The only thing you need to do is divide the number of times each word appears in the text by the total number of words:

from collections import Counter

texto = """
olá meu nome é meu nome pois eu olá
é meu nome walt não disney
olá
"""

palavras = texto.split()
frequencias = Counter(palavras)
# Counter({'olá': 3, 'meu': 3, 'nome': 3, 'é': 2, 'pois': 1, 'eu': 1, 'walt': 1, 'não': 1, 'disney': 1})
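
Since the question also asks about ordering, here is a minimal sketch of getting the counts in sorted order with Counter.most_common(). The name ordenadas is only an illustration and is not part of the original answer; words with equal counts keep the order in which they first appear (assuming Python 3.7 or later).

```python
from collections import Counter

texto = """
olá meu nome é meu nome pois eu olá
é meu nome walt não disney
olá
"""

frequencias = Counter(texto.split())

# most_common() returns a list of (word, count) pairs,
# ordered from the highest count to the lowest.
ordenadas = frequencias.most_common()

print(ordenadas[:4])
# [('olá', 3), ('meu', 3), ('nome', 3), ('é', 2)]
```

Passing a number, as in frequencias.most_common(3), returns only that many of the most frequent entries.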

To calculate the probability (relative frequency) of each word:

total = len(palavras)
probabilidades = {}

for palavra, frequencia in frequencias.items():
    probabilidades[palavra] = frequencia/total

print(probabilidades)

Resulting in:

{'olá': 0.1875, 'meu': 0.1875, 'nome': 0.1875, 'é': 0.125, 'pois': 0.0625, 'eu': 0.0625, 'walt': 0.0625, 'não': 0.0625, 'disney': 0.0625}

Or, more concisely, with a dict comprehension:

probabilidades = {palavra: frequencia/total for palavra, frequencia in frequencias.items()}
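
The question also asks for the result to be sorted and returned inside a list. One possible way to put the pieces together is sketched below. The function name word_probabilities and its descending parameter are hypothetical, chosen only for this example, and the sketch assumes Python 3.7 or later, where a dict keeps its insertion order.

```python
from collections import Counter


def word_probabilities(text, descending=True):
    """Return {word: probability}, ordered by probability."""
    words = text.split()
    total = len(words)
    counts = Counter(words)
    # Sort the (word, count) pairs by count; sorted() is stable,
    # so ties keep the order in which the words first appeared.
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=descending)
    # For an empty text there are no pairs, so no division by zero occurs.
    return {word: count / total for word, count in ordered}


texto = """
olá meu nome é meu nome pois eu olá
é meu nome walt não disney
olá
"""

resultado = word_probabilities(texto)

# Wrap the dict in a list to match the format shown in the question.
print([resultado])
```

Computing the total from len(words) inside the function avoids the global counter used in the question, so repeated calls do not accumulate counts from earlier texts.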
    
15.12.2018 / 17:47