Knowing the frequency of words

6

Hello, I would like to know whether there is a function or command in R that finds the most frequent words in a text and how many times each one appears.

For example, I have a very large dataset in which each row is a piece of text.

I'd like to know which words occur most often in it, and how often each one appears.

    
asked by user20273 20.02.2015 at 19:01

3 answers

5

I don't know whether there is a built-in function for this, but I once used the code suggested in this tutorial to do such a count. The final code is below (not exactly the tutorial's code, but the version I adapted and used):

texto <- scan("oslusiadas.txt", what="char", sep="\n", encoding = "UTF-8")
texto <- tolower(texto)

lista_palavras <- strsplit(texto, "\\W+")
vetor_palavras <- unlist(lista_palavras)

frequencia_palavras <- table(vetor_palavras)
frequencia_ordenada_palavras <- sort(frequencia_palavras, decreasing=TRUE)

palavras <- paste(names(frequencia_ordenada_palavras), frequencia_ordenada_palavras, sep=";")

cat("Palavra;Frequencia", palavras, file="frequencias.csv", sep="\n")    

In this test I counted the words of the poem "Os Lusíadas", available on the Project Gutenberg page. From the text file I removed the license clauses and other English text, leaving only the poem. The first two lines of the code read the file (as UTF-8, since the text contains accented characters) and normalize the text by converting everything to lowercase. The next two lines split the text into a vector of words, and the two after that count how many times each word appears and sort the counts in descending order, so the most frequent words come first. It is important not to use the Perl format for the regular expression in strsplit, because it does not handle accented words correctly (that is, use perl=FALSE or omit the parameter, since FALSE is the default value). Finally, the last two lines format the result and save it to a text file (I used a semicolon as the separator).

The result is something like this, and the file can be imported into Excel (for example):

Palavra;Frequencia
que;2741
e;2221
o;1953
a;1858
de;1438
se;981
os;750
;742
do;627
não;585
com;574
por;538
em;519
as;516
da;487
lhe;401
no;326
já;309
mais;283
mas;283
na;252
um;239
quem;232
ao;231
gente;230
dos;227
terra;222
tão;210
para;205
rei;204
como;195
mar;188
onde;177
the;176
é;160
seu;155
[...]
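
The row with an empty word (;742) in the output above most likely comes from lines that begin with a non-word character, such as a quotation mark or indentation: when the separator matches at the very start of a string, strsplit returns an empty string as the first piece. Below is a minimal sketch of how those empty tokens could be dropped before counting. The sample text and the name vetor_limpo are made up for illustration and are not part of the original code:

```r
# Made-up sample (unaccented on purpose): the first line starts with a quote
texto <- c("\"as armas e os baroes", "e os reis")

vetor_palavras <- unlist(strsplit(texto, "\\W+"))
# The leading quote yields an empty first token: "" "as" "armas" ...

# Keep only tokens that have at least one character
vetor_limpo <- vetor_palavras[nzchar(vetor_palavras)]

# Count and sort as in the code above
frequencia <- sort(table(vetor_limpo), decreasing = TRUE)
print(frequencia)
```

Applying the same nzchar filter to vetor_palavras before calling table should remove the empty row from the CSV file.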
    
20.02.2015 at 19:37
3

The idea is to divide all lines of your text into words (for example, using strsplit ), concatenate all words, and count the instances of each of them (for example, using table ). The code below shows a possible implementation:

contaPalavras <- function(linhas) {
    palavras <- strsplit(linhas, "\\W+")
    todas <- unlist(palavras)
    contagem <- table(todas)
    contagem[order(-contagem)]
}
linhas <- c(
    "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Maecenas porttitor congue massa. Fusce posuere, magna sed pulvinar ultricies, purus lectus malesuada libero, sit amet commodo magna eros quis urna.",
    "Nunc viverra imperdiet enim. Fusce est. Vivamus a tellus.",
    "Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Proin pharetra nonummy pede. Mauris et orci.",
    "Aenean nec lorem. In porttitor. Donec laoreet nonummy augue.",
    "Suspendisse dui purus, scelerisque at, vulputate vitae, pretium mattis, nunc. Mauris eget neque at sem venenatis eleifend. Ut nonummy.",
    "Fusce aliquet pede non pede. Suspendisse dapibus lorem pellentesque magna. Integer nulla.",
    "Donec blandit feugiat ligula. Donec hendrerit, felis et imperdiet euismod, purus ipsum pretium metus, in lacinia nulla nisl eget sapien. Donec ut est in lectus consequat consequat.",
    "Etiam eget dui. Aliquam erat volutpat. Sed at lorem in nunc porta tristique.",
    "Proin nec augue. Quisque aliquam tempor magna. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.",
    "Nunc ac magna. Maecenas odio dolor, vulputate vel, auctor ac, accumsan id, felis. Pellentesque cursus sagittis felis.")
contaPalavras(linhas)

Note that you will probably want to remove words you do not want to count, such as articles, conjunctions, and prepositions, but that depends on the requirements of your use case.
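
One possible way to do that filtering is sketched below. The function name contaPalavrasFiltradas and the stop-word list are hypothetical and only illustrative; unlike contaPalavras above, this variant also lowercases the text and drops empty tokens, which is an assumption about what you want:

```r
# Hypothetical stop-word list; replace it with one suited to your language
stopwords_exemplo <- c("a", "et", "in", "at", "ac")

contaPalavrasFiltradas <- function(linhas, stopwords = character(0)) {
    # Lowercase, split on runs of non-word characters, flatten
    todas <- unlist(strsplit(tolower(linhas), "\\W+"))
    # Drop empty tokens and any word present in the stop-word list
    todas <- todas[nzchar(todas) & !(todas %in% stopwords)]
    sort(table(todas), decreasing = TRUE)
}

exemplo <- c("Mauris et orci.", "Fusce est. Vivamus a tellus et orci.")
resultado <- contaPalavrasFiltradas(exemplo, stopwords_exemplo)
print(resultado)
```

Keeping the stop-word list as a parameter makes it easy to swap in a different list per language or per project.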

    
20.02.2015 at 19:35
0

The tokenizers package helps you do this in a very easy way!

Example:

library(tokenizers)
library(magrittr)  # provides the %>% pipe used below

tokenize_words(linhas, lowercase = FALSE) %>%
  unlist() %>%
  table() %>%
  sort(decreasing = TRUE)

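The nice thing about tokenizers is that it already handles some preprocessing, such as converting everything to lowercase (the default, which the example above turns off with lowercase = FALSE) and removing punctuation.

In addition, it has other functions such as tokenize_ngrams, which splits the text into sequences of consecutive words (n-grams) instead of single words, so you can count combinations of words.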
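
To make "combinations of words" concrete, here is a base R sketch that counts pairs of adjacent words (bigrams) in a made-up sentence. It only illustrates the idea and is not how the tokenizers package does it; the sentence and variable names are hypothetical:

```r
# Made-up sentence used only for illustration
frase <- "the sea and the sea and the land"
palavras <- unlist(strsplit(tolower(frase), "\\W+"))

# Pair each word with the word that follows it
bigramas <- paste(head(palavras, -1), tail(palavras, -1))

# Count each pair and sort from most to least frequent
contagem <- sort(table(bigramas), decreasing = TRUE)
print(contagem)
```

Note that this sketch works on a single string; applied to several lines flattened together, it would also pair the last word of one line with the first word of the next.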

    
19.12.2017 at 20:58