Get most used words from a string

3

I have a large string:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas porttitor non felis quis dignissim. Morbi varius arcu lorem, eget efficitur nibh interdum vitae. Aenean tristique hendrerit diam a consequat. Nunc eleifend dolor ut rhoncus sollicitudin. Suspendisse tincidunt sodales turpis et egestas. Sed maximus libero malesuada lacus tempor, quis placerat nunc varius. Nam eget lectus imperdiet, lobortis mi sit amet, tristique justo. Fusce in felis et erat auctor vehicula quis dapibus libero. In commodo a leo eu eleifend.

How can I get the 4 most repeated words in this string?

    
asked by anonymous 20.12.2014 / 22:01

4 answers

4

You have to use 3 steps:

str_word_count(): this function counts the number of words in a string. When you pass 1 as the second parameter, it returns an array with all the words instead.

array_count_values(): this function returns a new array in which the values of the original array become the keys, and the value of each key is the number of times it occurred.

arsort(): this function sorts the array in descending order of value, keeping each key attached to its value, so the highest counts come first.
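To make the three steps concrete, here is a minimal sketch that shows the intermediate arrays on a short input. The sample sentence and the variable names are made up for illustration only.

```php
<?php
// Hypothetical short input, chosen only to keep the intermediate arrays readable.
$text = 'the cat and the dog and the bird';

// Step 1: with format 1, str_word_count() returns the list of words found.
$words = str_word_count($text, 1);
// ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird']

// Step 2: each distinct word becomes a key, its value is the number of occurrences.
$counts = array_count_values($words);
// ['the' => 3, 'cat' => 1, 'and' => 2, 'dog' => 1, 'bird' => 1]

// Step 3: sort by value, highest first, keeping the word => count association.
arsort($counts);

// 'the' (3) and 'and' (2) come first; the words that occur once follow.
print_r($counts);
```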

Example:

$string = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas porttitor non felis quis dignissim. Morbi varius arcu lorem, eget efficitur nibh interdum vitae. Aenean tristique hendrerit diam a consequat. Nunc eleifend dolor ut rhoncus sollicitudin. Suspendisse tincidunt sodales turpis et egestas. Sed maximus libero malesuada lacus tempor, quis placerat nunc varius. Nam eget lectus imperdiet, lobortis mi sit amet, tristique justo. Fusce in felis et erat auctor vehicula quis dapibus libero. In commodo a leo eu eleifend.';
$palavras = array_count_values(str_word_count($string, 1));
arsort($palavras);
var_dump($palavras);

Will give:

array(64) {
  ["quis"]=>
  int(3)
  ["tristique"]=>
  int(2)
  ["varius"]=>
  int(2)
  ["a"]=>
  int(2)
  ["eleifend"]=>
  int(2)
  ["et"]=>
  int(2)
  ["libero"]=>
  int(2)
  ["felis"]=>
  int(2)
  ["eget"]=>
  int(2)
  etc...
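Note that array_count_values() treats "Lorem" and "lorem" as different keys, so words that differ only in case are counted separately. If that is not what you want, one possible variation is to lowercase the text first and then keep only the top entries. The sketch below assumes plain ASCII text (strtolower() does not handle multibyte characters) and uses a hypothetical function name:

```php
<?php
// Hypothetical helper, not part of the answer above.
function topWordsIgnoringCase(string $text, int $limit = 4): array
{
    // Lowercase first so that "Lorem" and "lorem" end up under the same key.
    $words = str_word_count(strtolower($text), 1);

    $counts = array_count_values($words);
    arsort($counts);

    // Keep only the first $limit entries; string keys (the words) are preserved.
    return array_slice($counts, 0, $limit, true);
}

// Made-up sample input: "lorem" occurs three times and "ipsum" twice once case is ignored.
print_r(topWordsIgnoringCase('Lorem ipsum lorem dolor Ipsum lorem', 2));
```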
    
20.12.2014 / 22:10
3

Essentially, we need to break the text into an array of words, count how often each word repeats, sort the result from the most repeated to the least repeated, and finally keep only the first X.

For this we will use the PHP function str_word_count() to split the given text into an array of words, array_count_values() to count how many times each word occurs, arsort() to sort the array in descending order without losing the key association, and finally array_slice() to keep only the desired number of words:

/**
 * Most Repeated Words
 * Based on the text received, return the first X
 * most repeated words
 *
 * @param string $texto The text to evaluate
 * @param integer $quantidade The number of words to return
 *
 * @return array Array with the most repeated words
 */
function palavrasMaisRepetidas($texto="", $quantidade=4) {

  $palavras = array_count_values(str_word_count($texto, 1));

  arsort($palavras);

  return array_slice($palavras, 0, $quantidade);
}

Example:

$texto = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas porttitor non felis quis dignissim. Morbi varius arcu lorem, eget efficitur nibh interdum vitae. Aenean tristique hendrerit diam a consequat. Nunc eleifend dolor ut rhoncus sollicitudin. Suspendisse tincidunt sodales turpis et egestas. Sed maximus libero malesuada lacus tempor, quis placerat nunc varius. Nam eget lectus imperdiet, lobortis mi sit amet, tristique justo. Fusce in felis et erat auctor vehicula quis dapibus libero. In commodo a leo eu eleifend.";

var_dump(palavrasMaisRepetidas($texto, 4));

Result:

array(4) {
  ["quis"]=>
  int(3)
  ["tristique"]=>
  int(2)
  ["varius"]=>
  int(2)
  ["a"]=>
  int(2)
}

See example on Ideone.
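One caveat: many words here share the same count, and arsort() alone does not say which of them comes first. As far as I know, the relative order of equal elements was undefined before PHP 8.0 and sorting only became stable from 8.0 on. If you need the same four words on every run, a possible approach is to add an explicit tie-break. The sketch below assumes PHP 7 or later; the function name and the alphabetical tie-break rule are my own choices, not part of the answer above:

```php
<?php
// Hypothetical variant of the function above with a deterministic order for ties.
function topWordsDeterministic(string $text, int $limit = 4): array
{
    $counts = array_count_values(str_word_count($text, 1));

    // Higher count first; words with equal counts fall back to alphabetical order.
    uksort($counts, function ($a, $b) use ($counts) {
        return ($counts[$b] <=> $counts[$a]) ?: strcmp((string) $a, (string) $b);
    });

    return array_slice($counts, 0, $limit, true);
}

// Made-up input: "a" and "b" both occur twice, "c" and "d" once.
print_r(topWordsDeterministic('b a c a b d', 3));
// "a" is listed before "b" because of the alphabetical tie-break, then "c".
```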

    
20.12.2014 / 22:30
2

Sergio's and Zuul's answers probably perform better, but here is a didactic solution that uses strtok to break the text into words and does the counting manually. This solution is case-insensitive.

<?php
$texto = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas porttitor non felis quis dignissim. Morbi varius arcu lorem, eget efficitur nibh interdum vitae. Aenean tristique hendrerit diam a consequat. Nunc eleifend dolor ut rhoncus sollicitudin. Suspendisse tincidunt sodales turpis et egestas. Sed maximus libero malesuada lacus tempor, quis placerat nunc varius. Nam eget lectus imperdiet, lobortis mi sit amet, tristique justo. Fusce in felis et erat auctor vehicula quis dapibus libero. In commodo a leo eu eleifend.";
$frequencias = array();
$separadores = " .,;:!?/\"'()[]{}\n\r\t";
$palavra = strtok($texto, $separadores);
while($palavra !== false) {
    if(array_key_exists(strtoupper($palavra), $frequencias)) {
        $frequencias[strtoupper($palavra)]++;
    } else {
        $frequencias[strtoupper($palavra)] = 1;
    }
    $palavra = strtok($separadores);
}
arsort($frequencias);
print_r($frequencias);

Result:

Array
(
    [QUIS] => 3
    [ET] => 2
    [VARIUS] => 2
    [IN] => 2
    [TRISTIQUE] => 2
    [A] => 2
    [LIBERO] => 2
    [LOREM] => 2
    [ELEIFEND] => 2
    [NUNC] => 2
    [FELIS] => 2
    [EGET] => 2
    [DOLOR] => 2
    [SIT] => 2
    [AMET] => 2
    [LEO] => 1
    [TEMPOR] => 1
    ...
)
    
21.12.2014 / 20:54
1

Dude, try something like this:

Split the string on spaces using explode().

Then count the occurrences with array_count_values().

Sort the result in descending order (for example with arsort()) and just grab the first 4 elements of the array!
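A rough sketch of that idea follows. Keep in mind that explode() on a space leaves punctuation attached to the words ("fish," and "fish." would count as different words) and that array_count_values() does not sort anything, so the sketch trims punctuation and sorts before slicing. The sample text, the variable names and the list of trimmed characters are assumptions for illustration:

```php
<?php
// Made-up sample text.
$text = 'one fish, two fish. Red fish, blue fish!';

// Split on single spaces; punctuation is still glued to the words here.
$words = explode(' ', $text);

// Strip punctuation from both ends so "fish," and "fish." become "fish".
$words = array_map(function ($word) {
    return trim($word, ".,;:!?\"'()");
}, $words);

// Drop empty strings left behind by repeated spaces or lone punctuation.
$words = array_filter($words, function ($word) {
    return $word !== '';
});

$counts = array_count_values($words);

// Sort by count, highest first, then keep the first 4 entries.
arsort($counts);
$top = array_slice($counts, 0, 4, true);

print_r($top);
```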

20.12.2014 / 22:13