How to return most common words in a text with PHP?

I'd like to know how best to return the most frequent occurrences of substrings in a string containing text. Example:

$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

And the output:

array(
    "PHP" => 2,
    "de" => 2,
    //...
);

The idea is to return an array with the most used words in a given string.

I'm currently using the substr_count() function, but it only works if you pass it the word to look for. That is, I would need to know the words in the text in advance and check them one by one.

Is there any other way to do this?

php

asked by anonymous 29.06.2014 / 20:55

3 answers

4

My "handmade" way would be:

$texto = "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.";

$palavras = explode(' ', $texto);
echo count($palavras); // 91
$ocorrencias = array();

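foreach ($palavras as $palavra) {
    // Initialize the counter on first sight to avoid an undefined-index notice
    if (!isset($ocorrencias[$palavra])) {
        $ocorrencias[$palavra] = 0;
    }
    $ocorrencias[$palavra]++;
}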

arsort($ocorrencias);
var_dump($ocorrencias);

Result:

array(69) { 
    ["the"]=> int(6) 
    ["Lorem"]=> int(4) 
    ["of"]=> int(4) 
    ["Ipsum"]=> int(3) 
    ["and"]=> int(3) 
    ["a"]=> int(2) 
    // etc

The advantage of this approach is that the text only needs to be split on spaces.

You can also add a line like this before explode():

$texto = preg_replace('/[,.?!;]+/', '', $texto);

to strip commas, periods, and similar punctuation, depending on what you are looking for.
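As a rough sketch, the steps of this answer could be wrapped in a helper that returns only the top entries. The function name palavrasMaisComuns and the $limite parameter are hypothetical, introduced here for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the answer): counts space-separated words
// after stripping basic punctuation, then returns the $limite most frequent.
function palavrasMaisComuns($texto, $limite = 5)
{
    // Strip the same punctuation characters as the preg_replace() line above.
    $texto = preg_replace('/[,.?!;]+/', '', $texto);

    $ocorrencias = array();
    foreach (explode(' ', $texto) as $palavra) {
        if ($palavra === '') {
            continue; // consecutive spaces produce empty strings
        }
        if (!isset($ocorrencias[$palavra])) {
            $ocorrencias[$palavra] = 0;
        }
        $ocorrencias[$palavra]++;
    }

    arsort($ocorrencias); // highest counts first, keys preserved
    return array_slice($ocorrencias, 0, $limite, true);
}

$top = palavrasMaisComuns("Lorem Ipsum is dummy text. Lorem Ipsum has been the dummy text. Lorem", 2);
print_r($top);
```

Passing true as the fourth argument of array_slice() keeps the word keys even when a word happens to be purely numeric.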

29.06.2014 / 21:31

2

My solution

This solution is a little more robust: it splits the text into words, cleans each one carefully, and only then adds it to a new array, which is finally sorted by number of occurrences.

<?php
$texto = "Hoje nós vamos falar de PHP! mas o que é PHP?? PHP é uma linguagem criada no ano de ...";

// Split the text on spaces (raw, unfiltered)
$palavras_raw = explode(" ", $texto);

// Characters to strip from each word
$ignorar =
[".", ",", "!", ";", ":", "(", ")", "{", "}", "[", "]", "<", ">",
"?", "|", "\\", "/"];

// Counts for the cleaned words
$palavrasTratadas = array();

// Clean each word and count it
foreach ($palavras_raw as $palavraAtual) {
    $palavraAtual = trim($palavraAtual);
    if ($palavraAtual !== "") {
        $palavraTratada = str_replace($ignorar, "", $palavraAtual);
        $palavraTratada = strtolower($palavraTratada);
        if ($palavraTratada !== "") {
            // Initialize the counter on first sight to avoid an undefined-index warning
            if (!isset($palavrasTratadas[$palavraTratada])) {
                $palavrasTratadas[$palavraTratada] = 0;
            }
            $palavrasTratadas[$palavraTratada]++;
        }
    }
}

// Sort by number of occurrences, highest first
arsort($palavrasTratadas);

// DEBUG
print_r($palavrasTratadas);

It splits the text on spaces, removes the special characters listed in the $ignorar array from each word, and stores the cleaned words in the $palavrasTratadas array, which avoids unexpected errors and results. It does NOT distinguish between uppercase and lowercase, because someone may start a sentence with a capitalized Today and then use today in the rest of the text. Note, however, that PHP's strtolower() function is designed for English (ASCII) text, so it does not convert accented uppercase letters, for example É to é.
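The lowercase limitation mentioned above can be illustrated with a short sketch. It assumes the mbstring extension is enabled and that the text is UTF-8 encoded. mb_strtolower() is a standard PHP function, but using it here is a suggestion, not part of the original answer. To my understanding, strtolower() converts only ASCII letters as of PHP 8.2, and in earlier versions its behavior depended on the current locale.

```php
<?php
// Assumes the mbstring extension is enabled and the strings are UTF-8.

// strtolower() handles plain ASCII letters reliably.
$ascii = strtolower("PHP");

// mb_strtolower() applies Unicode case mapping for the given encoding,
// so accented uppercase letters are converted as well.
$unicode = mb_strtolower("É UMA LINGUAGEM", 'UTF-8');

var_dump($ascii, $unicode);
```

Replacing the strtolower() call in the loop with mb_strtolower($palavraTratada, 'UTF-8') should make words that differ only in the case of an accented letter count as the same word.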

30.06.2014 / 04:06

score 5 · Accepted Answer

Try this:

print_r(array_count_values(str_word_count($texto, 1, "óé")));

Result:

Array (
   [Hoje] => 1
   [nós] => 1
   [vamos] => 1
   [falar] => 1
   [de] => 2
   [PHP] => 2
   [é] => 1
   [uma] => 1
   [linguagem] => 1
   [criada] => 1
   [no] => 1
   [ano] => 1
)

To understand how array_count_values works, see the PHP manual.

Edit
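Since the question asks for the most common words, the counts can also be sorted and filtered. The following is a minimal sketch, assuming the same sample $texto as in the question; the variable names $contagem and $frequentes are illustrative.

```php
<?php
$texto = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

// Split into words; "óé" adds the accented letters used in this text.
$palavras = str_word_count($texto, 1, "óé");

// array_count_values() maps each distinct value to how often it occurs.
$contagem = array_count_values($palavras);

// Sort by count, descending, keeping the word keys.
arsort($contagem);

// Keep only words that occur more than once.
$frequentes = array_filter($contagem, function ($n) {
    return $n > 1;
});
print_r($frequentes);
```

For this text only PHP and de occur twice, so those are the two entries left in $frequentes.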

A smarter solution (language independent)

With the above solution, you need to specify the entire set of UTF-8 special characters that may appear (as was done with ó and é).

The following solution is trickier, but it eliminates the problem of the special character set.

$text = str_replace(".","", "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...");
$namePattern = '/[\s,:?!]+/u';
$wordsArray = preg_split($namePattern, $text, -1, PREG_SPLIT_NO_EMPTY);
$wordsArray2 = array_count_values($wordsArray);
print_r($wordsArray2);

In this solution I use regular expressions to break words and then I use array_count_values to count words. The result is:

Array 
( 
  [Hoje] => 1 
  [nós] => 1 
  [vamos] => 1 
  [falar] => 1 
  [de] => 2 
  [PHP] => 2 
  [é] => 1 
  [uma] => 1 
  [linguagem] => 1 
  [criada] => 1 
  [no] => 1 
  [ano] => 1 
)
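An alternative worth considering, which is not the method used in this answer, is to match the words themselves instead of splitting on separators. The sketch below assumes UTF-8 input and uses the PCRE Unicode properties \p{L} (letters) and \p{N} (numbers), so every other character, including periods, is treated as a separator.

```php
<?php
$text = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

// \p{L} matches any Unicode letter and \p{N} any numeric character; the /u
// flag makes PCRE treat both the pattern and the subject as UTF-8.
preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches);

// $matches[0] holds every full match, i.e. every word found.
$counts = array_count_values($matches[0]);
arsort($counts);
print_r($counts);
```

The trade-off is that apostrophes and hyphens also split words under this pattern, so industry's would be counted as industry and s.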

This solution also meets the need. However, the periods must be removed before splitting the words; otherwise, words with a trailing . and words without one will be counted separately in the result. For example:

  ...
  [PHP.] => 1 
  [PHP] => 1 
  ...
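One possible way to avoid the separate str_replace() step, sketched here as a variation and not as the answer's own code, is to add the period to the character class so that it acts as a separator too:

```php
<?php
$text = "Hoje nós vamos falar de PHP. PHP é uma linguagem criada no ano de ...";

// Same pattern as above with "." added to the character class, so periods
// act as separators instead of being stripped beforehand.
$wordsArray = preg_split('/[\s.,:?!]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
$counts = array_count_values($wordsArray);
print_r($counts);
```

This also splits tokens such as 3.14 or example.com at the period, which may or may not be acceptable depending on the text.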

Counting words is never as simple as it seems. You need to know the string whose words you want to count well before settling on a definitive solution.