Higher frequency of a string

2

I have a text where I am handling various things. Now I need to get the 3 words that repeat the most in the entire text. How can I do this? What is the best solution?

I thought about storing in a list but I do not know how to put the word and an accountant. I also do not know if it would be a "smart" way to do it?

    
asked by anonymous 09.09.2014 / 03:11

2 answers

4

Assuming you've already done the work of separating the text into words, the data structure you're looking for is java.util.Map (also known as" dictionary "). Its function is just to map an object (the "key") to another object (the "value") - or in its case, the word to the number of times it appears.

Map<String,Integer> quantas = new HashMap<String,Integer>();

quantas.put("foo", 1); // Diz que a palavra "foo" apareceu 1 vez

int x = quantas.get("foo"); // Pega o número de vezes que ela apareceu
quantas.put("foo", x+1);    // Atualiza o valor (sobrescreve)

// Enumera todas as palavras do dicionário
for ( Map.Entry<String,Integer> par : quantas.entrySet() )
    System.out.println(par.getKey() + " apareceu " + par.getValue() + " vezes.");

If you're using Java 8, you can do this even more simply by using lambdas :

for ( String palavra : listaPalavras )
    quantas.compute(palavra, (k, v) -> v == null ? 1 : v+1);

(i.e. if the word is not yet in the dictionary - v == null - check that it appeared 1 time, otherwise increment 1 in the number of times it appeared)

As for a "smart" way to find the N words that popped up the most, I suggest a priority queue ( PriorityQueue ) where:

  • The comparator used in the creation order pairs (palavra, nº ocorrências) in ascending order of number of occurrences;
  • Once you've created and populated Map quantas , you'll go through adding elements to that priority queue, keeping your size limited to the number you want (ie remove the smaller ones when the queue grows beyond the point you want - so the performance of the algorithm will be good, because it will avoid comparing elements that do not matter).

I'm not going to post a full example because even the simplest Java tasks take 10x more code than a decent language . But if you find it difficult to use the above method you can ask what I explain better. Or, if you're not concerned about performance and just want a simple, straightforward method, put those pairs in a list and order them.

Update: until it was not so bad , but damn, it lacks an inference of types do not ...

int max = 3;

PriorityQueue<Map.Entry<String,Integer>> fila = 
    new PriorityQueue<Map.Entry<String,Integer>>(max+1, new Comparator<Map.Entry<String,Integer>>() {
        public int compare(Map.Entry<String,Integer> a, Map.Entry<String,Integer> b) {
            return a.getValue() < b.getValue() ? -1 :
                   a.getValue() > b.getValue() ? 1 :
                   a.getKey().compareTo(b.getKey()); // Desempate
        }
    });

for ( Map.Entry<String,Integer> entry : quantas.entrySet() ) {
    fila.add(entry);
    if ( fila.size() > max )
        fila.poll(); // Remove o menor
}
    
09.09.2014 / 05:07
2
  • Create a HashMap of String and integers to sweep the text by adding words to the map if they are not in it.

    The advantage of the map is precisely the fact that it is a structure that operates with keys (in this case words) and with values (integers).
    This way you only have to increase the value, if it is already on the map.

  • Check the map entries for the most frequent entries.

Example code:

   
import java.util.HashMap;
import java.util.Map;

public class FreqPalavra {
    private static final String LOREM_IPSUM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec a diam lectus. Sed sit amet ipsum mauris. Maecenas congue ligula ac quam viverra nec consectetur ante hendrerit. Donec et mollis dolor. Praesent et diam eget libero egestas mattis sit amet vitae augue. Nam tincidunt congue enim, ut porta lorem lacinia consectetur. Donec ut libero sed arcu vehicula ultricies a non tortor. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean ut gravida lorem. Ut turpis felis, pulvinar a semper sed, adipiscing id dolor. Pellentesque auctor nisi id magna consequat sagittis. Curabitur dapibus enim sit amet elit pharetra tincidunt feugiat nisl imperdiet. Ut convallis libero in urna ultrices accumsan. Donec sed odio eros. Donec viverra mi quis quam pulvinar at malesuada arcu rhoncus. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. In rutrum accumsan ultricies. Mauris vitae nisi at sem facilisis semper ac in est."
            + "Vivamus fermentum semper porta. Nunc diam velit, adipiscing ut tristique vitae, sagittis vel odio. Maecenas convallis ullamcorper ultricies. Curabitur ornare, ligula semper consectetur sagittis, nisi diam iaculis velit, id fringilla sem nunc vel mi. Nam dictum, odio nec pretium volutpat, arcu ante placerat erat, non tristique elit urna et turpis. Quisque mi metus, ornare sit amet fermentum et, tincidunt et orci. Fusce eget orci a orci congue vestibulum. Ut dolor diam, elementum et vestibulum eu, porttitor vel elit. Curabitur venenatis pulvinar tellus gravida ornare. Sed et erat faucibus nunc euismod ultricies ut id justo. Nullam cursus suscipit nisi, et ultrices justo sodales nec. Fusce venenatis facilisis lectus ac semper. Aliquam at massa ipsum. Quisque bibendum purus convallis nulla ultrices ultricies. Nullam aliquam, mi eu aliquam tincidunt, purus velit laoreet tortor, viverra pretium nisi quam vitae mi. Fusce vel volutpat elit. Nam sagittis nisi dui."
            + "Suspendisse lectus leo, consectetur in tempor sit amet, placerat quis neque. Etiam luctus porttitor lorem, sed suscipit est rutrum non. Curabitur lobortis nisl a enim congue semper. Aenean commodo ultrices imperdiet. Vestibulum ut justo vel sapien venenatis tincidunt. Phasellus eget dolor sit amet ipsum dapibus condimentum vitae quis lectus. Aliquam ut massa in turpis dapibus convallis. Praesent elit lacus, vestibulum at malesuada et, ornare et est. Ut augue nunc, sodales ut euismod non, adipiscing vitae orci. Mauris ut placerat justo. Mauris in ultricies enim. Quisque nec est eleifend nulla ultrices egestas quis ut quam. Donec sollicitudin lectus a mauris pulvinar id aliquam urna cursus. Cras quis ligula sem, vel elementum mi. Phasellus non ullamcorper urna."
                    .replaceAll("[.,]", "");

    public static void main(String[] args) {
        Map<String, Integer> mapaFreq = new HashMap<>();
        // Cria o mapa de Frequências
        for (String palavra : LOREM_IPSUM.split("\s+")) {
            if (!mapaFreq.containsKey(palavra)) {
                mapaFreq.put(palavra, 1);
            } else {
                mapaFreq.put(palavra, 1 + mapaFreq.get(palavra));
            }
        }
        // Arrays para armazenar os 3 valores mais frequentes.
        String[] palavrasMaisFrequentes = new String[3];
        int[] freqPalavras = new int[3];
        //Percorre todos os valores do mapa
        for (Map.Entry<String, Integer> entrada : mapaFreq.entrySet()) {
            //Se achar algo mais frequente que a primeira posição
            if (entrada.getValue() >= freqPalavras[0]) {
                freqPalavras[0] = entrada.getValue();
                palavrasMaisFrequentes[0] = entrada.getKey();

            } else {
                if (entrada.getValue() >= freqPalavras[1]) {
                    freqPalavras[1] = entrada.getValue();
                    palavrasMaisFrequentes[1] = entrada.getKey();
                } else if (entrada.getValue() >= freqPalavras[2]) {
                    freqPalavras[2] = entrada.getValue();
                    palavrasMaisFrequentes[2] = entrada.getKey();
                }
            }
//          System.out.println(entrada.getKey() + "/" + entrada.getValue()); imprime todo o mapa
        }
        for (int i = 0; i < freqPalavras.length; i++) {
            System.out.println(i + 1 + " palavra: " + palavrasMaisFrequentes[i]
                    + " \nFrequência: " + freqPalavras[i]
                    + "\n------------------------\n");
        }

    }
}

You can see this example by running in ideone .

Note

For simplification purposes the above code does not pass the value to the other positions of the array (which is a logic error, because if a term is more frequent it should update the array cascade)

  • Example:
freqPalavras[]     |   palavrasMaisFrequentes[]
      13           |          "estouro"
      11           |          "da"
      09           |          "pilha"

If a word "batman" with frequency 14 appears the new order should be:

freqPalavras[]     |   palavrasMaisFrequentes[]
      14           |          "batman"
      13           |          "estouro"
      11           |          "da"

But in the program above would be:

freqPalavras[]     |   palavrasMaisFrequentes[]
      14           |          "batman"
      11           |          "da"
      09           |          "pilha"
    
09.09.2014 / 05:09