How to check for similarity between strings?

20

It is very common to compare strings , which is usually done by comparing equality. However, today I have come up with the need to compare the similarity between two strings , so that I can compare how similar they are.

For example:

"City" is different from "city", but is similar (consider c ! = C ).

"Cdad" is different from "city", but it is similar (assuming the user entered wrong).

"city" is different from "city", but it is similar (the same thing).

My question is: how do I check the similarity between strings ? Is there any algorithm ready that does this?

I'm looking for a way to do this:

if(checkSimilarity("cidade","cdade") > 0.8)
{
    // as duas strings são muito parecidas
    // sendo que checkSimilarity() retornando 1 para strings iguais
}

It does not matter the programming language used in the response. My question is more related to the algorithm.

    
asked by anonymous 25.03.2014 / 01:28

5 answers

10

I think the Levenshtein algorithm is a solution that suits well what you intend to do, since with it you can know the number of modifications by which one word must pass to match the other.

A possible implementation of it (in C ++) would be:

#include <iostream.h>
#include <string.h>
#include <math.h>
#include <algorithm>

using namespace std;

int VerificarSimilaridade(string palavra1, string palavra2)
{
    int tam1 = palavra1.size();
    int tam2 = palavra2.size();
    int verif[tam1 + 1][tam2 + 1];

    // Se uma das palavras tiver coprimento igual a zero, é necessária uma modificação do tamanho da outra:
    if (tam1 == 0)
        return tam2;

    if (tam2 == 0)
        return tam1;

    // Atribuir ordem numérica resultante das palavras ao vetor de modo, respectivamente, "vertical" e "horizontal":
    int i = 0;
    while(i <= tam1)
        verif[i][0] = i++;

    int j = 0; 
    while(j <= tam2)
        verif[0][j] = j++;

    // Verificação:
    for (int i = 1; i <= tam1; i++)
    {
        for (int j = 1; j <= tam2; j++)
        {
            // Definindo custos de modificação (deleção, inserção e substituição):
            int custo = (palavra2[j - 1] == palavra1[i - 1]) ? 0 : 1;

            verif[i][j] = min(
                min(verif[i - 1][j] + 1, verif[i][j - 1] + 1), 
                verif[i - 1][j - 1] + custo);
        }
    }

    return verif[tam1][tam2];
 }

 int main()
 {
     string pala1, pala2;

     cout << "Informe a primeira palavra: " << endl;
     cin >> pala1;
     cout << "Informe a segunda palavra: " << endl;
     cin >> pala2;

     cout << "O numero de modificacoes necessarias para que as duas palavras se igualem e: " 
          << VerificarSimilaridade(pala1, pala2) << endl;

     system("pause");
     return 0;   
 }

In addition, from the number of modifications you can create your own criteria to quantify the level of similarity between two strings; something interesting would be to calculate the percentage of change according to the length of the words, so through a reference degree it would be possible to verify that 5 modifications can have high value in a string with 7 letters and, at the same time, for example.

    
25.03.2014 / 04:22
7

Here is a fairly simple solution proposal (I did in Java):

public class Teste {

    protected static float checkSimilarity(String sString1, String sString2) throws Exception {

        // Se as strings têm tamanho distinto, obtêm a similaridade de todas as
        // combinações em que tantos caracteres quanto a diferença entre elas são
        // inseridos na string de menor tamanho. Retorna a similaridade máxima
        // entre todas as combinações, descontando um percentual que representa
        // a diferença em número de caracteres.
        if(sString1.length() != sString2.length()) {
            int iDiff = Math.abs(sString1.length() - sString2.length());
            int iLen = Math.max(sString1.length(), sString2.length());
            String sBigger, sSmaller, sAux;

            if(iLen == sString1.length()) {
                sBigger = sString1;
                sSmaller = sString2;
            }
            else {
                sBigger = sString2;
                sSmaller = sString1;
            }

            float fSim, fMaxSimilarity = Float.MIN_VALUE;
            for(int i = 0; i <= sSmaller.length(); i++) {
                sAux = sSmaller.substring(0, i) + sBigger.substring(i, i+iDiff) + sSmaller.substring(i);
                fSim = checkSimilaritySameSize(sBigger,  sAux);
                if(fSim > fMaxSimilarity)
                    fMaxSimilarity = fSim;
            }
            return fMaxSimilarity - (1f * iDiff) / iLen;

        // Se as strings têm o mesmo tamanho, simplesmente compara-as caractere
        // a caractere. A similaridade advém das diferenças em cada posição.
        } else
            return checkSimilaritySameSize(sString1, sString2);
    }

    protected static float checkSimilaritySameSize(String sString1, String sString2) throws Exception {

        if(sString1.length() != sString2.length())
            throw new Exception("Strings devem ter o mesmo tamanho!");

        int iLen = sString1.length();
        int iDiffs = 0;

        // Conta as diferenças entre as strings
        for(int i = 0; i < iLen; i++)
            if(sString1.charAt(i) != sString2.charAt(i))
                iDiffs++;

        // Calcula um percentual entre 0 e 1, sendo 0 completamente diferente e
        // 1 completamente igual
        return 1f - (float) iDiffs / iLen;
    }

    public static void main(String[] args) {
        try {
            System.out.println("'ABCD' vs 'ab' = " + checkSimilarity("ABCD", "ab"));
            System.out.println("'cidade' vs 'cdade' = " + checkSimilarity("cidade", "cdade"));
            System.out.println("'cidade' vs 'ciDade' = " + checkSimilarity("cidade", "ciDade"));
            System.out.println("'cidade' vs 'cdiade' = " + checkSimilarity("cidade", "cdiade"));
            System.out.println("'cidade' vs 'edadic' = " + checkSimilarity("cidade", "edadic"));
            System.out.println("'cidade' vs 'CIDADE' = " + checkSimilarity("cidade", "CIDADE"));
            System.out.println("'cidade' vs 'CIdADE' = " + checkSimilarity("cidade", "CIdADE"));
            System.out.println("'cidade' vs 'CdADE' = " + checkSimilarity("cidade", "CdADE"));
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

The principle of the algorithm is quite simple:

  • If the strings are exactly the same size, the checkSimilarity method will simply invoke the checkSimilaritySameSize base method, which compares character strings from left to right, counts the number of errors ) and returns a percentage of errors in relation to the size of the strings.
  • If, on the other hand, strings have different sizes, the checkSimilarity method does the same test for every possible combination in which as many characters as the difference are included in the largest string in all positions of the smaller string. For example, supposing string 1 as "ABCD" (size = 4) and string 2 as "ab" (size = 2), the combinations will be "ABab", "aBCb" and "abCD".
  • Among the similarities calculated between the combinations and the largest string (since both are now the same size), the method chooses the maximum value, but deducts an "error rate" proportional to the number of characters in the difference .

So, the result of running the given code is as follows for these examples:

'ABCD' vs 'ab' = 0.0
'cidade' vs 'cdade' = 0.8333333
'cidade' vs 'ciDade' = 0.8333333
'cidade' vs 'cdiade' = 0.6666666
'cidade' vs 'edadic' = 0.0
'cidade' vs 'CIDADE' = 0.0
'cidade' vs 'CIdADE' = 0.16666669
'cidade' vs 'CdADE' = 0.16666664

The closer to 1, the more similar the strings are, and the closer to 0 the more distinct they are. If you do not want to differentiate uppercase (note the example on the third line), just convert both strings to greater (with String::toUpperCase() ) before comparing them.

Note that this solution is very simple because it admits that errors (if any) are due to the lack of one or more characters in sequence. You can improve the algorithm so that it considers all possible real character combinations, but it will probably be like using a bazooka to kill an ant.

    
25.03.2014 / 03:18
7

There is an article ( How to Strike a Match ) created by Simon White related to what you want, he wrote an article about an algorithm that compares adjacent pairs of characters that should be useful to you.

Some advantages over other algorithms (such as Soundex , Levenshtein , among others, olhe aqui ) are:

  • A true reflection of lexical similarity - strings with small differences should be recognized as being similar.
  • Robustness to word order changes - two strings that contain the same words, but in a different order, should be recognized as similar. On the other hand, if a string is just a random anagram of the characters contained in the other, then it should ( usually ) be recognized as different.
  • Language Independence - The algorithm should work not only in English, but in many different languages.
  • For example, France should be similar to Français and República da França , and República da França should resemble both República Francesa and Republique Francaise . ( Free Translation )

    Code below developed in C #.

    /// <summary>
    /// This class implements string comparison algorithm
    /// based on character pair similarity
    /// Source: http://www.catalysoft.com/articles/StrikeAMatch.html
    /// </summary>
    public class SimilarityTool
    {
        /// <summary>
        /// Compares the two strings based on letter pair matches
        /// </summary>
        /// <param name="str1"></param>
        /// <param name="str2"></param>
        /// <returns>The percentage match from 0.0 to 1.0 where 1.0 is 100%</returns>
        public double CompareStrings(string str1, string str2)
        {
            List<string> pairs1 = WordLetterPairs(str1.ToUpper());
            List<string> pairs2 = WordLetterPairs(str2.ToUpper());
    
            int intersection = 0;
            int union = pairs1.Count + pairs2.Count;
    
            for (int i = 0; i < pairs1.Count; i++)
            {
                for (int j = 0; j < pairs2.Count; j++)
                {
                    if (pairs1[i] == pairs2[j])
                    {
                        intersection++;
                        pairs2.RemoveAt(j);//Must remove the match to prevent "GGGG" from appearing to match "GG" with 100% success
    
                        break;
                    }
                }
            }
    
            return (2.0 * intersection) / union;
        }
    
        /// <summary>
        /// Gets all letter pairs for each
        /// individual word in the string
        /// </summary>
        /// <param name="str"></param>
        /// <returns></returns>
        private List<string> WordLetterPairs(string str)
        {
            List<string> AllPairs = new List<string>();
    
            // Tokenize the string and put the tokens/words into an array
            string[] Words = Regex.Split(str, @"\s");
    
            // For each word
            for (int w = 0; w < Words.Length; w++)
            {
                if (!string.IsNullOrEmpty(Words[w]))
                {
                    // Find the pairs of characters
                    String[] PairsInWord = LetterPairs(Words[w]);
    
                    for (int p = 0; p < PairsInWord.Length; p++)
                    {
                        AllPairs.Add(PairsInWord[p]);
                    }
                }
            }
    
            return AllPairs;
        }
    
        /// <summary>
        /// Generates an array containing every 
        /// two consecutive letters in the input string
        /// </summary>
        /// <param name="str"></param>
        /// <returns></returns>
        private string[] LetterPairs(string str)
        {
            int numPairs = str.Length - 1;
    
            string[] pairs = new string[numPairs];
    
            for (int i = 0; i < numPairs; i++)
            {
                pairs[i] = str.Substring(i, 2);
            }
    
            return pairs;
        }
    }
    
        
    25.03.2014 / 06:09
    1

    The most widely used algorithm for this is the famous "soundex". It basically consists of a code with a letter and 3 numbers. There are variations of this algorithm, according to the phonetics of each country. Tahiti, for example, which has only 8 consonants, will have greater difficulty in comparing words.

    The following xyyy representation, common in the implementation of the English language, (a consonant and three vowels) would look like this:

    First choose a table to use, or create your own, language-based table. Example:

    • 1 B F P V
    • 2 C G J K
    • 2 Q S X Y
    • 3 D T
    • 4 L LH
    • 5 M N
    • 6 R RR

    Another sample table (I use a variation of this):

    • 1 T D
    • 2 N NH
    • 3 M
    • 4 R RR RH
    • 5 L LH
    • 6 G J
    • 7 C K
    • 8 F W V
    • 9 P B
    • 0 S SS SH

    Now, to add, use the principles below:

    • all constant will have a corresponding number;
    • any vowel and punctuation will be ignored;
    • the first letter found will be represented by x;
    • After the 3 consonants are filled, the additional ones will be ignored;
    • Duplicate consonants will be unified;
    • Short names will have zeros added until they complete the xyyy representation.

    Once this is done, just add the values.

    For example, Americanas, will have the following value M-472. Because? Vowels are ignored: mrcns First letter: M Next consonants and their values: 4 (R) 7 (C) 2 (N)

    If the person types americans, they will have the following value M-472: Ignored vowels: mrccns First letter: M Next: 4 (R) 77 (C, two consonants are transformed into 1) 2 (N).

    That is, the sound of Americans and Americans, America, Americanized, are identical. Now, just create your own variation as you need it.

        
    25.03.2014 / 02:01
    0

    In java you have the contains () method of the class String makes the comparison to see if a string is present in the other:

    String a = "Teste123";
    String b = "123";
    
    if(a.contains(b) == true){
    System.out.println("Encontrado!");
    }
    

    Another alternative would be to get all the letters of the Strings and compare the letters separately:

    String a = "Cidade";
    String b = "idade";
    int auxiliar;
    
    for(int i = 0;i<a.lenght();i++){
    if(existeLetra(a.substring(i),b)){
       auxiliar++;
    }
    }
    
    if(a.lenght - auxiliar < 2){
    System.out.println("Parecido!");
    }
    
    public boolean existeLetra(char letra,String palavra){
    return palavra.contains(letra); // retorna true or false
    }
    

    Just to compare differences in uppercase and lowercase you can use the ignoreCase () method.

    String a = "Cidade";
    String b = "cidade";
    
    if(b.ignoreCase().equals(a.ignoreCase)){
       System.out.println("OK!");
    }
    
        
    25.03.2014 / 01:57