Here is a fairly simple solution proposal (I did in Java):
public class Teste {
protected static float checkSimilarity(String sString1, String sString2) throws Exception {
// Se as strings têm tamanho distinto, obtêm a similaridade de todas as
// combinações em que tantos caracteres quanto a diferença entre elas são
// inseridos na string de menor tamanho. Retorna a similaridade máxima
// entre todas as combinações, descontando um percentual que representa
// a diferença em número de caracteres.
if(sString1.length() != sString2.length()) {
int iDiff = Math.abs(sString1.length() - sString2.length());
int iLen = Math.max(sString1.length(), sString2.length());
String sBigger, sSmaller, sAux;
if(iLen == sString1.length()) {
sBigger = sString1;
sSmaller = sString2;
}
else {
sBigger = sString2;
sSmaller = sString1;
}
float fSim, fMaxSimilarity = Float.MIN_VALUE;
for(int i = 0; i <= sSmaller.length(); i++) {
sAux = sSmaller.substring(0, i) + sBigger.substring(i, i+iDiff) + sSmaller.substring(i);
fSim = checkSimilaritySameSize(sBigger, sAux);
if(fSim > fMaxSimilarity)
fMaxSimilarity = fSim;
}
return fMaxSimilarity - (1f * iDiff) / iLen;
// Se as strings têm o mesmo tamanho, simplesmente compara-as caractere
// a caractere. A similaridade advém das diferenças em cada posição.
} else
return checkSimilaritySameSize(sString1, sString2);
}
protected static float checkSimilaritySameSize(String sString1, String sString2) throws Exception {
if(sString1.length() != sString2.length())
throw new Exception("Strings devem ter o mesmo tamanho!");
int iLen = sString1.length();
int iDiffs = 0;
// Conta as diferenças entre as strings
for(int i = 0; i < iLen; i++)
if(sString1.charAt(i) != sString2.charAt(i))
iDiffs++;
// Calcula um percentual entre 0 e 1, sendo 0 completamente diferente e
// 1 completamente igual
return 1f - (float) iDiffs / iLen;
}
public static void main(String[] args) {
try {
System.out.println("'ABCD' vs 'ab' = " + checkSimilarity("ABCD", "ab"));
System.out.println("'cidade' vs 'cdade' = " + checkSimilarity("cidade", "cdade"));
System.out.println("'cidade' vs 'ciDade' = " + checkSimilarity("cidade", "ciDade"));
System.out.println("'cidade' vs 'cdiade' = " + checkSimilarity("cidade", "cdiade"));
System.out.println("'cidade' vs 'edadic' = " + checkSimilarity("cidade", "edadic"));
System.out.println("'cidade' vs 'CIDADE' = " + checkSimilarity("cidade", "CIDADE"));
System.out.println("'cidade' vs 'CIdADE' = " + checkSimilarity("cidade", "CIdADE"));
System.out.println("'cidade' vs 'CdADE' = " + checkSimilarity("cidade", "CdADE"));
} catch (Exception e) {
e.printStackTrace();
}
}
}
The principle of the algorithm is quite simple:
- If the strings are exactly the same size, the
checkSimilarity
method will simply invoke the checkSimilaritySameSize
base method, which compares character strings from left to right, counts the number of errors ) and returns a percentage of errors in relation to the size of the strings.
- If, on the other hand, strings have different sizes, the
checkSimilarity
method does the same test for every possible combination in which as many characters as the difference are included in the largest string in all positions of the smaller string. For example, supposing string 1 as "ABCD" (size = 4) and string 2 as "ab" (size = 2), the combinations will be "ABab", "aBCb" and "abCD".
- Among the similarities calculated between the combinations and the largest string (since both are now the same size), the method chooses the maximum value, but deducts an "error rate" proportional to the number of characters in the difference .
So, the result of running the given code is as follows for these examples:
'ABCD' vs 'ab' = 0.0
'cidade' vs 'cdade' = 0.8333333
'cidade' vs 'ciDade' = 0.8333333
'cidade' vs 'cdiade' = 0.6666666
'cidade' vs 'edadic' = 0.0
'cidade' vs 'CIDADE' = 0.0
'cidade' vs 'CIdADE' = 0.16666669
'cidade' vs 'CdADE' = 0.16666664
The closer to 1, the more similar the strings are, and the closer to 0 the more distinct they are. If you do not want to differentiate uppercase (note the example on the third line), just convert both strings to greater (with String::toUpperCase()
) before comparing them.
Note that this solution is very simple because it admits that errors (if any) are due to the lack of one or more characters in sequence. You can improve the algorithm so that it considers all possible real character combinations, but it will probably be like using a bazooka to kill an ant.