Search for word variations

2

I have a sentence that I need to check against a rule, but it may be written with variations (accents, extra or missing spaces, ...).

Example:

string fraseProcurada = "Cadastro de Usuários - SP";

if (fraseRecebida.Contains(fraseProcurada))
{
   // received the phrase I was looking for
}

However, in this example, the user may type:

  • User registration
  • user registration
  • User registration - São Paulo
  • SP user registration
  • User registration - S.P.

In other words, there is a certain amount of variation that should still count as a match. My first thought was to build an array with these possible forms, but I do not know whether there is a more reliable and simpler way to do it (something like a regex).

Any suggestions?

Thank you.

    
asked by anonymous 04.04.2018 / 15:55

3 answers

2

Some comparisons are not possible in a generic way, such as treating "SP" and "São Paulo" as equal, because no general-purpose method like Contains or IndexOf can know that "SP" means the same as "São Paulo".
In such cases, you would need to create a dictionary that maps the equivalent terms. For differences in case, accents, and symbols, combining the CompareOptions flags with IndexOf already solves many cases. This can be done like this:

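using System.Globalization;

// Returns true when textoAComparar occurs in texto, ignoring case,
// symbols (punctuation, whitespace) and accents (non-spacing marks).
static bool Comparar(string texto, string textoAComparar)
{
    var index = CultureInfo.InvariantCulture.CompareInfo.IndexOf
        (texto, textoAComparar, CompareOptions.IgnoreCase | 
         CompareOptions.IgnoreSymbols | CompareOptions.IgnoreNonSpace);
    return index != -1;
}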

This will suit most cases. Here is an example of working code: link
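
The dictionary idea mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the class name ComparadorComEquivalencias, its methods, and the entries in the table are hypothetical, and a real table would have to be filled with whatever equivalences matter for your data. Each known long form is rewritten to a canonical abbreviation before the same CompareOptions-based IndexOf is applied.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ComparadorComEquivalencias
{
    // Hypothetical equivalence table: key = variant, value = canonical form.
    private static readonly Dictionary<string, string> Equivalencias =
        new Dictionary<string, string>
        {
            { "São Paulo", "SP" },
            { "Sao Paulo", "SP" },
        };

    // Rewrites every known variant to its canonical form (case-insensitive).
    public static string Canonizar(string texto)
    {
        foreach (var par in Equivalencias)
        {
            texto = Regex.Replace(texto, Regex.Escape(par.Key), par.Value,
                                  RegexOptions.IgnoreCase);
        }
        return texto;
    }

    // Canonicalizes both sides, then searches ignoring case, symbols and accents.
    public static bool Contem(string texto, string procurado)
    {
        var opcoes = CompareOptions.IgnoreCase
                   | CompareOptions.IgnoreSymbols
                   | CompareOptions.IgnoreNonSpace;
        return CultureInfo.InvariantCulture.CompareInfo
            .IndexOf(Canonizar(texto), Canonizar(procurado), opcoes) >= 0;
    }
}
```

With this sketch, ComparadorComEquivalencias.Contem("Cadastro de Usuários - São Paulo", "Cadastro de Usuários - SP") is expected to match, because the long form is rewritten to "SP" first. Word order still matters, so an input such as "SP Cadastro de Usuários" would need a different strategy.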

    
04.04.2018 / 18:21
1

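You can normalize the text so that differences in case and in Unicode representation no longer matter. String.Normalize takes a NormalizationForm value from the System.Text namespace. Note that normalization by itself does not remove accents; for that, the combining marks must also be dropped after decomposing with FormD.

using System.Text;

string s1 = "Cadastro de Usuários - SP";
// Canonical Unicode form plus lowercase; the accents are still present here.
string s2 = s1.Normalize(NormalizationForm.FormC).ToLowerInvariant();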
    
04.04.2018 / 16:03
1

You can add a helper class with extension methods on the string type, one for each treatment you want, removing whatever characters you see fit. See the example:

using System.Text;
using System.Text.RegularExpressions;

public static class StringHelper
{

    public static string RemoverAcentos(this string texto)
    {
        StringBuilder retorno = new StringBuilder();
        var arrTexto =
            texto.Normalize(NormalizationForm.FormD).ToCharArray();

        foreach (var letra in arrTexto)
        {
            if (System.Globalization.CharUnicodeInfo.GetUnicodeCategory(letra) !=
                System.Globalization.UnicodeCategory.NonSpacingMark)
                retorno.Append(letra);
        }
        return retorno.ToString();
    }

    public static string RemoverEspacamentos(this string texto)
    {       
        return texto.Replace("\t", "").Replace(" ", "");
    }

    public static string RemoverCaracteresEspeciais(this string texto) {
        string retorno = texto.RemoverAcentos();
        // Keep only lowercase letters, digits and whitespace.
        retorno = Regex.Replace(retorno.ToLower(), @"[^a-z0-9\s]", "");
        return retorno;
    }

}

And use as follows:

string entrada = "São Paulo SP";
string entradaNormalizada = entrada.RemoverCaracteresEspeciais()
                            .RemoverEspacamentos()
                            .ToLower();

string cadastro = "Cidade de São Paulo - SP";
string cadastroNormalizado = cadastro.RemoverCaracteresEspeciais()
                            .RemoverEspacamentos()
                            .ToLower();

bool comparacao = cadastroNormalizado.Contains(entradaNormalizada); // true

Still, this is only the first part of your journey: even after these basic treatments you will only get a positive result when the input is shorter than the stored text and its terms appear in the same order. If the input is, for example, "I live in the city of são paulo" or "SP - São Paulo", the comparison will be false.

From this point on, you should enrich your matching logic to work with a hit score: count how many terms of A appear in B and use that number to decide whether the comparison is valid.
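
A minimal sketch of such a hit score, assuming a simple definition (the fraction of distinct search terms that also occur in the candidate text), could look like this. The class PontuacaoDeTermos, its method names, and the idea of comparing the score against a threshold are illustrative assumptions, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

public static class PontuacaoDeTermos
{
    // Lowercases, strips accents (FormD, dropping non-spacing marks) and
    // splits on anything that is not a letter or digit; returns distinct terms.
    public static string[] Termos(string texto)
    {
        var limpo = new StringBuilder();
        foreach (var c in texto.ToLowerInvariant().Normalize(NormalizationForm.FormD))
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                limpo.Append(char.IsLetterOrDigit(c) ? c : ' ');
        }
        return limpo.ToString()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Distinct()
            .ToArray();
    }

    // Fraction (0.0 to 1.0) of the terms of A (procurado) that occur in B (candidato).
    public static double Pontuar(string procurado, string candidato)
    {
        var termosProcurados = Termos(procurado);
        if (termosProcurados.Length == 0) return 0.0;

        var termosCandidato = new HashSet<string>(Termos(candidato));
        int acertos = termosProcurados.Count(t => termosCandidato.Contains(t));
        return (double)acertos / termosProcurados.Length;
    }
}
```

Because terms are compared as a set, order no longer matters: PontuacaoDeTermos.Pontuar("São Paulo SP", "SP - São Paulo") scores all three terms as hits. You would then accept the match when the score reaches a threshold of your choice, for example 0.6; that value is an assumption to tune against your own data.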

But if you need something more sophisticated, you will need to adopt a search library that meets your needs, such as Lucene or RedDog.Search (https://github.com/reddog-io/RedDog.Search).

    
04.04.2018 / 16:37