How can I make a query with LIKE or REGEXP ignoring table words?

4

I have the following data in a table

-------------------
**usuarios**
-------------------
nome
------------------
Wallace de Souza Vizerra
------------------------
Gustavo Carmo da Costa

I need to return the records of the usuarios table that contains the given value in the nome field, however I need to ignore some existing words in the values of the nome

Example:

SELECT * FROM usuarios WHERE nome LIKE 'Wallace Souza Vizerra'
#Sem o "de" no nome

I would like the query to nome to ignore the words de , da , dos , das in the database.

How can I make this query, with LIKE or REGEXP, excluding words?

    
asked by anonymous 14.08.2015 / 18:45

3 answers

2

I had a similar problem here in my company. Basically, here we use two ways to identify names with typos.

The first is the levenshtein distance and the second is the soundex () function.

LEVENSHTEIN

Paraphrasing Wikipedia "Levenshtein distance or edit distance between two strings is given by the minimum number of operations required to transform a string into the other."

For example, the distance between "Guilherme Silva" and "Guilherme da Silva" is 3. The distance between "Maria Dores" and "Maria das Dores" is 4.

Here is the code to create the levenshtein () function in Mysql.

DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END$$
DELIMITER ;

To use, in your case, you can make the following query:

SELECT * FROM usuario as us WHERE levenshtein(us.nome, 'Wallace Silva') < 5 //Ou outro indice. Coloquei 5 para que ele pegue somente nomes que tenham no maximo indice 4 na distancia levenshtein.

If you have the following records in the database

--------------------
1 Wallace Silva
2 Wallace João da Silva
3 Wallace das Silva
4 Guilherme da Silva
--------------------

The query will return, in this case, only records 1 and 2.

SOUNDEX

The soundex () function is a function that can be used in database searches where you know the pronunciation but not exactly how it is written.

Soundex () is already native to Mysql and we do not need to create it manually.

In your case, the query looks something like this:

SELECT * FROM pessoa WHERE soundex(nome) = soundex('Wallace Silva')

The problem with using soundex () is that it works best with English words and does not work 100% in English.

COMBINED MODE

We can also create a query in which we combine the two forms of search so that we can have a greater probability of success.

We can use it like this:

SELECT * FROM pessoa WHERE levenshtein(soundex(nome), soundex('Wallace Silva')) < 3

Or even so:

SELECT * FROM pessoa WHERE (levenshtein(nome, 'Wallace Silva') < 5) OR soundex(nome) = soundex('Wallace Silva')

SOME CONSIDERATIONS

  • When we use more functions, we need more processing on the machine, that is, if we create a query 'little monster', it may take a while to run
  • The two functions have some bad points, I recommend you study them well and test VERY much before implementing some code
  • So far, these were the most effective ways to find similar names directly in the SQL query that I found and that served me. It does not mean that there are no better ones, it costs nothing to give a Googleada;)

At last, I hope I have helped! Good profit!

    
14.08.2015 / 20:02
1

Maybe not the most appropriate way, but even for the sake of performance (searching 'wildcard' values can be costly), my first idea for this problem would be to store a 'corrected' helper field without the terms you want to eliminate. The search would be done in this field (after also correcting the search string).

So we would have

Nome                NomeTratado
____                _____________
Ricardo de Melo     Ricardo Melo

Again, this would be my first idea. And I think there's a good chance that there's a better way.

Another approach would be to create an index with each part of the name. This may be interesting to make the search return values when the person types only the last name.

    
14.08.2015 / 19:50
0

I do not quite understand, but you could do a like after% after typing the last character.

Select nome From usuarios Where nome Like 'Wallace%'

or for each space you change it by the character "%" or before doing the Sql you change these prepositions and replace them with%.

There are several ways to do this.

    
14.08.2015 / 18:50