How to do similar word searches in SQL?

8

Let's suppose I have the following data in a table:

  

John

     

Pedro

     

Ronaldo

     

Luiz

If I use a query %like% it finds the user typing strictly. For example, if he types ron he finds Ronaldo . But if he types a different letter, ex: ronldo (typo) he does not think. But Google (and others) often find these words similar.

Is there any way to do this in SQL, regardless of bank?

    
asked by anonymous 02.06.2014 / 13:40

2 answers

9

In some databases you can do this type of search which is called Fuzzy Matching .

You can get this result using the soundex() and difference() functions. I know they exist in SQL Server and from a quick search I found that it is present in PostgreSQL through the module fuzzystrmatch .

The function soundex() returns a 4-character code of a passed string and the difference() function compares the difference between these codes, on a level from 0 to 4, where 0 is the exact match of the code: p>

CREATE TABLE s (nm text);

INSERT INTO s VALUES ('john');
INSERT INTO s VALUES ('joan');
INSERT INTO s VALUES ('wobbly');
INSERT INTO s VALUES ('jack');
INSERT INTO s VALUES ('ronaldo');

SELECT soundex('john') AS john, soundex('joan') AS joan, soundex('jack') AS jack; 
-- Retorna 'J500', 'J500' e 'J200'

SELECT * FROM s WHERE soundex(nm) = soundex('john');
-- Casa com 'jonh' e 'joan'    

SELECT * FROM s WHERE difference(s.nm, 'john') > 2;
-- Casa com 'jonh', 'joan' e 'jack'

SELECT * FROM s WHERE soundex(nm) = soundex('ronldo');
-- Casa com 'ronaldo'

SQLFiddle Example

In some simple cases, soundex() can solve your problem. According to the PostgreSQL documentation, this function has better results with English names.

For more complex implementations, there are other similarity comparison methods, such as metaphone() or levenshtein() (See examples in the fuzzystrmatch ).

In the most complex cases I recommend applying an index of type Full-Text Index for better performance on large databases.

See the documentation for the functions soundex() and difference() in SQL Server.

    
02.06.2014 / 14:22
3

Adding some nuances to the good response from @gmsantos ...

Metaphone for Portuguese names

In this question has been widely discussed the phonetic algorithm for Portuguese, which is more efficient than mathematical similarity of % quoted, or distances such as Hamming , # and others, which measure the similarity between any strings (even in genetics they use).

The question is in the direction of a more practical and already classic problem: grouping or equating proper names (street names, names of people, etc.). For example, "Joao"="joao", "Sylvia"="silvia", "Luíz"="luis", etc.

The experience of anyone who has ever worked ( documented in this article ) shows that the errors more frequent, spelling of names, has its origin in the spelling mistakes we make when we try to transcribe only what we hear. So focus on phonetics.

And the phonetics of Portuguese speakers are not phonetics of English speakers ... So the best solution is the best phonetic algorithm adapted to Portuguese ... And that exists!

This is the MetaphonePtBr .
(If you do not have access to install external functions on your server, generic difference fault is also still higher than Metaphone ).

In PostgreSQL (8.X or 9.X), once installed just do

SELECT metaphone_ptbr('Sylvia')=metaphone_ptbr('sillvya');
-- retorna TRUE  ('SV'=='SV')
SELECT metaphone_ptbr('Sylveira')=metaphone_ptbr('sillvya');
-- retona FALSE ('SVR'!='SV')
The great advantage of this method is that the comparison can be "cached", that is, part of the process can be stored before in the database (the metaphone of all names), so that the search for a given name , or grouping similar ones, is much faster than peer-to-peer evaluation by string similarity functions.

Since it allows the grouping, noma database with 1000 names for example, one can reduce the analysis to a group of 10 or 20 names, and on them to apply the more sophisticated functions (cost more CPU) of similarity of string .

    
28.08.2014 / 01:59