Here's a possible solution in Python:

    # Written for Python 3; requires the third-party 'unidecode' library
    from unidecode import unidecode

    ignore_list = ['de', 'do', 'da', 'dos', 'das']

    def parse_name(full_name):
        name_list = full_name.split()  # Split the full name into words
        new_name_list = []
        for name in name_list:  # Go through each word
            name = name.strip('.')  # Remove periods
            name = name.lower()  # Convert all letters to lowercase
            if not name or name in ignore_list:  # Skip empty tokens and prepositions
                continue
            name = unidecode(name)  # Remove accents
            new_name_list.append(name)
        return new_name_list

    def is_similar(a, b):
        a = parse_name(a)
        b = parse_name(b)
        if len(a) != len(b):  # If the number of words differs, return False
            return False
        for x, y in zip(a, b):
            if len(x) == 1 or len(y) == 1:  # If either word is a single letter...
                if x[0] != y[0]:  # ...compare only the first letter
                    return False
            else:  # Otherwise...
                if x != y:  # ...compare the whole word
                    return False
        return True  # If every word matches, return True

Example usage:

    a = 'José Luiz da Silva'
    b = 'José L. Silva'
    print(is_similar(a, b))  # Prints True

In this solution, the is_similar() function returns only True or False. Depending on your needs, a more flexible metric that returns a measure of distance may be more useful. For example:
- Names like 'José L. Silva' and 'José Luiz da Silva' would have distance 0 (they would be considered equal);
- Names like 'José Silva' and 'José Luiz da Silva' would have a small distance value (they would be considered similar);
- Names like 'José Silva' and 'Maria Souza' would have a large distance value (they would be considered quite different).
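
One way such a metric could look is sketched below. It is only an illustration under a few assumptions, not a definitive implementation: the function names (`tokens`, `same_word`, `name_distance`) and the cost values (`gap_cost`, `mismatch_cost`) are made up for this example, and accents are stripped with the standard library's `unicodedata` module as a stand-in for `unidecode`, so the snippet runs without extra packages. The idea is an edit distance computed over words rather than characters, where a missing word costs less than a mismatched one.

```python
import unicodedata

IGNORE = {'de', 'do', 'da', 'dos', 'das'}

def tokens(full_name):
    """Lowercase each word, strip periods and accents, drop prepositions."""
    result = []
    for word in full_name.split():
        word = word.strip('.').lower()
        # Standard-library stand-in for unidecode: decompose the characters,
        # then drop the combining accent marks.
        word = ''.join(c for c in unicodedata.normalize('NFKD', word)
                       if not unicodedata.combining(c))
        if word and word not in IGNORE:
            result.append(word)
    return result

def same_word(x, y):
    """Treat a single letter as an initial that matches any word starting with it."""
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]
    return x == y

def name_distance(a, b, gap_cost=0.5, mismatch_cost=1.0):
    """Word-level edit distance; the cost values are arbitrary illustrative choices."""
    a, b = tokens(a), tokens(b)
    # prev[j] holds the distance between the words of a seen so far and b[:j]
    prev = [j * gap_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap_cost]
        for j, y in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if same_word(x, y) else mismatch_cost)
            skip_in_a = prev[j] + gap_cost
            skip_in_b = curr[j - 1] + gap_cost
            curr.append(min(substitute, skip_in_a, skip_in_b))
        prev = curr
    return prev[-1]

print(name_distance('José L. Silva', 'José Luiz da Silva'))  # 0.0
print(name_distance('José Silva', 'José Luiz da Silva'))     # 0.5
print(name_distance('José Silva', 'Maria Souza'))            # 2.0
```

With these costs, the three cases from the list come out as 0, 0.5 and 2.0. The threshold below which two names count as "similar" would still have to be chosen for the data at hand.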