How to check whether the contents of two string columns in a data frame are similar

3

I have a data frame in which I need to measure how similar the contents of two columns are.

For example: column a = “José Luiz da Silva” and column b = “José L. Silva”. How can I indicate that column a and column b are similar?

    
asked by anonymous 01.07.2017 / 03:42

2 answers

1

Here's a possible solution in Python:

from unidecode import unidecode

# Portuguese prepositions that are ignored when comparing names
ignore_list = ['de', 'do', 'da', 'dos', 'das']

def parse_name(full_name):
    name_list = full_name.split()  # Split the full name into words
    new_name_list = []
    for name in name_list:  # Go through each word
        name = name.strip('.')  # Remove periods
        name = name.lower()  # Convert all letters to lowercase
        if not name or name in ignore_list:  # Skip empty words and prepositions
            continue
        name = unidecode(name)  # Remove accents (requires the 'unidecode' library)
        new_name_list.append(name)
    return new_name_list

def is_similar(a, b):
    a = parse_name(a)
    b = parse_name(b)
    if len(a) != len(b):  # If the number of words differs, return False
        return False
    for x, y in zip(a, b):
        if (len(x) == 1) or (len(y) == 1):  # If either word is a single letter...
            if x[0] != y[0]:  # ...compare only the first letter
                return False
        else:  # Otherwise...
            if x != y:  # ...compare the whole word
                return False
    return True  # If every word matched, return True

Example usage:

a = 'José Luiz da Silva'
b = 'José L. Silva'
print(is_similar(a, b))  # Prints True

In this solution, the is_similar() function returns only True or False. Depending on your needs, a more flexible metric that returns a distance may be more useful. For example:

  • Names like 'José L. Silva' and 'José Luiz da Silva' would have distance 0 (would be considered equal);
  • Names like 'José Silva' and 'José Luiz da Silva' would have a small distance value (would be considered similar);
  • Names like 'José Silva' and 'Maria Souza' would have a large distance value (would be considered quite different).
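
One way such a distance could be sketched is shown below. It is only an illustration under my own assumptions, not a standard metric: the names `tokens`, `same_word` and `name_distance` are hypothetical, accents are stripped with the standard-library `unicodedata` module instead of `unidecode`, and the distance is defined as one minus the share of words that line up in order (a longest-common-subsequence count over words, where an initial matches any word starting with the same letter).

```python
import unicodedata

# Portuguese prepositions ignored in the comparison (same list as above)
IGNORED = {"de", "do", "da", "dos", "das"}

def tokens(full_name):
    """Lowercase, strip accents and periods, and drop prepositions."""
    decomposed = unicodedata.normalize("NFKD", full_name.lower())
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    words = (word.strip(".") for word in plain.split())
    return [word for word in words if word and word not in IGNORED]

def same_word(x, y):
    """An initial matches any word with the same first letter."""
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]
    return x == y

def name_distance(a, b):
    """Return a value in [0, 1]: 0 means equal, 1 means nothing in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    # Longest common subsequence of words, computed by dynamic programming
    table = [[0] * (len(tb) + 1) for _ in range(len(ta) + 1)]
    for i, x in enumerate(ta, 1):
        for j, y in enumerate(tb, 1):
            if same_word(x, y):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    matched = table[-1][-1]
    return 1.0 - 2.0 * matched / (len(ta) + len(tb))

print(round(name_distance("José L. Silva", "José Luiz da Silva"), 2))  # 0.0
print(round(name_distance("José Silva", "José Luiz da Silva"), 2))     # 0.2
print(round(name_distance("José Silva", "Maria Souza"), 2))            # 1.0
```

Where to draw the line between "similar" and "different" (for instance, treating anything below 0.3 as similar) depends on the data and would need to be tuned.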
01.07.2017 / 18:06
1

(TL;DR)

Testing the similarity ratio between two strings:

# Testing the similarity ratio
from difflib import SequenceMatcher

def sml(x, y):
    return SequenceMatcher(None, x, y).ratio()

x = 'José Luiz da Silva'
y = 'José L. Silva'
msg = "Similarity ratio "

print(msg, 'between x and y: ', sml(x, y))
print(msg, 'between x and x: ', sml(x, x))

Output:

Similarity ratio  between x and y:  0.7741935483870968
Similarity ratio  between x and x:  1.0

Run the code on repl.it.
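
To apply this to two columns rather than a single pair, the ratio can be computed row by row and compared against a cutoff. The sketch below is a rough illustration: the columns are plain Python lists with made-up rows, the helper name `similarity` is hypothetical, and the 0.7 threshold is an arbitrary assumption that would need tuning on real data.

```python
from difflib import SequenceMatcher

def similarity(x, y):
    """Similarity ratio between two strings, ignoring letter case."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

# Hypothetical columns; in practice these would come from the data frame
column_a = ["José Luiz da Silva", "José Silva"]
column_b = ["José L. Silva", "Maria Souza"]

THRESHOLD = 0.7  # assumed cutoff, adjust to your data

scores = [similarity(a, b) for a, b in zip(column_a, column_b)]
flags = [score >= THRESHOLD for score in scores]

for a, b, score, flag in zip(column_a, column_b, scores, flags):
    print(a, "|", b, "|", round(score, 2), "|", "similar" if flag else "different")
```

If the data lives in a pandas DataFrame, the same function could be applied to each row with `DataFrame.apply(..., axis=1)` and the result stored in a new column.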

    
01.07.2017 / 22:25