This problem has two stages. The first is to correct typos in a database, perhaps with a probabilistic approach. The second is to clean up the database once the typos are fixed, which will probably take a sequence of dplyr operations (or another suitable, elegant package).
Let's start with the first stage. I have a company's database, which does not fully reveal each worker's identity. I will show the data first and then explain the variables.
data <- read.table(text="
cpf;nome;m1;m2;m3;m4;m5;m6;m7;m8;m9;m10;m11;m12;salario
100001;Maria dos Santos Magalhães;1;0;0;0;0;0;0;0;1;0;0;0;1234
100001;Maria Santos Magalhães;0;1;1;1;1;1;1;1;0;1;1;1;1034
100002;Lucas Barbosa;1;1;1;1;1;1;1;1;1;1;1;1;4234
100002;Danilo Carvalho;1;1;1;1;1;1;1;1;1;1;1;0;7234
100003;Paulo Silva de Fonseca;0;1;1;1;1;1;1;1;1;1;1;0;1254
100003;Paulo Silva da Fonseca;0;0;0;0;0;0;0;0;0;0;0;1;2234
100003;Wagner Silva Junior;1;1;1;0;0;0;0;0;0;0;0;0;4234
100003;Paulo Silva Fonseca;1;0;0;0;0;0;0;0;0;0;0;0;1232
100004;Ricardo Colho;1;1;1;1;1;1;1;0;1;1;1;0;5234
100004;Ricardo Coelho;0;0;0;0;0;0;0;1;0;0;0;1;1234", header = TRUE, sep = ";")
Now for the variables. First, we do not have the full cpf, only its 6 middle digits. The variable nome (name) needs no explanation. The variables m1, m2, m3, etc. are the months: they are binary, where 1 means the worker worked in that month and 0 means they did not. The variable salario (salary) is the amount the worker earned over the months worked. The data presented here are fictitious.
The first thing to look for within each cpf group is typing errors. For example, in the group whose middle cpf digits are 100001, there is a good chance that Maria dos Santos Magalhães and Maria Santos Magalhães are the same person. Further evidence is that their months of work do not overlap: if they were two different people, they would probably have months of work in common, as with cpf 100002, where Lucas Barbosa and Danilo Carvalho are different people. The same reasoning applies to the other cases.
I need an algorithm that tells me, for example, that Maria dos Santos Magalhães and Maria Santos Magalhães are, with high probability, the same person, and likewise that Lucas Barbosa and Danilo Carvalho are almost certainly different people.
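The month-overlap argument above can be checked directly. The following is a minimal sketch, not a definitive test: `shared_months` is a hypothetical helper introduced only for illustration, and the month vectors are copied by hand from the example rows for cpf 100001 and 100002.

```r
# Month flags (m1..m12) copied from the example data.
maria1 <- c(1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0)  # Maria dos Santos Magalhães
maria2 <- c(0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1)  # Maria Santos Magalhães
lucas  <- rep(1, 12)                             # Lucas Barbosa
danilo <- c(rep(1, 11), 0)                       # Danilo Carvalho

# Hypothetical helper: number of months in which both rows are flagged as worked.
shared_months <- function(a, b) sum(a == 1 & b == 1)

shared_months(maria1, maria2)  # 0: no overlap, consistent with one person
shared_months(lucas, danilo)   # 11: heavy overlap, consistent with two people
```

A zero overlap does not prove two rows belong to the same person; it is only one piece of evidence to combine with the name comparison.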
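One simple way to get a score of this kind, short of a true probability, is to scale the edit distance by the length of the longer name. This is a sketch under that assumption: `name_dissimilarity` is a hypothetical helper, not a function from any package, and it relies only on `adist` from the utils package that ships with R. The pairs used are the Ricardo and Lucas/Danilo rows from the example data; the same call applies to the Maria pair.

```r
# Hypothetical helper: edit distance divided by the length of the longer name,
# so 0 means identical and values near 1 mean the names share very little.
name_dissimilarity <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  drop(adist(a, b)) / max(nchar(a), nchar(b))
}

# One missing letter over 14 characters: a small score, likely a typo.
name_dissimilarity("Ricardo Colho", "Ricardo Coelho")

# Unrelated names: a much larger score.
name_dissimilarity("Lucas Barbosa", "Danilo Carvalho")
```

Where to put the cutoff between "same person" and "different people" is a judgment call that this sketch does not settle.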
An attempt using adist:
library(dplyr)  # provides the %>% pipe used below
teste <- data[data$cpf == 100003, ]
(ch1 <- teste$nome)
[1] Paulo Silva de Fonseca Paulo Silva da Fonseca Wagner Silva Junior
[4] Paulo Silva Fonseca
10 Levels: Danilo Carvalho Lucas Barbosa ... Wagner Silva Junior
(d1 <- ch1 %>% adist())
     [,1] [,2] [,3] [,4]
[1,]    0    1   14    3
[2,]    1    0   14    3
[3,]   14   14    0   11
[4,]    3    3   11    0
I'll keep only the pairs whose distance is non-zero and below a default threshold of 5. But first I'll name the rows and columns.
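(d1 <- as.data.frame(d1))
names(d1) <- ch1      # label the columns with the names being compared
row.names(d1) <- ch1  # label the rows the same way
thresh <- 5           # names closer than this edit distance count as likely matches
(teste <- which(d1 != 0 & d1 < thresh, arr.ind = TRUE))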
row col
Paulo Silva da Fonseca 2 1
Paulo Silva Fonseca 4 1
Paulo Silva de Fonseca 1 2
Paulo Silva Fonseca 4 2
Paulo Silva de Fonseca 1 4
Paulo Silva da Fonseca 2 4
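The pairwise matches above can be turned into one group label per name. The sketch below swaps in a technique that is not in the original attempt: single-linkage clustering with `hclust` and `cutree` from the stats package. The cut height `h = 4` is an assumption chosen to mirror "distance below 5", since `adist` returns whole numbers and `cutree` keeps merges at or below the height given.

```r
nomes <- c("Paulo Silva de Fonseca", "Paulo Silva da Fonseca",
           "Wagner Silva Junior", "Paulo Silva Fonseca")

d <- adist(nomes)
dimnames(d) <- list(nomes, nomes)

# Single linkage: two names share a group when a chain of pairs,
# each within the cut height, connects them.
arvore <- hclust(as.dist(d), method = "single")
grupo  <- cutree(arvore, h = 4)

grupo  # the three "Paulo" spellings share one label; Wagner gets his own
```

With a label per name, the rows of each group can then be collapsed in a single grouped summary.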
Note that in this particular case Wagner Silva Junior has no connection to the others. This is where the second stage begins: with this distance matrix, I would like to perform a series of manipulations to consolidate the names, the months worked and the salary. In short, I would like something like this:
     cpf                   nome m1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11 m12 salario
2 100001 Maria Santos Magalhães  1  1  1  1  1  1  1  1  1   1   1   1    2268
3 100002          Lucas Barbosa  1  1  1  1  1  1  1  1  1   1   1   1    4234
4 100002        Danilo Carvalho  1  1  1  1  1  1  1  1  1   1   1   0    7234
5 100003 Paulo Silva de Fonseca  1  1  1  1  1  1  1  1  1   1   1   1    4720
7 100003    Wagner Silva Junior  1  1  1  0  0  0  0  0  0   0   0   0    4234
9 100004          Ricardo Colho  1  1  1  1  1  1  1  1  1   1   1   1    6468
I believe a series of dplyr functions can solve this second stage.
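One possible shape for that dplyr stage is sketched below. It rests on several assumptions read off the desired output rather than stated rules: months are combined with `max` (a month counts as worked if any merged row has it), salaries are summed, and the spelling kept is the one from the row with the most months worked. `agrupar_nomes` is a hypothetical helper that groups names within one cpf by single-linkage clustering on `adist` with an assumed cut height of 4, and the code assumes dplyr 1.0 or later for `across` and `.groups`.

```r
library(dplyr)

data <- read.table(text = "
cpf;nome;m1;m2;m3;m4;m5;m6;m7;m8;m9;m10;m11;m12;salario
100001;Maria dos Santos Magalhães;1;0;0;0;0;0;0;0;1;0;0;0;1234
100001;Maria Santos Magalhães;0;1;1;1;1;1;1;1;0;1;1;1;1034
100002;Lucas Barbosa;1;1;1;1;1;1;1;1;1;1;1;1;4234
100002;Danilo Carvalho;1;1;1;1;1;1;1;1;1;1;1;0;7234
100003;Paulo Silva de Fonseca;0;1;1;1;1;1;1;1;1;1;1;0;1254
100003;Paulo Silva da Fonseca;0;0;0;0;0;0;0;0;0;0;0;1;2234
100003;Wagner Silva Junior;1;1;1;0;0;0;0;0;0;0;0;0;4234
100003;Paulo Silva Fonseca;1;0;0;0;0;0;0;0;0;0;0;0;1232
100004;Ricardo Colho;1;1;1;1;1;1;1;0;1;1;1;0;5234
100004;Ricardo Coelho;0;0;0;0;0;0;0;1;0;0;0;1;1234",
  header = TRUE, sep = ";", stringsAsFactors = FALSE)

meses <- paste0("m", 1:12)
data$n_meses <- rowSums(data[meses])  # months worked per row, used to pick a spelling

# Hypothetical helper: one group label per name within a cpf.
# Names linked by a chain of edit distances <= h share a label (h = 4 is an assumption).
agrupar_nomes <- function(nomes, h = 4) {
  if (length(nomes) < 2) return(rep(1L, length(nomes)))
  cutree(hclust(as.dist(adist(nomes)), method = "single"), h = h)
}

resultado <- data %>%
  group_by(cpf) %>%
  mutate(grupo = agrupar_nomes(nome)) %>%
  group_by(cpf, grupo) %>%
  summarise(
    nome = nome[which.max(n_meses)],  # spelling from the row with the most months
    across(all_of(meses), max),       # union of the months worked
    salario = sum(salario),           # total over the merged rows
    .groups = "drop"
  ) %>%
  select(-grupo)

resultado
```

The name-only grouping ignores the month-overlap evidence, so two genuinely different people with near-identical names under the same cpf digits would be merged; checking for shared months before merging would be a natural refinement.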