Is there any way to eliminate duplicate elements that are not exactly the same?

dados1 <- c("10 ANOS DA POLÍTICA NACIONAL DE PROMOÇÃO DA SAÚDE: TRAJETÓRIAS E DESAFIOS", "4-CYCLOPROPYL-1-(1-METHYL-4-NITRO-1H-IMIDAZOL-5-YL)-1H-1,2,3-TRIAZOLE AND ETHYL 1-(1-METHYL-4-NITRO-1H-IMIDAZOL-5-YL)-1H-1,2,3-TRIAZOLE-4-CARBOXYLATE","7,7-DIMETHYLAPORPHINE AND OTHER ALKALOIDS FROM THE BARK OF", "ABSCESSO DO MÚSCULO PSOAS ASSOCIADO À INFECÇÃO POR MYCOBACTERIUM TUBERCULOSIS EM PACIENTE COM AIDS", "ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE","ABUNDANCE OF LUTZOMYIA LONGIPALPIS", "ABUSO E DEPENDÊNCIA DE DROGAS NA PERSPECTIVA DA SAÚDE PÚBLICA (EDITORIAL)")

qualis <- c("A2", "B3", "A1", "B2", "A2", "A2", "A1")

m <- data.frame("Título da Produção" = dados1,
                "Qualis" = qualis,
                "Ano" = c(2010:2016))

The data frame above is only illustrative. Note that the fifth and sixth elements of "dados1" are essentially the same title, but because they are not written identically I cannot use duplicated or unique.

Is there another way to clean up these rows by matching on the title?

    
asked by anonymous 08.08.2016 / 18:46

1 answer


I've written a function that can help you. It uses the stringdist package, which calculates the distance between strings:

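combinar_textos_parecidos <- function(x, max_dist){
  x <- as.character(x)
  # Matrix of pairwise distances between all titles (default method: "osa")
  distancias <- stringdist::stringdistmatrix(x, x)
  # seq_along() is safe even when x is empty, unlike 1:length(x)
  for(i in seq_along(x)){
    # Titles closer than max_dist to title i (this always includes i itself)
    small_dist <- distancias[i,] < max_dist
    if(sum(small_dist) > 1){
      # Overwrite the whole group with its first member
      x[small_dist] <- x[which(small_dist)[1]]
    }
  }
  return(x)
}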
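To see what the function is comparing, it can help to look at the distance matrix directly. The following is a minimal sketch, assuming the stringdist package is installed and using three of the titles from the question; the variable name titulos is only for illustration.

```r
# Requires the stringdist package: install.packages("stringdist")
titulos <- c("ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE",
             "ABUNDANCE OF LUTZOMYIA LONGIPALPIS",
             "7,7-DIMETHYLAPORPHINE AND OTHER ALKALOIDS FROM THE BARK OF")

# Pairwise distances; the default method is "osa" (optimal string alignment)
distancias <- stringdist::stringdistmatrix(titulos, titulos)

# The first two titles differ only by the six characters of " TESTE"
distancias[1, 2]       # 6

# With a cutoff of 10, titles 1 and 2 land in the same group; title 3 does not
distancias[1, ] < 10   # TRUE TRUE FALSE
```

Each row of the matrix answers "which titles are close to this one?", which is exactly what the loop in the function checks.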

See what it returns when I apply it to your Título.da.Produção vector. Items 5 and 6 now have exactly the same title.

combinar_textos_parecidos(m$Título.da.Produção, 10)
[1] "10 ANOS DA POLÍTICA NACIONAL DE PROMOÇÃO DA SAÚDE: TRAJETÓRIAS E DESAFIOS"                                                                            
[2] "4-CYCLOPROPYL-1-(1-METHYL-4-NITRO-1H-IMIDAZOL-5-YL)-1H-1,2,3-TRIAZOLE AND ETHYL 1-(1-METHYL-4-NITRO-1H-IMIDAZOL-5-YL)-1H-1,2,3-TRIAZOLE-4-CARBOXYLATE"
[3] "7,7-DIMETHYLAPORPHINE AND OTHER ALKALOIDS FROM THE BARK OF"                                                                                           
[4] "ABSCESSO DO MÚSCULO PSOAS ASSOCIADO À INFECÇÃO POR MYCOBACTERIUM TUBERCULOSIS EM PACIENTE COM AIDS"                                                   
[5] "ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE"                                                                                                             
[6] "ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE"                                                                                                             
[7] "ABUSO E DEPENDÊNCIA DE DROGAS NA PERSPECTIVA DA SAÚDE PÚBLICA (EDITORIAL)" 

So doing:

m$Título.da.Produção <- combinar_textos_parecidos(m$Título.da.Produção, 10)
m[!duplicated(m$Título.da.Produção),]

Row 6 would be dropped: duplicated flags the second occurrence of a repeated title, so row 5 is the one that is kept.
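Which of the two rows survives depends on how duplicated marks repeats: it flags every occurrence after the first. A small base R sketch with made-up values shows this, along with the fromLast argument for keeping the last occurrence instead.

```r
# Made-up values for illustration: "B" appears at positions 2 and 3
titulos <- c("A", "B", "B", "C")

# By default the later occurrence is flagged, so position 2 is kept
primeiro <- duplicated(titulos)
primeiro                    # FALSE FALSE  TRUE FALSE

# With fromLast = TRUE the earlier occurrence is flagged, so position 3 is kept
ultimo <- duplicated(titulos, fromLast = TRUE)
ultimo                      # FALSE  TRUE FALSE FALSE

# Positions that survive m[!duplicated(...), ]-style filtering in each case
which(!primeiro)            # 1 2 4
which(!ultimo)              # 1 3 4
```

If the later row carries the Qualis or Ano you want to keep, passing fromLast = TRUE to duplicated is one way to get it.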

Note: I used a distance of 10 as the cutoff. You may want to be more or less tolerant of how close the strings must be. To do that, just adjust the max_dist parameter of my function.
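One way to pick a cutoff is to try a few values and count how many distinct titles remain. The sketch below assumes the stringdist package is installed; the function is repeated so the block runs on its own, and the three cutoffs are arbitrary values chosen for illustration.

```r
# Requires the stringdist package: install.packages("stringdist")
combinar_textos_parecidos <- function(x, max_dist){
  x <- as.character(x)
  distancias <- stringdist::stringdistmatrix(x, x)
  for(i in seq_along(x)){
    small_dist <- distancias[i,] < max_dist
    if(sum(small_dist) > 1){
      x[small_dist] <- x[which(small_dist)[1]]
    }
  }
  return(x)
}

titulos <- c("ABUNDANCE OF LUTZOMYIA LONGIPALPIS TESTE",
             "ABUNDANCE OF LUTZOMYIA LONGIPALPIS",
             "7,7-DIMETHYLAPORPHINE AND OTHER ALKALOIDS FROM THE BARK OF")

# Number of distinct titles left after merging, for each cutoff
restantes <- sapply(c(6, 7, 100), function(d){
  length(unique(combinar_textos_parecidos(titulos, d)))
})
restantes   # 3 2 1
```

The comparison in the function is strict (<), so a cutoff of 6 does not merge two titles that are exactly 6 edits apart, while 7 does. A very large cutoff such as 100 collapses everything into the first title, which is a reminder to check the merged result before dropping rows.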

You can read more about how the distances are calculated by typing help("stringdist-metrics") in your R console.

    
08.08.2016 / 20:09