There is no magic way to do this. I'm going to propose a way that might speed up your manual work, because at some point you will have to check and see which ones are well graded or not.
With the following code:
combinar_textos_parecidos <- function(x, max_dist, method = "lv"){
x <- as.character(x)
distancias <- stringdist::stringdistmatrix(x, x, method = method)
for(i in 1:length(x)){
small_dist <- distancias[i,] < max_dist
if(sum(small_dist) > 1){
x[small_dist] <- x[which(small_dist)[1]]
}
}
return(x)
}
Copy of the question linked and adapted to choose the distance calculation method.
library(stringr)
textos %>% str_replace_all("-", " ") %>%
str_replace_all("[:space:]{1,}", " ") %>%
combinar_textos_parecidos(0.4, method = "jaccard")
[1] "ADASA/DF" "ADASA AGENCIA DE AGUAS" "ADASA/DF" "CASA CIVIL"
[5] "CASA CIVIL" "CASA CIVIL" "CASA CIVIL" "CASA CIVIL"
[9] "CASA CIVIL" "CASA CIVIL" "CASA CIVIL"
That way, in your example you would need to change only 1 element of your vector.
Of course, this is just an alternative. You could also create an auxiliary table with all possible companies and make a join using the fuzzyjoin
package. For some reason the package does not join when one of the tables has only one column (I'll open an issue in github), but you can use it.
Example:
textos <- data.frame(emp = c("ADASA/DF", "ADASA-AGENCIA DE AGUAS", "ADASA-DF",
"CASA CIVIL", "CASA CIVIL - DF", "CASA CIVIL DA GOVERNADORIA DF", "CASA CIVIL DO DF", "CASA CIVIL DO DF.", "CASA CIVIL DO DIST. FEDERAL", "CASA CIVIL DO DISTRITO FEDERAL", "SEC. ESTADO DA CASA CIVIL"),
stringsAsFactors = F, a = 1
)
empresas <- data.frame(emp = c("ADASA", "CASA CIVIL"), emp2 = c("ADASA", "CASA CIVIL"), stringsAsFactors = F)
library(fuzzyjoin)
regex_left_join(textos, empresas, by = c(emp = "emp"))
emp.x a emp.y emp2
1 ADASA/DF 1 ADASA ADASA
2 ADASA-AGENCIA DE AGUAS 1 ADASA ADASA
3 ADASA-DF 1 ADASA ADASA
4 CASA CIVIL 1 CASA CIVIL CASA CIVIL
5 CASA CIVIL - DF 1 CASA CIVIL CASA CIVIL
6 CASA CIVIL DA GOVERNADORIA DF 1 CASA CIVIL CASA CIVIL
7 CASA CIVIL DO DF 1 CASA CIVIL CASA CIVIL
8 CASA CIVIL DO DF. 1 CASA CIVIL CASA CIVIL
9 CASA CIVIL DO DIST. FEDERAL 1 CASA CIVIL CASA CIVIL
10 CASA CIVIL DO DISTRITO FEDERAL 1 CASA CIVIL CASA CIVIL
11 SEC. ESTADO DA CASA CIVIL 1 CASA CIVIL CASA CIVIL