Text Cleanup and Standardization

3

I have an Excel table already in the R with two columns (code and company name) and 22,000 rows. The names have been filled out freely, so the same name is written in different forms, abbreviated and in full.

I've been able to turn everything into a box, but I need to turn the names to a single pattern, so I can not repeat the same company with different descriptions.

Examples of situations of different descriptions for the same company:

  • ADASA/DF , ADASA-AGENCIA DE AGUAS , ADASA-DF It would be transformed to ADASA ;

  • / p>
  • I researched similar questions on the site, but I did not find an answer. Since I am a beginner in programming and in R, you may not know the best term to search for.

    After standardizing the texts in the "company name" column, I need to achieve as a result all the codes associated with that company, without repeating the name of the company.

        
    asked by anonymous 16.08.2016 / 20:48

    1 answer

    2

    There is no magic way to do this. I'm going to propose a way that might speed up your manual work, because at some point you will have to check and see which ones are well graded or not.

    With the following code:

    combinar_textos_parecidos <- function(x, max_dist, method = "lv"){
      x <- as.character(x)
      distancias <- stringdist::stringdistmatrix(x, x, method = method)
      for(i in 1:length(x)){
        small_dist <- distancias[i,] < max_dist
        if(sum(small_dist) > 1){
          x[small_dist] <- x[which(small_dist)[1]] 
        }
      }
      return(x)
    }
    

    Copy of the question linked and adapted to choose the distance calculation method.

    library(stringr)
    textos %>% str_replace_all("-", " ") %>%
      str_replace_all("[:space:]{1,}", " ") %>%
      combinar_textos_parecidos(0.4, method = "jaccard") 
    
     [1] "ADASA/DF"               "ADASA AGENCIA DE AGUAS" "ADASA/DF"               "CASA CIVIL"            
     [5] "CASA CIVIL"             "CASA CIVIL"             "CASA CIVIL"             "CASA CIVIL"            
     [9] "CASA CIVIL"             "CASA CIVIL"             "CASA CIVIL"            
    

    That way, in your example you would need to change only 1 element of your vector.

    Of course, this is just an alternative. You could also create an auxiliary table with all possible companies and make a join using the fuzzyjoin package. For some reason the package does not join when one of the tables has only one column (I'll open an issue in github), but you can use it.

    Example:

    textos <- data.frame(emp = c("ADASA/DF",  "ADASA-AGENCIA DE AGUAS", "ADASA-DF",
                                              "CASA CIVIL", "CASA CIVIL - DF", "CASA CIVIL DA GOVERNADORIA DF", "CASA CIVIL DO DF", "CASA CIVIL DO DF.", "CASA CIVIL DO DIST. FEDERAL", "CASA CIVIL DO DISTRITO FEDERAL", "SEC. ESTADO DA CASA CIVIL"),
                         stringsAsFactors = F, a = 1
    )
    
    empresas <- data.frame(emp = c("ADASA", "CASA CIVIL"), emp2 = c("ADASA", "CASA CIVIL"), stringsAsFactors = F)
    library(fuzzyjoin)
    regex_left_join(textos, empresas, by = c(emp = "emp"))
    
                                emp.x a      emp.y       emp2
    1                        ADASA/DF 1      ADASA      ADASA
    2          ADASA-AGENCIA DE AGUAS 1      ADASA      ADASA
    3                        ADASA-DF 1      ADASA      ADASA
    4                      CASA CIVIL 1 CASA CIVIL CASA CIVIL
    5                 CASA CIVIL - DF 1 CASA CIVIL CASA CIVIL
    6   CASA CIVIL DA GOVERNADORIA DF 1 CASA CIVIL CASA CIVIL
    7                CASA CIVIL DO DF 1 CASA CIVIL CASA CIVIL
    8               CASA CIVIL DO DF. 1 CASA CIVIL CASA CIVIL
    9     CASA CIVIL DO DIST. FEDERAL 1 CASA CIVIL CASA CIVIL
    10 CASA CIVIL DO DISTRITO FEDERAL 1 CASA CIVIL CASA CIVIL
    11      SEC. ESTADO DA CASA CIVIL 1 CASA CIVIL CASA CIVIL
    
        
    16.08.2016 / 22:02