Compare rows in a column

2

I have a spreadsheet with about 15k lines, my problem is that in one of the most important columns for my analysis, the data is arranged in a very confusing way. The column is about the complaint that led to a lawsuit, for example: moral damages.

It happens that on all lines the same type of complaint is written in different ways (eg moral damage, moral damages), can someone give me a light on how to normalize this column?

In this third part of the image are the motivations of the process, and as you can see there are thousands of variables.

    
asked by anonymous 26.09.2017 / 20:10

1 answer

4

Huuum, I've been on your side already. I'd say you have two options:

  • Based on an analysis of the terms used, create a function that normalizes them, in the style of billiards CASE WHEN of SQL , or if_else of R or dplyr::case_when . >

  • Based on similarity of strings I would try to put those that have more than a certain number of characters and more than x% of similarity with the same term. In R has the stringdist package that does this.

Option 1 seems to me to be more "correct" in terms of criteria for classifying the motifs. So I'll tell you a bit more about how she would do it.

How would I do option 1

As would be a word count analysis, I'd use the tidytext package resources to count those words and get the 2-grams and 3-grams that are most common and try to put the most similar ones (and equal depending on your knowledge of domain). With that I think you could kill a lot of the problem. The very granular cases, I would put into a category 'others' without weight in consciousness.

What would the code look like?

I'll leave a snippet of how I would write this code, using the stringr , dplyr and purrr packages.

library(dplyr)
library(purrr)
library(stringr)

## dados para teste
toy_df <- tibble(motivo = c('danos morais', 'dano moral', 'danos materiais', 'dano x'))

## dados para facilitar a criação de vetores
make_vector <- function(string) stringr::str_split(string, pattern = ', ') %>% purrr::as_vector()


## a transformação em si

toy_df %>% 
mutate(classe = case_when(
  motivo %in% make_vector('dano moral, danos morais, danos whatever') ~ 'dano_moral',
  motivo %in% make_vector('danos materiais, dano material') ~ 'dano_material',       
  TRUE ~ 'outros'     
))

I prefer the use of dplyr::case_when because it is vectorized and ends up processing voluminous data a little faster. The syntax, however strange, is more functional than the if_else (where you would be repeating if_else for everything that is side).

The make_vector function is there just to make it easier to create string vectors with a cleaner syntax. In a perfect world these vectors that served to put the motifs in the same classes would come out of an analysis of yours, or something as.

    
26.09.2017 / 22:35