Compare rows in a column

Question

Compare rows in a column

Navigation

#1 by (4 votes)

2

I have a spreadsheet with about 15k lines, my problem is that in one of the most important columns for my analysis, the data is arranged in a very confusing way. The column is about the complaint that led to a lawsuit, for example: moral damages.

It happens that on all lines the same type of complaint is written in different ways (eg moral damage, moral damages), can someone give me a light on how to normalize this column?

In this third part of the image are the motivations of the process, and as you can see there are thousands of variables.

r

asked by anonymous 26.09.2017 / 20:10

1 answer

Generate NFe Key What does window.location.reload do?

score 4 · Accepted Answer

Huuum, I've been on your side already. I'd say you have two options:

Based on an analysis of the terms used, create a function that normalizes them, in the style of billiards CASE WHEN of SQL , or if_else of R or dplyr::case_when . >
Based on similarity of strings I would try to put those that have more than a certain number of characters and more than x% of similarity with the same term. In R has the stringdist package that does this.

Option 1 seems to me to be more "correct" in terms of criteria for classifying the motifs. So I'll tell you a bit more about how she would do it.

How would I do option 1

As would be a word count analysis, I'd use the tidytext package resources to count those words and get the 2-grams and 3-grams that are most common and try to put the most similar ones (and equal depending on your knowledge of domain). With that I think you could kill a lot of the problem. The very granular cases, I would put into a category 'others' without weight in consciousness.

What would the code look like?

I'll leave a snippet of how I would write this code, using the stringr , dplyr and purrr packages.

library(dplyr)
library(purrr)
library(stringr)

## dados para teste
toy_df <- tibble(motivo = c('danos morais', 'dano moral', 'danos materiais', 'dano x'))

## dados para facilitar a criação de vetores
make_vector <- function(string) stringr::str_split(string, pattern = ', ') %>% purrr::as_vector()


## a transformação em si

toy_df %>% 
mutate(classe = case_when(
  motivo %in% make_vector('dano moral, danos morais, danos whatever') ~ 'dano_moral',
  motivo %in% make_vector('danos materiais, dano material') ~ 'dano_material',       
  TRUE ~ 'outros'     
))

I prefer the use of dplyr::case_when because it is vectorized and ends up processing voluminous data a little faster. The syntax, however strange, is more functional than the if_else (where you would be repeating if_else for everything that is side).

The make_vector function is there just to make it easier to create string vectors with a cleaner syntax. In a perfect world these vectors that served to put the motifs in the same classes would come out of an analysis of yours, or something as.