Huuum, I've been on your side already. I'd say you have two options:
-
Based on an analysis of the terms used, create a function that normalizes them, in the style of billiards CASE WHEN
of SQL
, or if_else
of R
or dplyr::case_when
. >
-
Based on similarity of strings I would try to put those that have more than a certain number of characters and more than x% of similarity with the same term. In R
has the stringdist
package that does this.
Option 1 seems to me to be more "correct" in terms of criteria for classifying the motifs. So I'll tell you a bit more about how she would do it.
How would I do option 1
As would be a word count analysis, I'd use the tidytext
package resources to count those words and get the 2-grams and 3-grams that are most common and try to put the most similar ones (and equal depending on your knowledge of domain). With that I think you could kill a lot of the problem. The very granular cases, I would put into a category 'others' without weight in consciousness.
What would the code look like?
I'll leave a snippet of how I would write this code, using the stringr
, dplyr
and purrr
packages.
library(dplyr)
library(purrr)
library(stringr)
## dados para teste
toy_df <- tibble(motivo = c('danos morais', 'dano moral', 'danos materiais', 'dano x'))
## dados para facilitar a criação de vetores
make_vector <- function(string) stringr::str_split(string, pattern = ', ') %>% purrr::as_vector()
## a transformação em si
toy_df %>%
mutate(classe = case_when(
motivo %in% make_vector('dano moral, danos morais, danos whatever') ~ 'dano_moral',
motivo %in% make_vector('danos materiais, dano material') ~ 'dano_material',
TRUE ~ 'outros'
))
I prefer the use of dplyr::case_when
because it is vectorized and ends up processing voluminous data a little faster. The syntax, however strange, is more functional than the if_else
(where you would be repeating if_else for everything that is side).
The make_vector
function is there just to make it easier to create string vectors with a cleaner syntax. In a perfect world these vectors that served to put the motifs in the same classes would come out of an analysis of yours, or something as.