How to create a stopword list in R

5

Hi,

I need to do a task and I can't work out the logic for it.

My scenario: I have a DF with several columns, and I need to "read column 3", identify the words and classify each row.

Example:

DF 

nome      rua    funcao
alberto   assis  programador
elisa     cons   enfermeira
pedro     assis  prog.

I want to "read column 3" and, whenever I find "programador", "prog." or something similar, put "Python" in a new column "classificacao" (and "outros" otherwise). The DF would then look like this:

DF

nome      rua    funcao        classificacao
alberto   assis  programador   Python
elisa     cons   enfermeira    outros
pedro     assis  programador.  Python

Can you tell me whether creating a stopword list is the best way to solve this?

    
asked by anonymous 28.09.2018 / 19:07

2 answers

4

Here is one way to do this, though probably not the most efficient one.

dataset = read.table(text = 'nome      rua    funcao
                             alberto   assis  programador
                             elisa     cons   enfermeira
                             pedro     assis  prog.', header = TRUE)


palavras_similares = c("prog.", "Prog", "programador", "Programador", "programador.", "Programador.")

# Positions of the rows whose funcao appears in the list of similar words
# (which() + %in% also catches repeated values, unlike match())
indice = which(dataset$funcao %in% palavras_similares)

# Auxiliary vector, defaulting to "Outros"
classificacao = rep("Outros", nrow(dataset))

# Replace the value at the positions that matched
classificacao[indice] = "Python"

# Assign the vector to the data frame
dataset$classificacao = classificacao

dataset
#     nome   rua      funcao classificacao
#1 alberto assis programador        Python
#2   elisa  cons  enfermeira        Outros
#3   pedro assis       prog.        Python

Another way, using the dplyr package:

library(dplyr)
(dataset <- mutate(dataset, classificacao = ifelse(funcao %in% palavras_similares, "Python", "Outros")))

#     nome   rua      funcao classificacao
#1 alberto assis programador        Python
#2   elisa  cons  enfermeira        Outros
#3   pedro assis       prog.        Python
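
As a small check of the %in% approach without dplyr, the sketch below uses a hypothetical data frame `exemplo` (the name and the extra row "carla" are assumptions made for illustration, not part of the question's data). It suggests that every row with a matching job title gets labelled, including repeated titles.

```r
# Hypothetical data: same kind of columns as above, plus a repeated "programador" row
exemplo <- data.frame(
  nome   = c("alberto", "elisa", "pedro", "carla"),
  funcao = c("programador", "enfermeira", "prog.", "programador"),
  stringsAsFactors = FALSE
)

palavras_similares <- c("prog.", "Prog", "programador", "Programador",
                        "programador.", "Programador.")

# %in% returns one TRUE/FALSE per row, so repeated titles are all caught
exemplo$classificacao <- ifelse(exemplo$funcao %in% palavras_similares,
                                "Python", "Outros")

exemplo$classificacao
# [1] "Python" "Outros" "Python" "Python"
```

The list still has to spell out every variant exactly, so this works best when the set of spellings is small and known in advance.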
    
28.09.2018 / 20:38
6

To complement the answer from @Thiago Fernandes, you can find similar patterns using the grep() function:

dataset[grep('prog', dataset$funcao), 'funcao']
# [1] programador prog.

The function grep() returns the positions of the matching elements, while grepl() returns a logical vector (TRUE or FALSE for each element):

dplyr::mutate(dataset, classificacao = ifelse(grepl('prog', dataset$funcao), "Python", "Outros"))
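
To make the difference concrete, here is a minimal sketch on a plain character vector (the vector `funcao` and its values are made up for illustration). The anchor `^` and `ignore.case = TRUE` are optional choices added here, so that "prog" is matched only at the start of the string and capitalised variants such as "Programador" are also caught.

```r
# Hypothetical job titles, including a capitalised variant
funcao <- c("programador", "enfermeira", "prog.", "Programador")

# grep() returns the positions of the matching elements
posicoes <- grep("^prog", funcao, ignore.case = TRUE)
posicoes
# [1] 1 3 4

# grepl() returns one logical value per element
encontrado <- grepl("^prog", funcao, ignore.case = TRUE)
encontrado
# [1]  TRUE FALSE  TRUE  TRUE

# The logical vector plugs straight into ifelse()
classificacao <- ifelse(encontrado, "Python", "Outros")
classificacao
# [1] "Python" "Outros" "Python" "Python"
```

Without `ignore.case = TRUE` the pattern is case-sensitive, so "Programador" would fall into "Outros".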
    
28.09.2018 / 21:42