Classify a vector of texts using regular expressions in R


Let's say I have the following text vector (character), stored as a column of a data frame:

d <- data.frame(id=1:3, 
                txt=c('Um gato e um cachorro', 
                      'Cachorros jogam bola usando alpargatas', 
                      'gatinhos cospem bolas de pêlos'), stringsAsFactors=FALSE)

I would like to add a Boolean column to d that is TRUE if the text contains (gato or cachorro, i.e. cat or dog) and bola (ball).

One alternative would be to create a column for each of these expressions and then do a logical operation. Using the packages dplyr and stringr (note that I do not know much about regex, so the patterns came out big, ugly and inefficient, but that's not important):

library(dplyr)
library(stringr)
# regex(..., ignore_case = TRUE) replaces the older ignore.case() helper
d %>%
  mutate(gato=str_detect(txt, regex('^gat[aio]| gat[aio]', ignore_case=TRUE)),
         cachorro=str_detect(txt, regex('cachor', ignore_case=TRUE)),
         bola=str_detect(txt, regex('bola', ignore_case=TRUE)),
         result=(gato | cachorro) & bola)

Result:

  id                                    txt  gato cachorro  bola result
1  1                  Um gato e um cachorro  TRUE     TRUE FALSE  FALSE
2  2 Cachorros jogam bola usando alpargatas FALSE     TRUE  TRUE   TRUE
3  3         gatinhos cospem bolas de pêlos  TRUE    FALSE  TRUE   TRUE

Now generalizing the question: let's say I have a set of p regular expressions to be applied to a text vector of size n, and I want to create a Boolean column that is the result of a logical operation on the detection of these expressions in the texts.

I ask: is there a way to solve this without having to evaluate the text p times? That is, can I decrease the number of times I apply str_detect to my text?

The reason for the question is that i) both my p and my n are very large and ii) I do not want to explicitly write a bunch of Boolean variables.
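For reference, the generalized "evaluate the text p times" baseline can be sketched as a detection matrix. This is only a sketch in base R: the names `patterns`, `detected` and `result` are hypothetical, and the texts are ASCII-only variants of the example data.

```r
txt <- c("Um gato e um cachorro",
         "Cachorros jogam bola usando alpargatas",
         "gatinhos cospem bolas de pelos")

# Hypothetical named set of p patterns (here p = 3)
patterns <- c(gato     = "^gat[aio]| gat[aio]",
              cachorro = "cachor",
              bola     = "bola")

# n x p logical matrix: one grepl() pass over txt per pattern
detected <- sapply(patterns, grepl, x = txt, ignore.case = TRUE)

# Logical operation on the columns: (gato OR cachorro) AND bola
result <- (detected[, "gato"] | detected[, "cachorro"]) & detected[, "bola"]
result
# FALSE TRUE TRUE
```

This avoids writing one variable per pattern, but it still scans the texts once per pattern.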

An answer compatible with dplyr would be ideal, but it is not required. Thanks for any input!

    
asked by anonymous 03.10.2014 / 21:18

1 answer


@JulioTrecenti, yes, there is a way: include the logical tests in the regex itself. Note that the pipe (|) is equivalent to an OR operation, but to implement the AND operation we need lookahead. Another point is that I will use grepl() instead of str_detect(), because in this case I simply want to evaluate a regular expression with a binary response, and grepl() already does the job. Assuming the data.frame was created as in your question, the following code solves the problem in one line:

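# The alternation is grouped so that .* applies to both alternatives;
# txt is referenced directly inside mutate() instead of d$txt
d %>% 
  mutate(result = grepl('(?=.*(gat[aio]|[cC]achor))(?=.*bola)', txt, perl = TRUE))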

  id                                    txt result
1  1                  Um gato e um cachorro  FALSE
2  2 Cachorros jogam bola usando alpargatas   TRUE
3  3         gatinhos cospem bolas de pêlos   TRUE

Note that (?=something)(?=something else) says: look ahead for "something", then go back to the same position and look ahead for "something else". Both lookaheads must succeed, which gives the AND. Here something = gato OR cachorro, where the OR is represented by the pipe. Note also the option perl = TRUE in grepl(), which tells R to use Perl-compatible regular expressions. Without it, lookahead does not work.
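To extend the same idea to many expressions, the lookahead pattern can be assembled programmatically. The sketch below is not part of the original answer: `build_pattern` is a hypothetical helper, each list element is treated as a group of alternatives (OR), and the groups are combined with AND. It assumes case-insensitive matching via the inline (?i) flag and anchors at the start of the string so that word order does not matter.

```r
# Hypothetical helper: builds one PCRE pattern from a list of groups.
# Alternatives inside a group are OR-ed; the groups are AND-ed via lookaheads.
build_pattern <- function(groups) {
  lookaheads <- vapply(groups, function(alts) {
    paste0("(?=.*(?:", paste(alts, collapse = "|"), "))")
  }, character(1))
  paste0("(?i)^", paste(lookaheads, collapse = ""))
}

# ASCII-only test strings based on the example data, plus one where
# "bola" comes before "cachorro"
txt <- c("Um gato e um cachorro",
         "Cachorros jogam bola usando alpargatas",
         "gatinhos cospem bolas de pelos",
         "A bola e do cachorro")

pattern <- build_pattern(list(c("gat[aio]", "cachor"), "bola"))
pattern
# "(?i)^(?=.*(?:gat[aio]|cachor))(?=.*(?:bola))"

grepl(pattern, txt, perl = TRUE)
# FALSE TRUE TRUE TRUE
```

Each text is then passed to grepl() once, although the regex engine still performs one lookahead scan per group internally, so whether this is faster than separate calls is worth measuring on the real data.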

    
03.10.2014 / 22:43