Let's say I have the following text vector (character):
d <- data.frame(id = 1:3,
                txt = c('Um gato e um cachorro',
                        'Cachorros jogam bola usando alpargatas',
                        'gatinhos cospem bolas de pêlos'),
                stringsAsFactors = FALSE)
I would like to add a boolean column to d that is TRUE if the text contains (gato or cachorro) and bola, that is, (cat or dog) and ball.
One alternative would be to create a column for each of these expressions and then combine them with a logical operation. Using the packages dplyr and stringr (note that I do not know much about regex, so the patterns came out big, ugly and inefficient, but that's not important):
library(dplyr)
library(stringr)
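d %>%
  # regex(..., ignore_case = TRUE) replaces the deprecated ignore.case()
  mutate(gato = str_detect(txt, regex('^gat[aio]| gat[aio]', ignore_case = TRUE)),
         cachorro = str_detect(txt, regex('cachor', ignore_case = TRUE)),
         bola = str_detect(txt, regex('bola', ignore_case = TRUE)),
         result = (gato | cachorro) & bola)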
Result:
  id                                    txt  gato cachorro  bola result
1  1                  Um gato e um cachorro  TRUE     TRUE FALSE  FALSE
2  2 Cachorros jogam bola usando alpargatas FALSE     TRUE  TRUE   TRUE
3  3         gatinhos cospem bolas de pêlos  TRUE    FALSE  TRUE   TRUE
Now generalizing the question: let's say I have a set of p regular expressions to be applied to a text vector of size n, and I want to create a boolean column that is the result of a logical operation on the detection of these expressions in the texts.
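To make the general setup concrete, here is a minimal sketch of the baseline with the patterns stored in a vector instead of one column each. The vector `patterns` and the matrix `hits` are hypothetical names introduced for illustration, and base R's `grepl()` stands in for `str_detect()` so the snippet runs without extra packages. It still scans the text once per pattern, so it does not reduce the number of passes; it only avoids writing each boolean variable by hand.

```r
d <- data.frame(id = 1:3,
                txt = c('Um gato e um cachorro',
                        'Cachorros jogam bola usando alpargatas',
                        'gatinhos cospem bolas de pêlos'),
                stringsAsFactors = FALSE)

# Hypothetical named vector holding the patterns (names are illustrative).
patterns <- c(gato     = '^gat[aio]| gat[aio]',
              cachorro = 'cachor',
              bola     = 'bola')

# Logical matrix with one row per text and one column per pattern.
# This is still one grepl() pass over the text for each pattern.
hits <- sapply(patterns, grepl, x = d$txt, ignore.case = TRUE)

# The logical rule is written once, against the matrix columns.
d$result <- (hits[, 'gato'] | hits[, 'cachorro']) & hits[, 'bola']
```

Keeping the detections in a matrix means the final rule can be changed without touching the detection step.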
I ask: is there a way to solve this without having to evaluate the text p times? That is, can I decrease the number of times I apply str_detect to my text?
The reason for the question is that i) both my p and my n are very large and ii) I would rather not explicitly write out a bunch of boolean variables.
An answer compatible with dplyr would be ideal, but it is not required. Thanks for any input!