Good evening, everyone. I'm just getting started with R and would like some help. I need to flag the rows that contain similar sentences, and I'm using the stringdist library for this. However, the comparison I get depends on the words being in the same position, and I would like to measure the similarity of the whole sentence regardless of word order. For example, in the result below, the 3rd row is the same sentence as the first two, just with the words in a different order, so it should be considered similar.
vet1 vet2 vet3
heber dos Santos araujo 0.0000000 0.0000000 0.3591486
heber dos Santos araujo 0.0000000 0.0000000 0.3591486
araujo Santos dos heber 0.3591486 0.3591486 0.0000000
heber dos s araujo 0.1372786 0.1372786 0.3955314
The code I'm using is:
library(stringdist)
library(dplyr)
dis<-read.csv2("C:/Users/heber.araujo/Desktop/Estudo Questões Duplicadas/exemploTeste.csv")
library(tm)
stp<-stopwords("portuguese") # List of common words (stopwords) to remove
dis$Nome<-as.character(dis$Nome) # Column to search
dis$Nome<-removeWords(dis$Nome,stp)
# Earlier attempt, left commented out:
#for(i in 1:nrow(dis)){
# dis_2<-strsplit(dis$text[i]," ") # this command splits the sentence into words
# dis_3<-unlist(dis_2)
#dis_3<-dis$GQUE_DS_ENUNCIADO
dis_3<-dis$Nome
res<-stringdistmatrix(dis_3,dis_3,method = "jw")
rownames(res)<-dis_3
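One idea I could try, as a rough sketch and not something I have validated on the real data: put the words of each sentence in alphabetical order before calling stringdistmatrix, so that two sentences with the same words in a different order become identical strings. The helper name sort_words and the small nomes vector below are made up for illustration, and lower-casing the text is an assumption on my part.

```r
library(stringdist)

# Hypothetical helper (not part of stringdist): lower-case each sentence,
# split it on whitespace, sort the words alphabetically and join them again,
# so that word order no longer affects the comparison.
sort_words <- function(x) {
  vapply(
    strsplit(tolower(trimws(x)), "\\s+"),
    function(words) paste(sort(words), collapse = " "),
    character(1)
  )
}

# Illustrative data taken from the example output above
nomes <- c("heber dos Santos araujo",
           "araujo Santos dos heber",
           "heber dos s araujo")

# Normalise word order, then compute the same Jaro-Winkler distance matrix
nomes_ord <- sort_words(nomes)
res_ord <- stringdistmatrix(nomes_ord, nomes_ord, method = "jw")
rownames(res_ord) <- nomes
```

With this, the first two names should end up with distance 0 between them, while the abbreviated one stays at a small non-zero distance. Another option that may be worth testing is one of the q-gram based methods in stringdist (for example method = "cosine" or method = "jaccard" with the q argument), since they compare sets of character sequences and are less sensitive to word position.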