Similarity of Texts


Good evening everyone. I am just starting out in R and need some help. I have to flag the rows that contain similar phrases, and I am using the stringdist library for this. However, the comparison I get depends on the words being in the same position, and I would like to measure the similarity of the whole phrase regardless of word order. For example, in the result below, the third row is the same phrase as the first two, only with the words in a different order, so it should count as similar.

                            vet1        vet2        vet3
  heber dos Santos araujo   0.0000000   0.0000000   0.3591486
  heber dos Santos araujo   0.0000000   0.0000000   0.3591486
  araujo Santos dos heber   0.3591486   0.3591486   0.0000000
  heber dos s araujo        0.1372786   0.1372786   0.3955314

The code I'm using is:

library(stringdist)
library(dplyr)
library(tm)

dis <- read.csv2("C:/Users/heber.araujo/Desktop/Estudo Questões Duplicadas/exemploTeste.csv")

stp <- stopwords("portuguese")  # list of common words to be removed

dis$Nome <- as.character(dis$Nome)  # column to search
dis$Nome <- removeWords(dis$Nome, stp)

# for (i in 1:nrow(dis)) {
#   dis_2 <- strsplit(dis$text[i], " ")  # this command splits the phrase into words
#   dis_3 <- unlist(dis_2)

# dis_3 <- dis$GQUE_DS_ENUNCIADO

dis_3 <- dis$Nome

res <- stringdistmatrix(dis_3, dis_3, method = "jw")

rownames(res) <- dis_3
    
asked by anonymous 18.01.2018 / 02:57

1 answer


I think the following code answers the question. First I'll read in the data, since we do not have access to the file exemploTeste.csv.

Nome <- scan(what = character(), text = "
'heber dos Santos araujo'
'heber dos Santos araujo'
'araujo Santos dos heber'
'heber dos s araujo'")

The distances are now computed by a function, heber, which first sorts the words within each name and then calls stringdistmatrix on the sorted names. That way, differences in word order disappear.
The original names are not changed.
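
To make the sorting step concrete, here is a minimal base R sketch of it on two of the names. The helper name sort_words is hypothetical and used only for illustration; it is not part of stringdist.

```r
# Hypothetical helper: put the words of each phrase in alphabetical order.
sort_words <- function(x) {
    # split on runs of whitespace, ignoring leading/trailing blanks
    words <- strsplit(trimws(x), "[[:space:]]+")
    # sort the words of each phrase and paste them back into one string
    vapply(words, function(w) paste(sort(w), collapse = " "), character(1))
}

a <- sort_words("heber dos Santos araujo")
b <- sort_words("araujo Santos dos heber")
identical(a, b)  # TRUE: after sorting, the two phrases are the same string
```

Since both phrases reduce to the same string, any string distance between them is zero.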

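heber <- function(x, method = "jw"){
    # split each name into words on runs of whitespace
    y <- strsplit(x, "[[:space:]]+")
    # sort the words of each name and paste them back together;
    # vapply also works when the names have different numbers of words
    y <- vapply(y, function(w) paste(sort(w), collapse = " "), character(1))
    stringdistmatrix(y, y, method = method)
}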

# stp is the Portuguese stopword list created in the question's code (needs tm)
Nome <- removeWords(Nome, stp)
dis_3 <- Nome

res <- heber(dis_3)
rownames(res) <- dis_3
res
#                          [,1]      [,2]      [,3]      [,4]
#heber  Santos araujo 0.0000000 0.0000000 0.0000000 0.0877193
#heber  Santos araujo 0.0000000 0.0000000 0.0000000 0.0877193
#araujo Santos  heber 0.0000000 0.0000000 0.0000000 0.0877193
#heber  s araujo      0.0877193 0.0877193 0.0877193 0.0000000
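
To actually flag the similar rows, which was the goal in the question, one option is to apply a cutoff to the distance matrix. The sketch below is only an illustration: the cutoff of 0.05 is an assumed value, not something from the question, and would need tuning on the real data. The matrix values are copied from the output above so the example runs on its own.

```r
# Distances copied from the output above (Jaro-Winkler on word-sorted names).
res <- matrix(c(0.0000000, 0.0000000, 0.0000000, 0.0877193,
                0.0000000, 0.0000000, 0.0000000, 0.0877193,
                0.0000000, 0.0000000, 0.0000000, 0.0877193,
                0.0877193, 0.0877193, 0.0877193, 0.0000000),
              nrow = 4, byrow = TRUE)

cutoff <- 0.05  # assumed threshold, chosen only for this illustration

# Keep each pair once (upper triangle only), which also drops the diagonal.
similar <- which(res < cutoff & upper.tri(res), arr.ind = TRUE)
similar  # one line per flagged pair: row index and column index
```

With this cutoff, rows 1, 2 and 3 are flagged as similar to each other, while row 4 ("heber s araujo") is not. A larger cutoff such as 0.1 would include it as well.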
    
18.01.2018 / 10:59