Comparing similarity in two arrays!

Good evening, everyone. I'm stuck on a problem and hope someone can help me.

$array1 = array();
$array2 = array();
foreach ($paper as $p)
{
    $array1[] = $p->title;
    $array2[] = $p->title;
}

$nbarray1 = count($array1);
$stringSimilarity = 0;

foreach ($array1 as $word1)
{
    $max = null;
    $similarity = null;
    foreach ($array2 as $word2)
    {
        similar_text($word1, $word2, $similarity);
        if ($similarity > $max)
        {
            $max = $similarity;
        }
    }
    $stringSimilarity += $max;
    $resultado = $stringSimilarity / $nbarray1;

    if ($resultado > 90)
    {
        echo '<b>Título 1:</b> ' . $word1 . ' <br><b>Título 2:</b> ' . $word2 . ' <b><br>Resultado: POSSIVELMENTE DUPLICADO - Porcentagem = ' . number_format((float)$resultado, 0, '.', '') . '%<br></b>';
    }
    else
    {
        echo '<b>Título 1:</b> ' . $word1 . ' <br><b>Título 2:</b> ' . $word2 . ' <b><br>Resultado: NÃO DUPLICADO - Porcentagem = ' . number_format((float)$resultado, 0, '.', '') . '%<br></b>';
    }

}

This code produces the following output:

Título 1: A new method for SSD black-box performance test 
Título 2: Novel Solution for the Built-in Gate Oxide Stress Test of LDMOS in Integrated Circuits for Automotive Applications 
Resultado: NÃO DUPLICADO - Porcentagem = 25%
Título 1: Structural Health Monitoring of a rotor blade during statical load test 
Título 2: Novel Solution for the Built-in Gate Oxide Stress Test of LDMOS in Integrated Circuits for Automotive Applications 
Resultado: NÃO DUPLICADO - Porcentagem = 50%
Título 1: Using TTCN-3 in Performance Test for Service Application 
Título 2: Novel Solution for the Built-in Gate Oxide Stress Test of LDMOS in Integrated Circuits for Automotive Applications 
Resultado: NÃO DUPLICADO - Porcentagem = 75%
Título 1: Novel Solution for the Built-in Gate Oxide Stress Test of LDMOS in Integrated Circuits for Automotive Applications 
Título 2: Novel Solution for the Built-in Gate Oxide Stress Test of LDMOS in Integrated Circuits for Automotive Applications 
Resultado: POSSIVELMENTE DUPLICADO - Porcentagem = 100%
  • Note that I have only 4 titles registered. How do I stop a title from being compared with itself?
  • Note that only one title ends up being tested against all the others. The correct behavior would be to restart the loop with the next title, and so on, until every title has been tested against every other.
  • $paper is an array of objects like this:

    [0] =>
      object(stdClass)#97 (26) {
        ["paper_id"] => string(1) "1"
        ["title"] => string(47) "A new method for SSD black-box performance test"
        ["author"] => string(6) "Q. Xie"

  • After checking for duplicates, would it be possible to update the array of paper objects? Each one has a ["status"] => field, and I would like to set it to "duplicate" for the papers found by the check above.

  • If someone with a lot of patience can help me think through the logic, I'll already be happy. I'm still learning, but I'm trying to improve :D

        
    asked by anonymous 23.07.2018 / 06:19

    1 answer

    You can stop a title from being compared with itself by putting an if() inside the second foreach():

    foreach ($array2 as $word2) {
        // Only compare when word1 differs from word2; identical strings are skipped
        if ($word1 != $word2) {
            similar_text($word1, $word2, $similarity);
            if ($similarity > $max) {
                $max = $similarity;
            }
        }
    }
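
One caveat with comparing the strings themselves: two different papers that have exactly the same title would be skipped as well, and those are probably the duplicates you most want to catch. A minimal sketch of skipping by array position instead, assuming a single zero-indexed list of titles (the `$titles` array and the `bestMatch` helper are hypothetical names, not part of the question's code):

```php
<?php
// Hypothetical helper: finds the most similar *other* title for the title at $i.
function bestMatch(array $titles, int $i): array
{
    $best = ['index' => null, 'percent' => 0.0];
    foreach ($titles as $j => $other) {
        if ($j === $i) {
            continue; // skip the title itself, by position rather than by value
        }
        // similar_text() writes the similarity percentage into $percent
        similar_text($titles[$i], $other, $percent);
        if ($percent > $best['percent']) {
            $best = ['index' => $j, 'percent' => $percent];
        }
    }
    return $best;
}

$titles = ['Alpha test', 'Beta report', 'Alpha test'];
$match = bestMatch($titles, 0);
// The identical title stored at position 2 is still found as the best match.
echo $match['index'], ' ', round($match['percent']), '%', PHP_EOL;
```

Returning the index of the best match also lets you print the title that actually matched, rather than whatever `$word2` happens to hold after the inner loop ends.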
    

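    The setup at the start can also be written less repetitively. Instead of:

    $array1 = array();
    $array2 = array();
    foreach ($paper as $p) {
        $array1[] = $p->title;
        $array2[] = $p->title;
    }
    

    you can drop the explicit initialization and write just:

    foreach ($paper as $p) {
        $array1[] = $p->title;
        $array2[] = $p->title;
    }

    PHP creates each array on the first [] append. Note, though, that if $paper is empty the two variables are never defined, so keep the initialization if that can happen.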
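
For the other points in the question (testing every title against every other one, then flagging the paper objects), here is a rough sketch. It assumes each paper is an object with a `title` property and that writing to a `status` property is acceptable. The `markDuplicates` name, the `'duplicate'` status value, and flagging both papers of a matching pair are illustrative choices; only the `> 90` threshold comes from the question.

```php
<?php
// Hypothetical helper: flags papers whose title is more than $threshold percent
// similar to the title of any other paper in the list.
function markDuplicates(array $papers, float $threshold = 90.0): array
{
    $papers = array_values($papers); // reindex 0..n-1 so the index loops are safe
    $count = count($papers);
    for ($i = 0; $i < $count; $i++) {
        // Starting at $i + 1 tests each pair once and never a title against itself.
        for ($j = $i + 1; $j < $count; $j++) {
            similar_text($papers[$i]->title, $papers[$j]->title, $percent);
            if ($percent > $threshold) {
                // Objects are passed as handles, so the caller's objects are updated too.
                $papers[$i]->status = 'duplicate';
                $papers[$j]->status = 'duplicate';
            }
        }
    }
    return $papers;
}

$papers = [
    (object) ['title' => 'A new method for SSD black-box performance test', 'status' => ''],
    (object) ['title' => 'Using TTCN-3 in Performance Test for Service Application', 'status' => ''],
    (object) ['title' => 'A new method for SSD black-box performance test', 'status' => ''],
];
$papers = markDuplicates($papers);
foreach ($papers as $p) {
    echo $p->title, ' => ', $p->status === 'duplicate' ? 'duplicate' : 'unique', PHP_EOL;
}
```

Scoring each pair on its own also avoids two effects visible in the posted output: `$word2` always showing the last title once the inner loop has finished, and the percentage being a running average (25%, 50%, 75%, 100%) instead of the similarity of the two titles printed.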
    
        
    23.07.2018 / 06:27