An algorithm to find duplicate files?

4

I need to write a program that finds duplicate files on my computer, so the user can decide what to do with them (e.g. delete the copies). For now, I only care about a binary comparison between files (that is, a file counts as a duplicate only if it is 100% identical to another).

I know that searching only for the filename is insufficient, since the same file may have been saved under another name.

Is there an algorithm for comparing files?

I imagine that generating the checksum of every file and comparing all of them against each other is wasteful, because it is not normal to have that many duplicate files. I also assume that file size alone is not enough. And there may be cases where a file is duplicated more than once.

    
asked by anonymous 30.01.2014 / 13:18

4 answers

3

Step by step:

  • list every file with its basic information: location (disk / directory), name, date and size;
  • group the files that have the same name (exactly the same, including uppercase and lowercase);
  • likewise, group the files that have the same size (in bytes);
  • drop the "non-repeated" files from the list (those whose name and size match no other file);
  • select the "repeated level 1" files (same name, size and date), compute a checksum within each group, and mark the ones that are REALLY equal;
  • select the "repeated level 2" files (same name or same size, and same date), compute a checksum within each group, and mark the ones that are REALLY equal;
  • select the "repeated level 3" files (same name and size, but a different date), compute a checksum within each group, and mark the ones that are REALLY equal;
  • select the "repeated level 4" files (same name or same size, but a different date), compute a checksum within each group, and mark the ones that are REALLY equal;
  • for the files that are REALLY equal, present each group to the user, who decides which ones to delete;
  • I suggest adding a few options: let the user open the location of each file; open the file in the default editor to view its content; move the "chosen / repeated" file to a specific folder.

    An option that I think would be very useful: when a single file is selected (in Windows, for example), the tool could be started from the context menu (right-click) to find any duplicates of the selected file.

    Just keep in mind that the contents of compressed archives are ignored: if the duplicate file is inside a ZIP / RAR, it will never be evaluated and therefore will never be reported as a duplicate (put this in the usage instructions of your future application). And then send me a copy to test ;-)
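The filtering idea above can be sketched in Python roughly as follows. This is only a minimal sketch under a few assumptions: it filters on size alone (name and date are skipped, since the question notes that a copy may have been saved under another name), it uses SHA-256 as the checksum, and the names `file_digest` and `find_duplicate_groups` are made up for the illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_groups(root):
    """Return lists of paths under `root` that share both size and digest."""
    # Step 1: group every regular file by its size in bytes.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Step 2: checksum only the files whose size is shared with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return groups
```

Each returned list is one group of candidates to show to the user. Files with a unique size are never read, which is where most of the saving comes from.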

        
    30.01.2014 / 13:52
    1

    List all files.

    For each file, follow the steps below:

  • Generate a hash from the contents of the file and store it in a hash table;

  • If the hash is already in the table (a collision), compare the file byte by byte with the files that have the same hash. If they are identical, you have found a duplicate.
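A minimal Python sketch of these steps, assuming SHA-256 as the hash and the standard library's `filecmp.cmp(..., shallow=False)` for the byte-by-byte check. The function names `content_hash` and `find_duplicates` are hypothetical.

```python
import filecmp
import hashlib
from collections import defaultdict


def content_hash(path, chunk_size=1 << 20):
    """Hash the file contents in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


def find_duplicates(paths):
    """Return (duplicate, first_seen) pairs for the given file paths."""
    table = defaultdict(list)  # digest -> distinct files seen with that digest
    duplicates = []
    for path in paths:
        bucket = table[content_hash(path)]
        for seen in bucket:
            # Same hash: confirm with a full content comparison.
            if filecmp.cmp(seen, path, shallow=False):
                duplicates.append((path, seen))
                break
        else:
            # New content (or a genuine hash collision): remember it.
            bucket.append(path)
    return duplicates
```

With a cryptographic hash a genuine collision is extremely unlikely, so the byte-by-byte step mostly serves as a safety net before anything is deleted.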

    30.01.2014 / 14:36
    0

    I do not think it's possible. You would have to compare all the files to each other, and the program's runtime would grow quadratically with the number of files.

    You can do something that starts by enumerating the files. Then each file would have to be compared to all the others. An optimization would be to compare the size first, then perhaps a checksum, and then, if they are still the same, compare byte by byte.

    For a few files it will work fine, but as the number of files increases the execution time of the algorithm will rise rapidly to impractical scales.
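To get a feel for the numbers, here is a small Python sketch that counts the comparisons. Comparing every file against every other needs n(n-1)/2 comparisons for n files. If only files of equal size are compared, the count drops to the sum of that formula over each size group. The function names are made up for the illustration.

```python
from collections import Counter


def pairwise_comparisons(n):
    """Number of distinct pairs among n files: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def comparisons_after_size_filter(sizes):
    """Pairs left to compare when only files of equal size are paired."""
    return sum(pairwise_comparisons(k) for k in Counter(sizes).values())
```

For 1,000 files the all-pairs approach needs 499,500 comparisons. For six files with sizes 10, 10, 20, 30, 30, 30 it needs 15, while the size filter leaves only 4.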

        
    30.01.2014 / 13:33
    0

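    You can use PHP's basename function:

    $inicio = "file:///C://";    // You can change the path to point at other folders.
    $arquivo = basename($inicio);
    $file = basename($inicio, "Nome");

    // Return the text between $deliLeft and $deliRight (case-insensitive),
    // or "" if either delimiter is not found.
    function stribet($inputstr, $deliLeft, $deliRight) {
        $posLeft = stripos($inputstr, $deliLeft);
        if ($posLeft === false) {
            return "";
        }
        $posLeft += strlen($deliLeft);
        $posRight = stripos($inputstr, $deliRight, $posLeft);
        if ($posRight === false) {
            return "";
        }
        return substr($inputstr, $posLeft, $posRight - $posLeft);
    }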
    

    Read the content:

    $res = file_get_contents($inicio);
    

    Find:

    $x = @stribet($res, $file, '[1]');
    

    You should get files with [1]:

    $d = '$this->file($x)';
    

    Check whether the file has [1]:

    if ($file == "'.$d.'"){
    }
    

    It may not be accurate, or it may not work at all; if it does not work, just let me know. This approach only catches files with [1] in the name.

        
    30.01.2014 / 13:40