diff mixes up modified and new rows

I'm comparing two files, which are updated daily, with the diff -y command in order to get two results:

The first is the set of lines that were modified from one day to the next:

grupoAzul;Gabriel;04-maçãs;02-limões       |    grupoAzul;Gabriel;05-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões             grupoAzul;Amanda;03-maçãs;05-limões

For this, I use the command diff -y arquivoAntigo.csv arquivoNovo.csv | grep -e "|"

The second is the set of newly added lines:

grupoAzul;Gabriel;04-maçãs;02-limões       |    grupoAzul;Gabriel;05-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões             grupoAzul;Amanda;03-maçãs;05-limões
                                           >    grupoAzul;Kratos;04-maçãs;00-limões

For this, I use the command diff -y arquivoAntigo.csv arquivoNovo.csv | grep -e ">"

With that explained, here is the error.

When a new line appears above a modified line, diff 'pushes' the modified line down and reports it as the new line, while the line that is actually new is reported as the modified one.

grupoAzul;Gabriel;04-maçãs;02-limões       |    grupoAzul;Kratos;04-maçãs;00-limões
                                           >    grupoAzul;Gabriel;05-maçãs;02-limões
grupoAzul;Amanda;03-maçãs;05-limões             grupoAzul;Amanda;03-maçãs;05-limões

These cases are rare, but when they happen more than one line is affected.

What causes this bug and how can I fix it?

    
asked by anonymous 31.07.2018 / 17:58

1 answer

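The problem arises because matching records do not end up in the same position in both files. diff works on whole lines, in file order, and knows nothing about which record a line belongs to. In the example you showed, the only line common to both files is Amanda's, so diff sees one old line (Gabriel) replaced by two new lines (Kratos and Gabriel). In side-by-side mode it pairs the old line with the first of them and marks that pair with "|", and the remaining line is reported as new with ">".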
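To make the cause visible, here is a minimal sketch that rebuilds the two files from the question in a temporary directory and runs plain diff (without -y) on them. It assumes GNU diff and mktemp; the file contents are copied from the example in the question, and the variable names tmpdir and hunks are only illustrative.

```shell
# Throwaway copies of the two files, with the contents from the question's example.
tmpdir=$(mktemp -d)
printf '%s\n' \
  'grupoAzul;Gabriel;04-maçãs;02-limões' \
  'grupoAzul;Amanda;03-maçãs;05-limões' > "$tmpdir/arquivoAntigo.csv"
printf '%s\n' \
  'grupoAzul;Kratos;04-maçãs;00-limões' \
  'grupoAzul;Gabriel;05-maçãs;02-limões' \
  'grupoAzul;Amanda;03-maçãs;05-limões' > "$tmpdir/arquivoNovo.csv"

# diff exits with status 1 when the files differ, hence the "|| true".
hunks=$(diff "$tmpdir/arquivoAntigo.csv" "$tmpdir/arquivoNovo.csv" || true)
printf '%s\n' "$hunks"

rm -rf "$tmpdir"
```

With GNU diff the hunk header should read 1c1,2, that is, old line 1 changed into new lines 1 to 2. diff -y only renders that same hunk in two columns, which is why the old Gabriel line ends up next to Kratos.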

To avoid this, use sort so that the records appear in the same relative order in both files:

$ diff -y <(sort arquivoAntigo.csv) <(sort arquivoNovo.csv)
                                          <
grupoAzul;Amanda;03-maçãs;05-limões         grupoAzul;Amanda;03-maçãs;05-limões
grupoAzul;Gabriel;04-maçãs;02-limões      | grupoAzul;Gabriel;05-maçãs;02-limões
                                          > grupoAzul;Kratos;04-maçãs;00-limões

However, as you can see, a blank line in the first file is sorted to the top, so I suggest removing blank lines with sed:

$ diff -y <(sort arquivoAntigo.csv | sed '/^\s*$/d') <(sort arquivoNovo.csv | sed '/^\s*$/d')
grupoAzul;Amanda;03-maçãs;05-limões         grupoAzul;Amanda;03-maçãs;05-limões
grupoAzul;Gabriel;04-maçãs;02-limões      | grupoAzul;Gabriel;05-maçãs;02-limões
                                          > grupoAzul;Kratos;04-maçãs;00-limões

The regular expression used in sed (/^\s*$/) matches every line that is empty or contains only whitespace characters, such as spaces and tabs, and the d command deletes those lines.
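A quick way to check the expression on its own is to feed sed a few hand-made lines. Note that \s is a GNU sed extension; the sketch below uses the POSIX class [[:space:]] instead, which should behave the same way here. The sample input and the variable name cleaned are made up for illustration.

```shell
# Input: a data line, an empty line, a line with only spaces and a tab, another data line.
cleaned=$(printf 'a\n\n  \t \nb\n' | sed '/^[[:space:]]*$/d')

# Only the two data lines survive.
printf '%s\n' "$cleaned"
```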

By the way, the <( ... ) notation in bash is called process substitution: the command inside the parentheses runs in a subshell, and its output is handed to the outer command as if it were a file. So when the diff above runs, each sort ... | sed ... pipeline is executed and diff receives the already cleaned-up data as two file names to compare.
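For shells without <( ... ), or just to see what it does, the same comparison can be spelled out with explicit temporary files. This is only an approximation (bash normally uses a pipe or a /dev/fd entry, not a regular file), and it assumes GNU diff and mktemp. The sample data follows the question's example, with a blank line added to the old file; workdir, antigo.sorted, novo.sorted and side_by_side are illustrative names.

```shell
# Sample data standing in for the two daily CSV files (the old one has a blank line).
workdir=$(mktemp -d)
printf '%s\n' \
  '' \
  'grupoAzul;Gabriel;04-maçãs;02-limões' \
  'grupoAzul;Amanda;03-maçãs;05-limões' > "$workdir/arquivoAntigo.csv"
printf '%s\n' \
  'grupoAzul;Kratos;04-maçãs;00-limões' \
  'grupoAzul;Gabriel;05-maçãs;02-limões' \
  'grupoAzul;Amanda;03-maçãs;05-limões' > "$workdir/arquivoNovo.csv"

# Step 1: what each <( ... ) does, written out by hand.
sort "$workdir/arquivoAntigo.csv" | sed '/^[[:space:]]*$/d' > "$workdir/antigo.sorted"
sort "$workdir/arquivoNovo.csv"   | sed '/^[[:space:]]*$/d' > "$workdir/novo.sorted"

# Step 2: diff only ever sees two file names ("|| true" because diff exits 1 on differences).
side_by_side=$(diff -y "$workdir/antigo.sorted" "$workdir/novo.sorted" || true)
printf '%s\n' "$side_by_side"

rm -rf "$workdir"
```

The grep filters from the question can then be applied to this output as before, for example printf '%s\n' "$side_by_side" | grep -e "|" for the modified lines.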

You can see it working online on tutorialspoint. Since it does not seem possible to create files there, I had to use variables to "simulate" them:

    
01.08.2018 / 04:01