Compare fields in two data sets


I have two data sets read from CSV files with pandas. Each set has a single column, CPF Favorecido, with millions of records, and each set covers one month. I need to find out which records (CPF numbers) are in one data set but not in the other.

The code looks like this:

import pandas

atual = pandas.read_csv(arquivo_atual, header=0, delimiter='\t', quotechar='"', usecols=['CPF Favorecido'])
seguinte = pandas.read_csv(arquivo_seguinte, header=0, delimiter='\t', quotechar='"', usecols=['CPF Favorecido'])

I only need the count of the CPFs that appear in the atual file but not in the seguinte file, and vice versa.

Is there a function that counts these records? Or do I need to build a loop and do the comparison one by one?

asked by anonymous 24.05.2016 / 19:11

1 answer


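No loop is needed. One way I know of, using pandas, is to build a boolean mask with isin, negate it with ~, and count the rows that survive where (rows that fail the condition become NaN, and count ignores NaN):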

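# rows of atual whose CPF does not appear in seguinte
atual.where(~atual['CPF Favorecido'].isin(seguinte['CPF Favorecido'])).count()
# rows of seguinte whose CPF does not appear in atual
seguinte.where(~seguinte['CPF Favorecido'].isin(atual['CPF Favorecido'])).count()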
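
Note that the isin approach counts rows, so a CPF repeated within one month is counted once per occurrence. If what you want is the number of distinct CPFs, a set difference may be a better fit. The following is only a minimal sketch under assumptions: the two small data frames are hypothetical sample data standing in for your files, and the variable names (col, linhas_so_atual, so_atual, so_seguinte) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the two monthly files.
atual = pd.DataFrame({'CPF Favorecido': ['111', '111', '222', '333']})
seguinte = pd.DataFrame({'CPF Favorecido': ['222', '444']})

col = 'CPF Favorecido'

# Row count: rows of atual whose CPF is absent from seguinte.
# Summing the boolean mask counts the True values; duplicates count per row.
linhas_so_atual = int((~atual[col].isin(seguinte[col])).sum())

# Distinct count: CPFs present in one month but not in the other.
cpfs_atual = set(atual[col].dropna())
cpfs_seguinte = set(seguinte[col].dropna())
so_atual = len(cpfs_atual - cpfs_seguinte)
so_seguinte = len(cpfs_seguinte - cpfs_atual)

print(linhas_so_atual, so_atual, so_seguinte)
```

With the sample data, '111' appears twice in atual, so the row count and the distinct count differ. Which of the two you want depends on whether repeated CPFs within a month should be counted once or several times.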
24.05.2016 / 21:46