Compare all rows and columns of two Dfs, update and sum the differences at the end of rows and columns

Question

Compare all rows and columns of two Dfs, update and sum the differences at the end of rows and columns

Navigation

#1 by (2 votes)

1

I have two csv separated by tab. Both have the same number of rows and columns. The first column POS has the same unique values in both dfs. The differences (or not) occur in the values (strings) of columns col1:col4 .

I thought of doing a query, something like: query = subset_pl(subset_pl.isin(subset_ad)) and from that continue with the code, but I got stuck there ...

import pandas as pd

subset_ad = pd.read_csv('subset_ad.csv', sep='\t')
subset_ad.set_index('POS')
subset_ad

POS         col1    col2    col3    col4
28355991    A       A       A       A              
28356037    A       A       A       A              
28356130    A       A       A       A              
28356246    A       A       A       A   



subset_pl = pd.read_csv('subset_pl.csv', sep='\t')
subset_pl.set_index('POS')
subset_pl

POS         col1    col2    col3    col4
28355991    A       B       A       A            
28356037    B       B       B       A              
28356130    A       B       A       A              
28356246    A       A       B       A

What I want to achieve is: compare subset_ad with subset_pl , update subset_ad with value of subset_pl keeping subset_ad separated by , (A, B for example), if there are different values and count these differences in both rows and columns by adding a row and a plus column ( cont_col , cont_row ) to display the count of cells that have changed ...

The output would be something like:

subset_ad

POS         col1    col2    col3    col4    cont_row
28355991    A       A,B     A       A       1       
28356037    A,B     A,B     A,B     A       3       
28356130    A       A,B     A       A       1       
28356246    A       A       A,B     A       1       
cont_col    1       3       2       0

Any directions will be welcome!

python csv pandas

asked by anonymous 08.03.2017 / 14:40

1 answer

How to stratify / divide a data.frame according to categories of a variable in R? Forearch loop how to use with controler

score 2 · Accepted Answer

You can do this:

Read both files in different DataFrame's.

Concatenate the DataFrames directly, using the + operator. You can include a string with the comma in the middle. This operator will execute for each "cell" of the table.

Iterate over the rows and columns accounting for the differences. Where there are no differences (and if desired), merge the letters into one.

Code sample:

import pandas as pd

# Lê o primeiro arquivo
subset_ad = pd.read_csv('subset_ad.csv', sep='\t', index_col=False)
subset_ad = subset_ad.set_index('POS')

print('subset_ad:')
print('-' * 20)
print(subset_ad)
print('-' * 20)

# Lê o segundo arquivo
subset_pl = pd.read_csv('subset_pl.csv', sep='\t', index_col=False)
subset_pl = subset_pl.set_index('POS')

print('\n')
print('subset_pl:')
print('-' * 20)
print(subset_pl)
print('-' * 20)

# Concatena os arquivos separando o conteúdo por uma vírgula
df = subset_ad + ',' + subset_pl

# Adiciona a coluna e a linha de somatório
df['cont_row'] = [0 for _ in range(len(df))]
df.loc['cont_col'] = [0 for _ in df.columns[:-1]] + ['']

print('\n')
print('subset_ad + subset_pl:')
print('-' * 20)
print(df)
print('-' * 20)

# Contabiliza as diferenças (unificando a letra onde não houver diferença)
for index, row in df.iterrows():
    if index != 'cont_col':
        for col in df.columns[:-1]:
            val = row[col]
            letters = val.split(',')
            if letters[0] == letters[1]:
                df.set_value(index, col, letters[0])
            else:
                cnt = df.loc[index]['cont_row']
                df.set_value(index, 'cont_row', cnt + 1)

                cnt = df.loc['cont_col'][col]
                df.set_value('cont_col', col, cnt + 1)

print('\n')
print('resultado final:')
print('-' * 20)
print(df)
print('-' * 20)

Result of this code:

subset_ad:
--------------------
         col1 col2 col3 col4
POS
28355991    A    A    A    A
28356037    A    A    A    A
28356130    A    A    A    A
28356246    A    A    A    A
--------------------


subset_pl:
--------------------
         col1 col2 col3 col4
POS
28355991    A    B    A    A
28356037    B    B    B    A
28356130    A    B    A    A
28356246    A    A    B    A
--------------------


subset_ad + subset_pl:
--------------------
         col1 col2 col3 col4 cont_row
POS
28355991  A,A  A,B  A,A  A,A        0
28356037  A,B  A,B  A,B  A,A        0
28356130  A,A  A,B  A,A  A,A        0
28356246  A,A  A,A  A,B  A,A        0
cont_col    0    0    0    0
--------------------


resultado final:
--------------------
         col1 col2 col3 col4 cont_row
POS
28355991    A  A,B    A    A        1
28356037  A,B  A,B  A,B    A        3
28356130    A  A,B    A    A        1
28356246    A    A  A,B    A        1
cont_col    1    3    2    0
--------------------