Compare all rows and columns of two Dfs, update and sum the differences at the end of rows and columns

1

I have two csv separated by tab. Both have the same number of rows and columns. The first column POS has the same unique values in both dfs. The differences (or not) occur in the values (strings) of columns col1:col4 .

I thought of doing a query, something like: query = subset_pl(subset_pl.isin(subset_ad)) and from that continue with the code, but I got stuck there ...

import pandas as pd

subset_ad = pd.read_csv('subset_ad.csv', sep='\t')
subset_ad.set_index('POS')
subset_ad

POS         col1    col2    col3    col4
28355991    A       A       A       A              
28356037    A       A       A       A              
28356130    A       A       A       A              
28356246    A       A       A       A   



subset_pl = pd.read_csv('subset_pl.csv', sep='\t')
subset_pl.set_index('POS')
subset_pl

POS         col1    col2    col3    col4
28355991    A       B       A       A            
28356037    B       B       B       A              
28356130    A       B       A       A              
28356246    A       A       B       A            

What I want to achieve is: compare subset_ad with subset_pl , update subset_ad with value of subset_pl keeping subset_ad separated by , (A, B for example), if there are different values and count these differences in both rows and columns by adding a row and a plus column ( cont_col , cont_row ) to display the count of cells that have changed ...

The output would be something like:

subset_ad

POS         col1    col2    col3    col4    cont_row
28355991    A       A,B     A       A       1       
28356037    A,B     A,B     A,B     A       3       
28356130    A       A,B     A       A       1       
28356246    A       A       A,B     A       1       
cont_col    1       3       2       0      

Any directions will be welcome!

    
asked by anonymous 08.03.2017 / 14:40

1 answer

2

You can do this:

  • Read both files in different DataFrame's.
  • Concatenate the DataFrames directly, using the + operator. You can include a string with the comma in the middle. This operator will execute for each "cell" of the table.
  • Iterate over the rows and columns accounting for the differences. Where there are no differences (and if desired), merge the letters into one.
  • Code sample:

    import pandas as pd
    
    # Lê o primeiro arquivo
    subset_ad = pd.read_csv('subset_ad.csv', sep='\t', index_col=False)
    subset_ad = subset_ad.set_index('POS')
    
    print('subset_ad:')
    print('-' * 20)
    print(subset_ad)
    print('-' * 20)
    
    # Lê o segundo arquivo
    subset_pl = pd.read_csv('subset_pl.csv', sep='\t', index_col=False)
    subset_pl = subset_pl.set_index('POS')
    
    print('\n')
    print('subset_pl:')
    print('-' * 20)
    print(subset_pl)
    print('-' * 20)
    
    # Concatena os arquivos separando o conteúdo por uma vírgula
    df = subset_ad + ',' + subset_pl
    
    # Adiciona a coluna e a linha de somatório
    df['cont_row'] = [0 for _ in range(len(df))]
    df.loc['cont_col'] = [0 for _ in df.columns[:-1]] + ['']
    
    print('\n')
    print('subset_ad + subset_pl:')
    print('-' * 20)
    print(df)
    print('-' * 20)
    
    # Contabiliza as diferenças (unificando a letra onde não houver diferença)
    for index, row in df.iterrows():
        if index != 'cont_col':
            for col in df.columns[:-1]:
                val = row[col]
                letters = val.split(',')
                if letters[0] == letters[1]:
                    df.set_value(index, col, letters[0])
                else:
                    cnt = df.loc[index]['cont_row']
                    df.set_value(index, 'cont_row', cnt + 1)
    
                    cnt = df.loc['cont_col'][col]
                    df.set_value('cont_col', col, cnt + 1)
    
    print('\n')
    print('resultado final:')
    print('-' * 20)
    print(df)
    print('-' * 20)
    

    Result of this code:

    subset_ad:
    --------------------
             col1 col2 col3 col4
    POS
    28355991    A    A    A    A
    28356037    A    A    A    A
    28356130    A    A    A    A
    28356246    A    A    A    A
    --------------------
    
    
    subset_pl:
    --------------------
             col1 col2 col3 col4
    POS
    28355991    A    B    A    A
    28356037    B    B    B    A
    28356130    A    B    A    A
    28356246    A    A    B    A
    --------------------
    
    
    subset_ad + subset_pl:
    --------------------
             col1 col2 col3 col4 cont_row
    POS
    28355991  A,A  A,B  A,A  A,A        0
    28356037  A,B  A,B  A,B  A,A        0
    28356130  A,A  A,B  A,A  A,A        0
    28356246  A,A  A,A  A,B  A,A        0
    cont_col    0    0    0    0
    --------------------
    
    
    resultado final:
    --------------------
             col1 col2 col3 col4 cont_row
    POS
    28355991    A  A,B    A    A        1
    28356037  A,B  A,B  A,B    A        3
    28356130    A  A,B    A    A        1
    28356246    A    A  A,B    A        1
    cont_col    1    3    2    0
    --------------------
    
        
    09.03.2017 / 03:43