I have two csv separated by tab. Both have the same number of rows and columns. The first column POS
has the same unique values in both dfs. The differences (or not) occur in the values (strings) of columns col1:col4
.
I thought of doing a query, something like: query = subset_pl(subset_pl.isin(subset_ad))
and from that continue with the code, but I got stuck there ...
import pandas as pd
subset_ad = pd.read_csv('subset_ad.csv', sep='\t')
subset_ad.set_index('POS')
subset_ad
POS col1 col2 col3 col4
28355991 A A A A
28356037 A A A A
28356130 A A A A
28356246 A A A A
subset_pl = pd.read_csv('subset_pl.csv', sep='\t')
subset_pl.set_index('POS')
subset_pl
POS col1 col2 col3 col4
28355991 A B A A
28356037 B B B A
28356130 A B A A
28356246 A A B A
What I want to achieve is: compare subset_ad
with subset_pl
, update subset_ad
with value of subset_pl
keeping subset_ad
separated by ,
(A, B for example), if there are different values and count these differences in both rows and columns by adding a row and a plus column ( cont_col
, cont_row
) to display the count of cells that have changed ...
The output would be something like:
subset_ad
POS col1 col2 col3 col4 cont_row
28355991 A A,B A A 1
28356037 A,B A,B A,B A 3
28356130 A A,B A A 1
28356246 A A A,B A 1
cont_col 1 3 2 0
Any directions will be welcome!