summarizing
Some separate steps are required for this. The DataFrame object is made so that the call to a method returns a new modified dataframe, and you can now concatenate the next operation directly. Then, to filter all rows in which columns "x_ *" have more than 3 True
and get the total number, just do:
In [98]: (df.filter(like='x_').sum(axis=1) >= 3).sum()
Out[98]: 2
Let's go by parts
The first thing is to select an subdataframe with the desired columns.
Pandas has the method filter
that allows this - only the columns containing the text passed in the like
argument are selected:
In [91]: filtered_df = df.filter(like='x_')
In [92]: filtered_df
Out[92]:
x_a x_b x_c x_d
0 True False True True
1 False True True False
2 False True True True
(If Pandas did not have this, the path would be to use pure Python to select the names of the desired columns
...
data_columns = [col_name for col_name in df.columns if col_name.startswith("x_")]
And then method loc
of the dataframe, which accepts the names of a "select all", leaving the value :
:
filtered_df = df.loc[:, data_columns]
)
At this point you have only the columns that interest you, and we can do the count -
x_a x_b x_c x_d
0 True False True True
1 False True True False
2 False True True True
Here we can abuse a Python characteristic - the False
and True
values are a subclass of integers, and can participate in a sum as if they were 0 and 1 respectively. So the% method of the dataframe itself can give the sum value of each row of the table (we just need to indicate that we want the sum of the rows passing sum
, otherwise the axis=1
results in the sum of the values in each column :
In [93]: count_df = filtered_df.sum(axis=1)
In [94]: count_df
Out[94]:
0 3
1 2
2 3
dtype: int64
(If the value to be found was not True, or the desired one was not just to count the occurrences, instead of sum
, we would use .sum
- that allows to pass a generic function that will receive each line of the dataframe (or each column if axis == 0), and generate a result.)
Finally, to know how many of these lines has a value above 3 - we apply the .apply
operator. pandas redefines all binary operators - whether arithmetic or comparison, to create a new dataframe, with the result of the operation in each cell - that is:
In [95]: count_df >= 3
Out[95]:
0 True
1 False
2 True
dtype: bool
Then just repeat the >= 3
, this time letting it add the sum
to the column:
In [95]: (count_df >= 3).sum()
Out[95]: 2