How to filter, select and count data in a pandas.DataFrame?

5

How to get multiple records based on multiple columns of a given name?

My dataframe looks something like this:

import pandas as pd

df = pd.DataFrame([["1111", True, True, False, True, True],
                   ["2222", True, False, True, True, False],
                   ["3333", True, False, True, True, True]],
                  columns=["id", "coluna_qualquer", "x_a", "x_b", "x_c", "x_d"])

I want the number of rows in which I have at least three columns with the value True , but considering only the columns that start with " x _ ", and do not consider the value of other columns (such as " column_any "). In this example, the lines with IDs "1111" and "3333" would respect this condition, that is, the return I want would be 2.

How to do this using pandas?

    
asked by anonymous 24.09.2018 / 19:01

1 answer

3

summarizing

Some separate steps are required for this. The DataFrame object is made so that the call to a method returns a new modified dataframe, and you can now concatenate the next operation directly. Then, to filter all rows in which columns "x_ *" have more than 3 True and get the total number, just do:

In [98]: (df.filter(like='x_').sum(axis=1) >= 3).sum()
Out[98]: 2

Let's go by parts

The first thing is to select an subdataframe with the desired columns. Pandas has the method filter that allows this - only the columns containing the text passed in the like argument are selected:

In [91]: filtered_df = df.filter(like='x_')

In [92]: filtered_df
Out[92]: 
     x_a    x_b   x_c    x_d
0   True  False  True   True
1  False   True  True  False
2  False   True  True   True

(If Pandas did not have this, the path would be to use pure Python to select the names of the desired columns

...
data_columns = [col_name for col_name in df.columns if col_name.startswith("x_")]

And then method loc of the dataframe, which accepts the names of a "select all", leaving the value : :

filtered_df = df.loc[:, data_columns]

)

At this point you have only the columns that interest you, and we can do the count -

     x_a    x_b   x_c    x_d
0   True  False  True   True
1  False   True  True  False
2  False   True  True   True

Here we can abuse a Python characteristic - the False and True values are a subclass of integers, and can participate in a sum as if they were 0 and 1 respectively. So the% method of the dataframe itself can give the sum value of each row of the table (we just need to indicate that we want the sum of the rows passing sum , otherwise the axis=1 results in the sum of the values in each column :

In [93]: count_df = filtered_df.sum(axis=1)

In [94]: count_df
Out[94]: 
0    3
1    2
2    3
dtype: int64

(If the value to be found was not True, or the desired one was not just to count the occurrences, instead of sum , we would use .sum - that allows to pass a generic function that will receive each line of the dataframe (or each column if axis == 0), and generate a result.)

Finally, to know how many of these lines has a value above 3 - we apply the .apply operator. pandas redefines all binary operators - whether arithmetic or comparison, to create a new dataframe, with the result of the operation in each cell - that is:

In [95]: count_df >= 3
Out[95]: 
0     True
1    False
2     True
dtype: bool

Then just repeat the >= 3 , this time letting it add the sum to the column:

In [95]: (count_df >= 3).sum()
Out[95]: 2
    
24.09.2018 / 21:27