Manipulation of columns with pandas

I'm running a regression where I have 3 parameters and a column with categories.

Since sklearn does not handle categorical columns directly, I convert them into dummies: I create one column per category and fill it with 1 if the row belongs to that category and 0 otherwise.

from sklearn import preprocessing
myEncoder = preprocessing.OneHotEncoder()
myEncoder.fit(df_c_f[['segment_id']])
dummies = myEncoder.transform(df_c_f[['segment_id']]).toarray()

So my array, which initially had n rows and 4 columns, now has 3 parameter columns plus c category columns.

My question is how to combine (multiply) each of my first 3 columns with every dummy column, so that I end up with n rows and 3 * c columns.

I wrote the following code to do this, but it only works for small arrays; with anything slightly larger the code hangs.

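import numpy as np

matrix = []  # module-level list that collects one result row per input row

def itera_parametros_e_dummies(matrix1, matrix2):
    print(len(matrix1))
    if len(matrix1) != len(matrix2):
        print("matrices have different sizes")
    else:
        for i in range(len(matrix1)):
            # combine the i-th row slice of both inputs with np.dot, one row per loop step
            matrix.append(np.dot(matrix1[i:i+1], (matrix2[i:i+1]))[0])
    return matrix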

itera_parametros_e_dummies(log_orgc_traf,df_dummies)
    
asked by anonymous 19.05.2017 / 03:31

1 answer

First, about creating the dummies. For a linear regression with an intercept, you should drop one of the dummy columns: if there are c categories, use c - 1 dummy columns. Otherwise the dummy columns always sum to 1 and are perfectly collinear with the intercept, which is known as the dummy variable trap.
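As a small illustration of dropping one dummy column, here is a sketch using pandas' get_dummies with drop_first=True. The toy data is hypothetical; only the column name segment_id comes from the question.

```python
import pandas as pd

# Hypothetical toy data; only the column name segment_id is from the question.
df = pd.DataFrame({"segment_id": ["a", "b", "c", "a", "b"]})

# All c = 3 dummy columns: every row sums to 1, duplicating an intercept term.
full = pd.get_dummies(df["segment_id"], dtype=int)

# drop_first=True keeps c - 1 = 2 columns; category "a" becomes the baseline.
reduced = pd.get_dummies(df["segment_id"], drop_first=True, dtype=int)

print(full.shape, reduced.shape)
```

A row with zeros in every remaining column then represents the dropped baseline category.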

By design, OneHotEncoder always produces an output with the same number of rows as the input dataset. Instead of calling myEncoder.fit(df_c_f[['segment_id']]) followed by transform, you can use dummies = myEncoder.fit_transform(df_c_f[['segment_id']]).toarray(), which saves a line.
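To check that the one-step call gives the same result as the two-step version, here is a minimal sketch. The contents of df_c_f are made up for illustration; the names myEncoder, df_c_f and segment_id follow the question.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical data standing in for the asker's df_c_f.
df_c_f = pd.DataFrame({"segment_id": [10, 20, 30, 10]})

# Two steps, as in the question: fit, then transform.
myEncoder = preprocessing.OneHotEncoder()
myEncoder.fit(df_c_f[["segment_id"]])
two_step = myEncoder.transform(df_c_f[["segment_id"]]).toarray()

# One step: fit_transform on a fresh encoder.
dummies = preprocessing.OneHotEncoder().fit_transform(df_c_f[["segment_id"]]).toarray()

print(np.array_equal(two_step, dummies))
print(dummies.shape)  # same number of rows as df_c_f, one column per category
```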

I also don't quite understand the reason for the multiplication or what result you expect from it.
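If the goal is one interaction column for every pair of parameter and dummy (n rows and 3 * c columns), a vectorized sketch with NumPy broadcasting avoids the Python-level loop. This is only a guess at the intended result; the function name interact and the sample arrays are hypothetical.

```python
import numpy as np

def interact(params, dummies):
    """Multiply every parameter column by every dummy column, row by row."""
    params = np.asarray(params, dtype=float)    # shape (n, p)
    dummies = np.asarray(dummies, dtype=float)  # shape (n, c)
    if params.shape[0] != dummies.shape[0]:
        raise ValueError("matrices have different numbers of rows")
    n = params.shape[0]
    # (n, p, 1) * (n, 1, c) broadcasts to (n, p, c), then flattens to (n, p * c).
    return (params[:, :, None] * dummies[:, None, :]).reshape(n, -1)

# Hypothetical inputs: 2 rows, 3 parameters, 2 dummy columns.
params = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
dummies = np.array([[1.0, 0.0],
                    [0.0, 1.0]])

result = interact(params, dummies)
print(result.shape)
```

Columns are ordered parameter by parameter, so the first c columns hold the first parameter multiplied by each dummy in turn.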

    
14.07.2017 / 05:55