I have a CSV file with more than 3 million lines, about 770 MB, and I work on it with pandas. I need to convert a column that is stored as a string. Below are the 'lnBins' column, which comes back as a string when read from the CSV (what is the best format for saving this data to CSV?), and the columns lnBin1 to lnBin5 produced by the reshapeBin function further down.
    tempFrame[['lnBins', 'lnBin1', 'lnBin2', 'lnBin3', 'lnBin4', 'lnBin5']].tail(2)

    2445169  (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...  (0, 1, 1, 0, 0)  (0, 1, 0, 1, 1)  (1, 1, 0, 0, 0)  (1, 1, 1, 1, 1)  (0, 1, 1, 0, 1)
    2445170  (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...  (0, 1, 1, 0, 0)  (0, 1, 0, 1, 1)  (1, 1, 0, 0, 0)  (1, 1, 1, 1, 1)  (0, 1, 0, 1, 1)
As you can see, the reshapeBin function has to chain several operations: eval(), np.array(), .reshape(5,5), [num], .tolist() and tuple().
I use eval() to turn the cell back from a string into a tuple, convert that to a NumPy array and reshape it to 5x5, take row [num] of the array, convert it to a list and finally to a tuple, which is stored in the table so the table can be written out to CSV again.
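To make the chain concrete, here is a minimal sketch of what one call does to a single cell. The 25-value string is my reconstruction of sample row 2445169 shown above, not a value copied straight from the file:

    import numpy as np

    # One cell reconstructed from sample row 2445169 (25 binary values as a tuple string)
    cell = '(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1)'

    arr = np.array(eval(cell)).reshape(5, 5)  # 5x5 array of 0/1 ints
    print(tuple(arr[0].tolist()))             # (0, 1, 1, 0, 0) -> lnBin1
    print(tuple(arr[4].tolist()))             # (0, 1, 1, 0, 1) -> lnBin5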
The function works, but I think something can still be improved to make the processing faster:
    def reshapeBin(x, num):
        return tuple(np.array(eval(x)).reshape(5,5)[num].tolist())

    for n in range(0, 5):
        tempFrame['lnBin'+str(n+1)] = tempFrame['lnBins'].apply(reshapeBin, num=n)
        print('finished ', n)
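One direction I am considering (a sketch only, not tested at the full 3-million-row scale) is to parse each 'lnBins' string once and build all five columns in a single pass, instead of running eval() and reshape five separate times per row. Here literal_eval stands in for eval(), since the cells are plain tuple literals, and the helper names splitBins/binFrame are made up for the example:

    import numpy as np
    import pandas as pd
    from ast import literal_eval

    def splitBins(x):
        # Parse the string once, reshape to 5x5, return the five rows as tuples
        return [tuple(row) for row in np.array(literal_eval(x)).reshape(5, 5).tolist()]

    binFrame = pd.DataFrame(tempFrame['lnBins'].apply(splitBins).tolist(),
                            index=tempFrame.index,
                            columns=['lnBin' + str(n + 1) for n in range(5)])
    tempFrame = tempFrame.join(binFrame)   # assumes lnBin1..lnBin5 do not exist yet

The only point of the sketch is that the expensive string parsing happens once per row instead of five times.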
Probably the way I am saving from pandas to CSV is not the best option either, at least as far as the data format goes: tuples in the table become strings in the CSV, and vice versa on reading.
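For reference, this is the round trip I am doing today in miniature (the file name is just an example): to_csv() writes each tuple cell as its string representation, and read_csv() gives that string back, which is why the parsing step is needed in the first place.

    import pandas as pd
    from ast import literal_eval

    tempFrame.to_csv('bins.csv', index=False)    # tuples are written as "(0, 1, 1, 0, 0)"

    reloaded = pd.read_csv('bins.csv')
    print(type(reloaded['lnBin1'].iloc[0]))                       # <class 'str'>
    reloaded['lnBin1'] = reloaded['lnBin1'].apply(literal_eval)   # back to tuples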