Import CSV into a pandas DataFrame without converting between string and tuple

I have a CSV file with more than 3 million rows and about 770 MB. I use pandas and need to convert a column that comes in as a string. Below is the 'lnBins' column, which was read from the CSV in string format (what is the best format for saving this data to CSV?), and the columns 'lnBin1' to 'lnBin5', produced by the reshapeBin function shown further down.

tempFrame[['lnBins', 'lnBin1', 'lnBin2', 'lnBin3', 'lnBin4', 'lnBin5']].tail(2)

2445169 (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...   (0, 1, 1, 0, 0) (0, 1, 0, 1, 1) (1, 1, 0, 0, 0) (1, 1, 1, 1, 1) (0, 1, 1, 0, 1)
2445170 (0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, ...   (0, 1, 1, 0, 0) (0, 1, 0, 1, 1) (1, 1, 0, 0, 0) (1, 1, 1, 1, 1) (0, 1, 0, 1, 1)

As you can see, the 'reshapeBin' function has to chain several operations:

eval()
np.array()
.reshape(5,5)
[num]
.tolist()
tuple()

I use eval() to convert each cell from string to tuple, then convert it to an array and reshape it, take one row of the array with [num], convert it to a list, and then convert it to a tuple to store back in the table, so the table can be saved to CSV again.

This is the function; I think something can still be improved to make processing faster:

import numpy as np

def reshapeBin(x, num):
    # parse the string into a tuple, reshape to 5x5 and return row `num` as a tuple
    return tuple(np.array(eval(x)).reshape(5, 5)[num].tolist())

for n in range(0, 5):
    tempFrame['lnBin' + str(n + 1)] = tempFrame['lnBins'].apply(reshapeBin, num=n)
    print('finished', n)

Probably the way I am saving from pandas to CSV is not the best option, at least regarding the data format: tuples in the DataFrame, strings in the CSV, and conversions back and forth between the two.
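For reference, one possible round-trip format (not part of the original question; the column and file names here are just illustrative) is to store each tuple as a compact digit string such as "01100" when writing, and parse it back with a cheap converter when reading, so no eval() is needed:

import pandas as pd

def pack(t):
    # (0, 1, 1, 0, 0) -> "01100"
    return "".join(str(v) for v in t)

def unpack(s):
    # "01100" -> (0, 1, 1, 0, 0)
    return tuple(int(c) for c in s)

bin_cols = ["lnBin1", "lnBin2", "lnBin3", "lnBin4", "lnBin5"]

# writing: convert the tuple columns to compact strings first
out = tempFrame.copy()
for col in bin_cols:
    out[col] = out[col].map(pack)
out.to_csv("bins.csv", index=False)

# reading: converters parse each cell back into a tuple of ints
tempFrame2 = pd.read_csv("bins.csv", converters={col: unpack for col in bin_cols})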

    
asked by anonymous 08.03.2018 / 19:24

1 answer

I could not quite understand your problem - at least one full line of the array would have helped - nor whether you want to optimize the CSV encoding or just the final reading.

In questions like this, it helps to have access to at least one valid sample of the data you want to transform - you only posted two lines of text output (summarized by pandas) - and anyone wanting to help has to guess what the input looked like: that is not exactly the kind of thing that helps.

Still, I am writing an answer, because there is something very striking in your code: you repeat the whole process on the 25 numbers of your array 5 times for each row, by calling the "reshapeBin" function 5 times.

"eval" is also far from ideal to use. (If people outside your team have access to write in your data, any Python program can be placed in the middle of them and run on your server - but it is not the most performative thing about world).

Still, just considering the few lines you posted, you can avoid converting each row the same way 5 times by doing something like this:

...
from ast import literal_eval

tempFrame["data"] = tempFrame["lnBins"].apply(lambda x: np.array(literal_eval(x), dtype="u1")))

for i in range(5):
    tempFrame[f"lnBin{i+1}"] = tempFrame["data"].apply(lambda x: x[i * 5: i *5 + 5])

del tempFrame["data"]
Basically, instead of calling the expensive "eval" 5 times for each row, I create a temporary column with the string already decoded into numbers (and, to save memory, instead of a tuple we store an np.array there). Then I follow the same idea of the 5 applys in a row (note the current way of building the column name with an f-string), but each time taking only a slice of the sequence of 25 numbers (no reshape needed). I also use ast.literal_eval, which is less problematic than eval.
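To see what this does on a single row, here is a quick check using the first row from the question's output; the full 25-value string is my reconstruction (the question truncates it), pieced together from the lnBin1 to lnBin5 tuples shown there:

import numpy as np
from ast import literal_eval

# reconstructed sample row (the question only shows it truncated)
sample = "(0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1)"

data = np.array(literal_eval(sample), dtype="u1")
for i in range(5):
    print(f"lnBin{i+1}:", tuple(data[i * 5: i * 5 + 5].tolist()))
# lnBin1: (0, 1, 1, 0, 0) ... lnBin5: (0, 1, 1, 0, 1), matching the question's table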

    
08.03.2018 / 21:09