Selecting different intervals in a giant dataframe in RStudio

I have a large CSV of stock dates and their closing prices, too big to open in Excel.

The stock name is in the same column as the date and appears only at the beginning of each series, as shown below:

I have limited knowledge of R and need a function to help me read these ranges.

NOTE: The stock name is always enclosed in parentheses: (AÇÃO X)

python r big-data

asked by anonymous 07.12.2018 / 16:17

1 answer

score 2 · Answer 1

One way to do this (I don't know if it's the most efficient, but it works):

Find where the separators are, i.e. which rows have empty text in each of the columns, and save their indexes in a list (linhasVazias)

Loop over each index in that list and split the DataFrame into sub-series at that index (each subset containing one stock)

Reformat the resulting subset into the new format, with the stock name in its own column

Append it to the final DataFrame, which collects the reformatted rows

Here is the code where I do these operations:

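import pandas as pd

# Assumes df has the default RangeIndex and text columns 'c0' and 'c1'.
# Indexes of the blank separator rows between series
linhasVazias = df[(df['c0'] == "") & (df['c1'] == "")].index.tolist()

partes = []
anterior = -1

for i in linhasVazias:
    # Slice out the rows of one series (copy to avoid SettingWithCopyWarning)
    temp = df.iloc[anterior + 1 : i].copy()

    # Create the new column with the stock name (first cell of the series)
    temp['c2'] = temp.iloc[0, 0]

    # Remove the first row, which only holds the stock name
    temp = temp.iloc[1:]

    # Collect the rows of this series
    partes.append(temp)
    anterior = i

# Note: rows after the last blank row are not processed by this loop.
# Concatenate once (DataFrame.append was removed in pandas 2.0) and
# reset the index, discarding the old index values
if partes:
    df_final = pd.concat(partes).reset_index(drop=True)
else:
    df_final = pd.DataFrame(columns=['c0', 'c1', 'c2'])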

Notes:
- here I used "c0", "c1" and "c2" as the column names
- for your case, with a very large DataFrame, I don't know whether this will be efficient, but it is worth testing.
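
If the loop turns out to be too slow on a very large DataFrame, a vectorized variant may be worth trying. The sketch below is not the code above: it relies on the rule from the question that the stock name is always enclosed in parentheses, marks those rows with a regular expression, and carries the name down with a forward fill. The function name split_by_stock and the sample data are hypothetical, and it assumes the same "c0"/"c1" column names with the name stored in "c0".

```python
import pandas as pd


def split_by_stock(df: pd.DataFrame) -> pd.DataFrame:
    """Attach the stock name to every price row without an explicit loop."""
    c0 = df["c0"].fillna("").astype(str).str.strip()

    # Rows whose first column looks like "(NAME)" start a new series
    is_header = c0.str.match(r"^\(.*\)$")

    # Rows with nothing in the first column are treated as separators
    is_blank = c0 == ""

    # Keep the name only on header rows, then carry it down to the rows below
    names = c0.where(is_header).ffill()

    out = df.assign(c2=names)

    # Drop the header and separator rows, keeping only the date/price rows
    return out[~is_header & ~is_blank].reset_index(drop=True)


# Hypothetical sample in the layout described in the question
sample = pd.DataFrame({
    "c0": ["(AÇÃO X)", "2018-12-03", "2018-12-04", "", "(AÇÃO Y)", "2018-12-03"],
    "c1": ["", "10.5", "10.7", "", "", "20.1"],
})
result = split_by_stock(sample)
print(result)
```

Unlike the loop, this does not need a blank row after the last series, since each row takes its name from the nearest header above it. The name is kept with its parentheses, as in the loop version.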