Selecting different intervals in a giant dataframe in RStudio

0

I have a large CSV with stock dates and their closing prices, too big to open in Excel.

The stock name is in the same column as the date and only appears at the beginning of each series, as shown below:

I have limited knowledge of R and need a function to help me read these ranges.

NOTE: The stock name is always enclosed in parentheses: (AÇÃO X)

    
asked by anonymous 07.12.2018 / 16:17

1 answer

2

One way to do this (I do not know if it is the most efficient, but it works) would be:

  • Find where the separators are, i.e., which rows have empty text values in every column, and save their indexes in a list (linhasVazias)

  • Loop over each index in that list and split the DataFrame into sub-series according to the index (each subset containing one stock)

  • Reformat the DataFrame holding each resulting subset into the new format, with the stock name in its own column

  • Save it into the final df, which receives the new information

  • Here is the code where I do these operations:

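    import pandas as pd

    # Indexes of the blank separator rows (both columns empty).
    # Assumes df has the default 0..n-1 index and blanks were read as "".
    linhasVazias = df[(df['c0'] == "") & (df['c1'] == "")].index.tolist()

    partes = []
    anterior = -1

    # len(df) is a final sentinel, so the last series is kept even if
    # the file does not end with a blank row
    for i in linhasVazias + [len(df)]:
        # Slice out one series (the rows between two blank rows)
        temp = df.iloc[anterior + 1 : i].copy()
        anterior = i
        if temp.empty:
            continue

        # Create the new column with the stock name (first row, first column)
        temp['c2'] = temp.iloc[0, 0]

        # Drop the first row, which only holds the stock name
        temp = temp.iloc[1:]

        partes.append(temp)

    # Join all series and reset the index, discarding the old values
    # (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
    df_final = pd.concat(partes).reset_index(drop=True)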
    

    Notes:
    - Here I used "c0", "c1" and "c2" as the column names.
    - For your case, with a very large DataFrame, I do not know whether the processing will be efficient, but it is worth testing.
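If the loop turns out to be slow on a very large file, a vectorized variant may be worth trying. The sketch below is only one possible approach, not the method used above: it relies on the question's note that the stock name is always in parentheses, marks those rows, and carries the name down with a forward fill. The sample data, the column names "c0"/"c1"/"c2", and the helper name `tag_series` are assumptions made for illustration.

```python
import io
import pandas as pd

# Hypothetical sample in the layout described in the question:
# a "(NAME)" row, then date/price rows, then a blank separator row.
sample = """(AÇÃO X),
01/01/2018,10.5
02/01/2018,10.7
,
(AÇÃO Y),
01/01/2018,20.1
"""

# keep_default_na=False keeps empty cells as "" instead of NaN
df = pd.read_csv(io.StringIO(sample), header=None, names=["c0", "c1"],
                 dtype=str, keep_default_na=False)


def tag_series(df):
    """Return the data rows with the stock name added as column 'c2'."""
    # Rows whose first column is a name in parentheses start a new series
    is_name = df["c0"].str.match(r"^\(.*\)$", na=False)
    # Keep the name only on its own row, then carry it down to later rows
    names = df["c0"].where(is_name).ffill()
    # Blank separator rows: every column empty (or missing)
    is_blank = df[["c0", "c1"]].fillna("").eq("").all(axis=1)
    # Keep only the data rows and attach the propagated name
    out = df.loc[~is_name & ~is_blank].assign(c2=names)
    return out.reset_index(drop=True)


result = tag_series(df)
print(result)
```

Like the loop, this keeps the name exactly as it appears in the file, parentheses included. Because it avoids slicing and concatenating one piece per stock, it should scale better with the number of series, though that is untested on data of the size described in the question.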

        
    07.12.2018 / 17:31