Python too slow

2

Can anyone help me? I read a file, make some changes, and then save the result to another folder, but this takes 2 hours. The file has 15 million lines. Is there a different, more efficient way to do this?

import pandas as pd

# READ THE FILE FROM THE STAGING FOLDER
arq5 = pd.read_csv(r'C:\Users\Usuário\staging\arquivo5.txt',delimiter='\t',encoding='cp1252',engine='python')


# MAKE THE CHANGES TO THE FILE (DROP UNWANTED COLUMNS)
columns = ['PERIODO', 'CRM', 'CAT', 'MERCADO', 'MERCADO_PX', 'CDGLABORATORIO', 'CDGPRODUTO', 'PX']
arq5.drop(columns, inplace=True, axis=1)

# SAVE FILE 5 AS CSV IN THE ALPHA FOLDER
arq5.to_csv(r'C:\Users\Usuário\alpha\arquivo5.txt', index=False)
    
asked by anonymous 17.10.2018 / 19:52

1 answer

3

pandas loads the entire file into memory, which can be slow for very large files.

Try not to load the entire file at once. The code below does the same job as yours, but without pandas and without holding the whole file in memory: it reads the source file line by line, drops the unwanted columns, and writes each row directly to the destination:

import csv

colunas_remover = ['PERIODO', 'CRM', 'CAT', 'MERCADO', 
    'MERCADO_PX', 'CDGLABORATORIO', 'CDGPRODUTO', 'PX']
nome_arquivo = r'C:\Users\Usuário\staging\arquivo5.txt'
destino = r'C:\Users\Usuário\alpha\arquivo5.txt'

# READ THE FILE WHILE WRITING THE RESULT TO THE OTHER FOLDER
# (the output keeps the source's tab delimiter and cp1252 encoding)
with open(nome_arquivo, encoding='cp1252', newline='') as f:
    cf = csv.DictReader(f, delimiter='\t')
    with open(destino, 'w', encoding='cp1252', newline='') as fw:
        colunas_manter = [c for c in cf.fieldnames if c not in colunas_remover]
        cw = csv.DictWriter(fw, colunas_manter, delimiter='\t',
            extrasaction='ignore') # ignores whatever is not in "manter" (keep)
        cw.writeheader()
        cw.writerows(cf)
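
The same streaming idea can also be sketched with plain csv.reader and csv.writer, selecting columns by position instead of by name. This is only a variant of the code above, not a measured improvement: it avoids building a dict for every row, which may help on 15 million lines, but it is worth timing on a sample first. The function name drop_columns and its parameters are made up for this illustration, and the sketch assumes every row has at least as many fields as the header.

```python
import csv


def drop_columns(src_path, dst_path, drop, encoding="cp1252", delimiter="\t"):
    """Stream src_path to dst_path, omitting the columns named in `drop`.

    Hypothetical helper for illustration. Returns the number of data rows
    written. Assumes each row has at least as many fields as the header.
    """
    drop = set(drop)
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        reader = csv.reader(src, delimiter=delimiter)
        writer = csv.writer(dst, delimiter=delimiter)

        header = next(reader, None)
        if header is None:  # empty input file: nothing to copy
            return 0

        # Positions of the columns to keep, computed once from the header.
        keep = [i for i, name in enumerate(header) if name not in drop]
        writer.writerow([header[i] for i in keep])

        rows = 0
        for row in reader:  # one row in memory at a time
            writer.writerow([row[i] for i in keep])
            rows += 1
        return rows
```

With the variables from the answer, it would be called as drop_columns(nome_arquivo, destino, colunas_remover). Like the code above, it writes a tab-delimited cp1252 file.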
    
17.10.2018 / 22:42