How to delete an entry from a Python file without having to read the whole file?

0

I have a file with the following entries:

Ana
Joao
Pedro
José
....

I need to delete the line with the name Pedro. It would be easy to read the whole file into a list, delete Pedro, and rewrite the file:

nomes = open('nomes.txt', 'r').readlines()
del_pedro(nomes)  # removes 'Pedro\n' from the list
open('nomes.txt', 'w').writelines(nomes)

But this file is huge, and speed is essential in this task. Is there a way to read through the file and, when I find the entry I want, just delete that line and keep reading? Something like:

nomes = open('nomes.txt', 'r')
for i in nomes:
    if i == 'Pedro\n':
        deleta(i)
    
asked by anonymous 04.01.2018 / 14:31

1 answer

5

Yes, you do need to read the whole file, change what you want in memory, and save it again.

This is best practice.

The main reason is that this is an unstructured text file: each line has a different length in bytes, and the file operations the operating system provides do not let you shrink or grow a small piece in the middle of a file. They only let you overwrite a few bytes in place, and the replacement must be exactly the same size.

So, technically, it would be possible to have your program overwrite each letter you wanted to delete with a space or "*" in the original file, but the performance is not as good as you might expect: the smallest unit that can be read from or written to a disk is typically a 4096-byte block anyway.

That is: you would end up with a complex, error-prone program that can lose your data if it is interrupted mid-run (by a system shutdown or another failure), and even though on the Python side you changed only 10 or 15 bytes, the disk I/O would still be 4096 bytes.
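Just to illustrate the idea (and its limits), here is a minimal sketch of that in-place "blanking" approach; the file path is made up for the demo. Note that the record does not actually disappear: it is only masked with "*", and every program that reads the file afterwards must know to skip such lines.

```python
import os
import tempfile

# Recreate the example file (path and contents are just for this demo)
caminho = os.path.join(tempfile.mkdtemp(), "nomes_demo.txt")
with open(caminho, "w", encoding="utf-8", newline="\n") as f:
    f.write("Ana\nJoao\nPedro\nJosé\n")

# Blank out "Pedro" in place: overwrite it with the same number of bytes,
# so the file size (and every other line's position) is unchanged
with open(caminho, "r+", encoding="utf-8", newline="\n") as f:
    while True:
        pos = f.tell()          # remember where this line starts
        linha = f.readline()
        if not linha:
            break
        if linha == "Pedro\n":
            f.seek(pos)
            # write exactly the same byte length, newline included
            f.write("*" * (len(linha.encode("utf-8")) - 1) + "\n")
            break

with open(caminho, encoding="utf-8") as f:
    print(f.read())
```

The `newline="\n"` arguments prevent newline translation, which would otherwise change the byte counts on some platforms.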

You say "very large file", but unless your file has far more than 100,000 names in this style (about 1 MB), and you perform many such operations per minute, the impact would be imperceptible in practice.

On the other hand, it is true that a text file with thousands of names in sequence is a very inefficient data structure: long before you get to that point, you should be using a mechanism suited to storing data efficiently, especially if the data is critical (and even more so if performance matters).

The Python language comes with the sqlite database ready for use, and for access from a single process its efficiency is comparable to big-name databases such as PostgreSQL and Oracle. Managing data like a list of names (and other data associated with it) in sqlite can give you a performance gain of 1,000 to 50,000 times compared to keeping the data in a plain .txt file.
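As a rough sketch of what that looks like with the standard library's sqlite3 module (the table name and the in-memory database are just for illustration; use a filename like "nomes.db" to persist the data):

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use "nomes.db" for a real file on disk
con.execute("CREATE TABLE IF NOT EXISTS nomes (nome TEXT PRIMARY KEY)")
con.executemany("INSERT OR IGNORE INTO nomes VALUES (?)",
                [("Ana",), ("Joao",), ("Pedro",), ("José",)])
con.commit()

# Deleting one entry is a single indexed operation - the database does
# not rewrite the whole dataset the way a flat text file would require
con.execute("DELETE FROM nomes WHERE nome = ?", ("Pedro",))
con.commit()

print([row[0] for row in con.execute("SELECT nome FROM nomes ORDER BY nome")])
```

The `PRIMARY KEY` on `nome` gives you an index, so both the `DELETE` and membership lookups run without scanning every row.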

With the tips you gave in the comments, and "guessing" the code you have there, you could do the following:

import os

def processa(arquivo_texto):
    nomes_pra_remover = set()
    with open(arquivo_texto) as arquivo:
        for linha in arquivo:
            try:
                funcao_que_consulta_o_mongo(linha)
            except Exception as error:
                # use some logging mechanism here - even a plain print will do
                nomes_pra_remover.add(linha)
    limpar_arquivo_texto(arquivo_texto, nomes_pra_remover)

def limpar_arquivo_texto(arquivo_texto, nomes_pra_remover):
    nome_novo = arquivo_texto + "_novo"
    with open(arquivo_texto) as entrada, open(nome_novo, "wt") as saida:
        for linha in entrada:
            if linha not in nomes_pra_remover:
                saida.write(linha)
    os.remove(arquivo_texto)
    os.rename(nome_novo, arquivo_texto)

This solution removes all the names that are of no use to you, reading and writing the whole file only once rather than once per name. On a regular PC, even with a file in the 10 MB range (~1,000,000 names), the task should take less than a second. The file-renaming dance in the second function also ensures that even if execution is interrupted you do not lose your data: at every moment you have the original file intact, except for the instant when it is deleted and the new file is renamed to the original name.

04.01.2018 / 17:49