Identify a numerical sequence in a text file

1

I'm new to Python and stuck on a problem I can't find a solution to. I have a folder with about 10k .txt files (written in the most varied ways). I need to extract the FIRST sequence of 17 digits, which is located in the first lines of each file, and rename the file with the extracted sequence.

Sometimes this sequence appears concatenated, and other times it appears separated by periods and a hyphen (e.g. 00273200844202003, 00588.2007.011.02.00-9). PS: the text contains other numeric sequences, some with 17 digits and some with a different length, but the one I need is always the first 17-digit sequence that appears.

I stored the current names of the documents in a list, and I was trying to find the sequence of numbers in the text using the NLTK package, but without success.

import os

pasta_de_documentos = (r'''C:\Users\mateus.ferreira\Desktop\Estudos\Python\Doc_Classifier\TXT''')
documentos = os.listdir(pasta_de_documentos)

If someone knows a better approach or can point me to a way to keep attacking the problem, I would be grateful. (I'm using Python 3.)

    
asked by anonymous 16.11.2017 / 12:20

3 answers

2

One solution is to get the value through a regular expression. To cover both possibilities, you can make the periods and hyphen between the digits optional. It would look something like:

r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)'

The r prefix defines the string as raw. The parentheses create a capture group for the regular expression, and the group is made up of:

  • a 5-digit sequence;
  • optionally followed by a period;
  • a 4-digit sequence;
  • optionally followed by a period;
  • a 3-digit sequence;
  • optionally followed by a period;
  • a 2-digit sequence;
  • optionally followed by a period;
  • a 2-digit sequence;
  • optionally followed by a hyphen;
  • a 1-digit sequence.
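As a quick check of the breakdown above, here is a minimal sketch that runs the same pattern, through Python's re module, against the two sample values given in the question. The names PATTERN, samples and found are only illustrative.

```python
import re

# Same pattern as above; the capture group is dropped because group(0) is used
PATTERN = re.compile(r'\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}-?\d')

# The two sample values quoted in the question
samples = ['00273200844202003', '00588.2007.011.02.00-9']

# search() returns the first match in each string
found = [PATTERN.search(text).group(0) for text in samples]
print(found)
```

Both formats are matched by the single expression, since every separator is optional.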

With Python, you can use the re module to handle the contents of the file along with the regular expression:

import re

with open('data.txt') as content:
    search = re.search(r'(\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}\-?\d)', content.read())
    if search is not None:
        print(search.group(0))


Thus, the value of search.group(0) will be the first 17-digit value, with or without separators, found in the data.txt file. If you have multiple files, you simply go through all of them and apply the same logic. It is also worth reading about the glob module, which might be useful to you.
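A minimal sketch of that loop follows, assuming the files sit directly inside one folder and that a file should be left alone when no sequence is found or when the target name already exists. The function name rename_by_sequence and its return value are hypothetical, chosen only for this illustration.

```python
import glob
import os
import re

PATTERN = re.compile(r'\d{5}\.?\d{4}\.?\d{3}\.?\d{2}\.?\d{2}-?\d')


def rename_by_sequence(folder):
    """Rename each .txt file in `folder` after the first sequence it contains."""
    renamed = {}
    for path in glob.glob(os.path.join(folder, '*.txt')):
        with open(path) as content:
            match = PATTERN.search(content.read())
        if match is None:
            continue  # no sequence found: leave the file untouched
        # Drop the periods and the hyphen so the name holds only the 17 digits
        digits = re.sub(r'\D', '', match.group(0))
        target = os.path.join(folder, digits + '.txt')
        if not os.path.exists(target):  # assumption: never overwrite a file
            os.rename(path, target)
            renamed[path] = target
    return renamed
```

Returning the old and new paths makes it easy to log what was changed before running it on the real folder.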

    
16.11.2017 / 13:43
1

You can use regular expressions for this.

The idea is a regular expression that finds every run of digits, "-" and "." with at least 17 characters. It would be possible to refine the expression until it matches exactly 17 digits by itself, but I think that gets too complex, so I'd rather combine the regular expression with some logic in Python.

Because the files are small (10 KB, and the same would hold even if they were 30 times larger), you do not need any special logic to read only part of the file and search there. On the other hand, nothing prevents you from reading only the first 4 KB of each file if the sequence is always there (about 400 lines, if the lines are not long).

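import os
import re

def encontra_nome(pasta, nome_do_arquivo):
    # Read only the first 4 KB, where the sequence is expected to be
    with open(os.path.join(pasta, nome_do_arquivo)) as arquivo:
        dados = arquivo.read(4096)
    # Candidates: runs of 17 to 35 digits, "." and "-" (no space inside {17,35})
    sequencias = re.findall(r"[0-9\.\-]{17,35}", dados)
    for seq in sequencias:
        # Strip the separators so that only the digits are counted
        sequencia_limpa = re.sub(r"\-|\.", "", seq)
        # The question asks for the first sequence with exactly 17 digits
        if len(sequencia_limpa) == 17:
            return sequencia_limpa
    raise ValueError("17-digit sequence not found")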

The regular expression r"[0-9\.\-]{17,35}" looks for, as described, any run of 17 to 35 characters made up of digits, "-" and ".". This allows up to one separator after each digit, so it should cover all the possible formats. I preferred this to complicating the regular expression, since regular expressions are neither especially readable nor easy to write, just to make it "count only the digits, ignore the other characters, and find exactly 17". A single regular expression for this would certainly be possible. Instead, once all the candidates are found, I use a linear search with a for loop and strip the "-" and "." from each one, this time with a very simple regular expression that replaces every "-" and "." with "".
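For comparison, here is one way the single-expression variant might look. This is only a sketch under the assumption that at most one separator may sit between two digits; the names PADRAO_17 and primeira_sequencia are hypothetical and not part of the code above.

```python
import re

# One digit, then 16 times "optional separator + digit" = 17 digits in total.
# The lookbehinds reject a match that starts in the middle of a longer number,
# and the lookahead rejects a match that is followed by yet another digit.
PADRAO_17 = re.compile(r'(?<!\d)(?<!\d[.\-])\d(?:[.\-]?\d){16}(?![.\-]?\d)')


def primeira_sequencia(texto):
    """Return the first run of exactly 17 digits, without separators, or None."""
    achado = PADRAO_17.search(texto)
    if achado is None:
        return None
    return re.sub(r'[.\-]', '', achado.group(0))
```

It does the counting inside the pattern, at the price of lookarounds that are harder to read than the for loop.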

I often prefer two calls to the string replace method for this step, but since we are already using regular expressions, there is no reason not to use one more. The barrier is not performance or anything like that, but rather the "oops, here comes a regular expression" reaction from the people maintaining the code.
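For reference, the replace-based cleanup mentioned above would look roughly like this, using one of the sample values from the question:

```python
seq = '00588.2007.011.02.00-9'

# Two chained str.replace calls instead of re.sub
sequencia_limpa = seq.replace('-', '').replace('.', '')
print(sequencia_limpa)  # 00588200701102009
```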

    
16.11.2017 / 13:45
1

You can use the glob module to retrieve a list with the names of all the .txt files in a given directory.

Iterating over this list, you can open each of the files, read only the first line, and extract only the digits:

linha = entrada.readline()
digitos = (''.join(s for s in linha if s.isdigit()))

Of the digits that were read, only the first 17 are kept and concatenated with the .txt extension:

destino = digitos[:17] + '.txt'

Once the name of the output file is built, you can use the shutil module to copy the file under the new name.

Here is an example that can solve your problem:

import shutil
import glob


# Get the list of all .txt files in a given directory
lista_arquivos = glob.glob('/tmp/teste/*.txt')

# For each file in the list
for origem in lista_arquivos:

    # Open the source file for reading in text mode
    with open(origem) as entrada:

        # Read only the first line of the source file
        linha = entrada.readline()

        # Keep only the digits of the line that was read
        digitos = ''.join(s for s in linha if s.isdigit())

        # Build the destination file name
        destino = digitos[:17] + '.txt'

    # Show the processing status
    print("{} -> {}".format(origem, destino))

    # Copy the source file to the destination
    shutil.copyfile(origem, destino)
    
16.11.2017 / 14:25