Regex cannot find all the required expressions


Hello, I'm new to Python and I'm probably taking a very amateurish approach to this problem. I'm trying to find N-digit sequences in .txt files (in this example, N = 17). I have thousands of files, and I've been writing regular expressions for the most common formats I've noticed, for example:

  

01582.2005.012.02.00- 3 \r\n
# 01387.2009.466.02.001
# 01462. 2008. 030. 02. 000 \r\n
#0033620084610200-0 \r\n
n. 02414.2008.023.02.001 (201 ...
No. 00030.2007.084.02.00-3 (2
# 00627.2009.006.02.004
# 0001491-6020125020
number: 00028.2009.031.02.00-0
n 00012.2010.391.02.00-0 - 7ª tu
# 0000695720135020402
# 00037.2007.048.02.00-1
01113.2009.074.02.00.4. \r
proc: - 0002396-25.2011.5.02-0020
# 0163100-53-2010.5.02.0341
# 01230.2007.065.02.0.0-5 - 7 th
# 64587.2009. \r\n 549.02.001

The regular expressions I created found the sequences in about 70% of the files, but I have reached a point where each new expression I write finds so few additional sequences, relative to what I need, that it feels like counting grains of sand. Some of the regexes I used were:

search = re.search(r'((\d{5})\.?\s*(\d{4})\.?\s*(\d{3})\.?\s*(\d{2})\.?\s*(\d{2})\-?\s*(\d))', content.read())
search = re.search(r'((\d{5})\.(\d{4})\.(\d{3})\.(\d{2})\.(\d)\-(\d{2}))', content.read())
search = re.search(r'((\d{5})\.(\d{4})\.(\d{3})\.(\d{2})\.(\d{2})\.(\d))', content.read())

They matched some of the examples I gave above, but not most of them. What I would like to know is how I can take a broader approach with my regex than the one I am taking now. Thank you.

Edit: One of the biggest problems is that my regexes fail when there is a line break in the middle of the sequence, or when there are spaces between the dots. It may also be relevant that the text can contain other sequences of numbers on the same line and/or on other lines.

    
asked by anonymous 23.11.2017 / 20:20

1 answer

  

I'm trying to find N-digit sequences in .txt files (in this example, N = 17).

     

One of the biggest problems is that my regexes fail when there is a line break in the middle of the sequence [...]

     

[...] I only need to find the 17-digit sequence; the letters are not necessary.

So, based on what you mentioned in the comments and in the question, you are trying to find only the digits, which have non-numeric characters between them and may also be split by line breaks. Why not simply capture the digits one at a time and join them into a string?

I think the following code will solve your problem.

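import re

contador = 0
sequencia = ""
REGEX = r'(\d)'  # one digit per match
pattern = re.compile(REGEX, re.UNICODE)
# Read the whole file at once so line breaks inside a sequence do not matter
with open("file.txt") as f:
    content = f.read()
for match in pattern.findall(content):
    contador += 1
    sequencia = sequencia + match
    # Every 17 digits, print the sequence and start a new one
    if contador % 17 == 0:
        print(sequencia)
        sequencia = ""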
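
If the text also contains other numbers, as the question mentions, a variation on the same idea may help. This is not the code above but a minimal sketch: it matches a run of 17 digits in which each digit may be preceded by separators, then strips the separators. The separator set (whitespace, dots, and hyphens) is an assumption based on the examples in the question, and `find_sequences` is a hypothetical name used only for illustration.

```python
import re

# Assumption: the digits of one sequence are separated only by whitespace
# (including line breaks), dots, or hyphens. Adjust the class if needed.
SEPARATORS = r'[\s.\-]*'

# One digit, then 16 more, each optionally preceded by separators.
# The lookarounds reject a run that is directly glued to further digits.
PATTERN = re.compile(r'(?<!\d)\d(?:' + SEPARATORS + r'\d){16}(?!\d)')


def find_sequences(text):
    """Return every 17-digit sequence found in text, as digits only."""
    return [re.sub(r'\D', '', m.group()) for m in PATTERN.finditer(text)]


sample = "n. 02414.2008.023.02.001 (201 ...\n# 64587.2009. \r\n 549.02.001\n"
print(find_sequences(sample))
# ['02414200802302001', '64587200954902001']
```

Because any other character (a letter, `#`, or a parenthesis) breaks the run, unrelated numbers elsewhere on the line are not merged into the sequence. The sketch can still produce false matches when unrelated numbers are separated only by spaces, dots, or hyphens, so the results are worth checking against a few files.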

Note: I noticed that your file contains the sequence "7ª tu". I believe you do not want to capture that 7, because it comes right after a complete 17-digit sequence. In that case, change the (\d) to (\d(?!ª)).
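
To illustrate the effect of that negative lookahead, here is a small sketch run on one of the example lines from the question. The function name `digits_without_ordinals` is hypothetical, used only for this illustration.

```python
import re


def digits_without_ordinals(text):
    """Return the digits in text, skipping any digit directly followed by 'ª'."""
    return re.findall(r'\d(?!ª)', text)


line = "n 00012.2010.391.02.00-0 - 7ª tu"
print(len(re.findall(r'\d', line)))            # 18: the 7 from "7ª" is included
print(''.join(digits_without_ordinals(line)))  # 00012201039102000
```

Note that the lookahead only skips the digit immediately before the "ª", so a two-digit ordinal such as "17ª" would still contribute its first digit.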

    
23.11.2017 / 23:53