Hello, I'm new to Python, and I'm probably taking a very amateur approach to this problem. I'm trying to find N-digit sequences in .txt files (in this example, N = 17). I have thousands of files, and I've been writing regular expressions for the most common kinds of occurrence I've noticed, for example:
01582.2005.012.02.00- 3 \r\n
# 01387.2009.466.02.001
# 01462. 2008. 030. 02. 000 \r\n
#0033620084610200-0 \r\n
n. 02414.2008.023.02.001 (201 ...
No. 00030.2007.084.02.00-3 (2
# 00627.2009.006.02.004
# 0001491-6020125020
number: 00028.2009.031.02.00-0
n 00012.2010.391.02.00-0 - 7ª tu
# 0000695720135020402
# 00037.2007.048.02.00-1
01113.2009.074.02.00.4. \r
proc: - 0002396-25.2011.5.02-0020
# 0163100-53-2010.5.02.0341
# 01230.2007.065.02.0.0-5 - 7 th
# 64587.2009. \r\n 549.02.001
The regular expressions I created find the sequences in about 70% of the files, but I've reached a point where each new expression I write finds so few additional sequences, relative to what I need, that it feels like counting grains of sand. Some of the regexes I used were these:
import re
text = content.read()  # read once: a second read() on the same file object returns an empty string
search = re.search(r'((\d{5})\.?\s*(\d{4})\.?\s*(\d{3})\.?\s*(\d{2})\.?\s*(\d{2})\-?\s*(\d))', text)
search = re.search(r'((\d{5})\.(\d{4})\.(\d{3})\.(\d{2})\.(\d)\-(\d{2}))', text)
search = re.search(r'((\d{5})\.(\d{4})\.(\d{3})\.(\d{2})\.(\d{2})\.(\d))', text)
They matched some of the examples I gave above, but not most of them. What I would like to know is how I can take a broader approach with my regex than I am taking now. Thank you.
Edit: One of the biggest problems is that my regexes fail when there is a line break in the middle of the sequence, or when there are spaces between the dots. Another detail that may be relevant is that the text may contain other sequences of numbers on the same line and/or on other lines.
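To make the kind of broader approach I have in mind concrete, here is a minimal sketch. It assumes the only separators that can appear inside a sequence are dots, dashes and whitespace (including line breaks), and the names N, PATTERN and find_sequences are just placeholders I made up. Instead of describing each layout, it counts digits and allows any run of separators between them:

```python
import re

N = 17  # number of digits in the sequence (assumption taken from the example above)

# A digit that is not preceded by another digit, followed by N-1 more digits,
# each optionally preceded by dots, dashes or whitespace (\s also matches \r and \n),
# and not followed directly by another digit.
PATTERN = re.compile(r'(?<!\d)\d(?:[.\-\s]*\d){%d}(?!\d)' % (N - 1))

def find_sequences(text):
    """Return every N-digit sequence found in text, with the separators removed."""
    return [re.sub(r'\D', '', m.group()) for m in PATTERN.finditer(text)]

sample = "# 64587.2009. \r\n 549.02.001"
print(find_sequences(sample))  # ['64587200954902001']
```

I am not sure this is safe: because separators are allowed anywhere, a different number sitting just before the sequence and separated from it only by spaces or a dash could be absorbed into the match, so it probably still needs some anchoring on the prefixes ("#", "n.", "No.").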