Return the substring ending at "blocos" in Python, up to the first occurrence

3

In Python, I'm trying to capture a substring that ends at the word "blocos". However, that word is repeated in the text, and I would like the substring to stop at its first occurrence. In this example, the match extends to the last occurrence instead:

import re
TXT = "Este é um texto de teste para verificar a captura de blocos que estão dentro de uma String. E agora inserimos outros blocos para confundir."
texto = re.search("teste.*blocos", TXT)
print(texto[0])
    
asked by anonymous 28.11.2018 / 17:45

1 answer

2

This happens because the * quantifier is greedy (a greedy quantifier): it tries to match as many characters as possible while still satisfying the expression.

To turn off the greediness, just put a ? right after the *:

texto = re.search("teste.*?blocos", TXT)

This captures the text only up to the first occurrence of blocos.

Because *? takes the minimum needed to satisfy the expression, it is called a lazy quantifier.
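To see the difference side by side, here is a minimal sketch that runs both patterns against the string from the question. The variable names greedy and lazy are only illustrative and do not appear in the original code.

```python
import re

TXT = ("Este é um texto de teste para verificar a captura de blocos "
       "que estão dentro de uma String. E agora inserimos outros blocos para confundir.")

# Greedy: .* consumes as much as possible, so the match ends at the LAST "blocos".
greedy = re.search("teste.*blocos", TXT)

# Lazy: .*? consumes as little as possible, so the match ends at the FIRST "blocos".
lazy = re.search("teste.*?blocos", TXT)

print(greedy[0])
# teste para verificar a captura de blocos que estão dentro de uma String. E agora inserimos outros blocos

print(lazy[0])
# teste para verificar a captura de blocos
```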

Just one detail: suppose your string is like the example below:

TXT = "teste com cablocos com blocos que tem mais blocos."
texto = re.search("teste.*?blocos", TXT)

The captured portion will be teste com cablocos. If you want only the word blocos (not cablocos), use \b to delimit the word:

TXT = "teste com cablocos com blocos que tem mais blocos."
texto = re.search(r"teste.*?\bblocos\b", TXT)

With this, the captured portion will be teste com cablocos com blocos.
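As a quick check of the word-boundary behaviour, the sketch below compares the two patterns on the same string. The names without_boundary and with_boundary are hypothetical and used only for this illustration.

```python
import re

TXT = "teste com cablocos com blocos que tem mais blocos."

# Without \b, the lazy match stops at the "blocos" inside "cablocos".
without_boundary = re.search(r"teste.*?blocos", TXT)

# With \b on both sides, "blocos" must be a whole word, so "cablocos" is skipped.
with_boundary = re.search(r"teste.*?\bblocos\b", TXT)

print(without_boundary[0])  # teste com cablocos
print(with_boundary[0])     # teste com cablocos com blocos
```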

I've now used r"teste..." to create a raw string, so the \ character does not need to be escaped. Without the r prefix, I would have to write it as \\:

# without r"...", the "\" character must be written as "\\"
texto = re.search("teste.*?\\bblocos\\b", TXT)

Since \ is used quite often in regular expressions, it is worth using raw strings to make the expression less confusing.
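To illustrate why the escaping matters, here is a small sketch comparing the three ways of writing the pattern. The names raw, escaped, and unescaped are assumptions made for this example. In an ordinary Python string, a single \b is the backspace character, not the regex word boundary.

```python
import re

TXT = "teste com cablocos com blocos que tem mais blocos."

raw = r"teste.*?\bblocos\b"         # raw string: backslashes reach the regex engine untouched
escaped = "teste.*?\\bblocos\\b"    # ordinary string: each backslash is doubled
unescaped = "teste.*?\bblocos\b"    # ordinary string: \b becomes a backspace character (\x08)

print(raw == escaped)               # True: both spell the same pattern
print("\b" == "\x08")               # True: \b in a normal string is backspace
print(re.search(raw, TXT)[0])       # teste com cablocos com blocos
print(re.search(unescaped, TXT))    # None: the text contains no backspace characters
```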

I know the correct word is "caboclos", but I could not find a better example.     

28.11.2018 / 17:51