This happens because * is a greedy quantifier: it tries to match as many characters as possible while still satisfying the expression.
To cancel the greediness, just put a ? right after the *:
texto = re.search("teste.*?blocos", TXT)
This will only capture the text up to the first occurrence of blocos.
Because *? takes the minimum needed to satisfy the expression, it is called a lazy quantifier.
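To make the difference concrete, here is a minimal sketch comparing the two quantifiers. The sample string is an assumption chosen for illustration; it only needs to contain blocos more than once.

```python
import re

# Sample string assumed for illustration: it contains "blocos" twice.
TXT = "teste com blocos que tem mais blocos."

greedy = re.search("teste.*blocos", TXT)   # .* extends up to the last "blocos"
lazy = re.search("teste.*?blocos", TXT)    # .*? stops at the first "blocos"

print(greedy.group(0))  # teste com blocos que tem mais blocos
print(lazy.group(0))    # teste com blocos
```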
One detail: if your string is like the example below:
TXT = "teste com cablocos com blocos que tem mais blocos."
texto = re.search("teste.*?blocos", TXT)
The captured portion will be teste com cablocos. If you want only the word blocos (not cablocos), use \b to delimit the word:
TXT = "teste com cablocos com blocos que tem mais blocos."
texto = re.search(r"teste.*?\bblocos\b", TXT)
With this, the captured portion will be teste com cablocos com blocos.
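If you want to check both results side by side, a small sketch along these lines should do it (the variable names are just for illustration):

```python
import re

TXT = "teste com cablocos com blocos que tem mais blocos."

# Without \b, the lazy match ends inside "cablocos".
without_boundary = re.search(r"teste.*?blocos", TXT)
# With \b on both sides, "blocos" must be a whole word.
with_boundary = re.search(r"teste.*?\bblocos\b", TXT)

print(without_boundary.group(0))  # teste com cablocos
print(with_boundary.group(0))     # teste com cablocos com blocos
```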
I've now used r"teste..." to create a raw string, so the \ character does not need to be escaped. Without the r, I would have to write it as \\:
# without r"...", the "\" character must be written as "\\"
texto = re.search("teste.*?\\bblocos\\b", TXT)
Since \ is a frequently used character in regular expressions, it is worth using raw strings to make the expression less confusing.
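As a quick illustration of why this matters: in a normal Python string, "\b" is the backspace character, not a word boundary. The sketch below (variable names are only illustrative) shows that the escaped and raw spellings are the same pattern, while the unescaped one silently matches nothing:

```python
import re

TXT = "teste com cablocos com blocos que tem mais blocos."

escaped = "teste.*?\\bblocos\\b"    # normal string, backslashes doubled
raw = r"teste.*?\bblocos\b"         # raw string, backslashes kept as typed
unescaped = "teste.*?\bblocos\b"    # here "\b" becomes a backspace character

print(escaped == raw)                # True
print(len("\b"), len(r"\b"))         # 1 2
print(re.search(unescaped, TXT))     # None, since TXT contains no backspace
```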
I know the correct word is "caboclos", but I could not find a better example.