In this case, you need a regular expression that matches the entire phrase, not just the word you want. What is a phrase?
- Something that does not contain ".", "?" or "!", and
- Something that ends with ".", "?" or "!".
So the regular expression that matches any phrase is:
[^.!?]*[.!?]
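As a quick check of that pattern, here is a minimal sketch that splits a string into phrases with re.findall. The sample string is made up for illustration and is not the text from the question.

```python
import re

# Hypothetical sample text (an assumption, not the asker's original data).
sample = "olha o batman. eu sou batman! quem sabe? sem ponto final"

# Each match is a run of non-terminator characters followed by one terminator.
phrases = re.findall(r'[^.!?]*[.!?]', sample)
print(phrases)
```

Note that the trailing fragment without a terminator is not matched, and that every phrase after the first keeps the space that follows the previous terminator.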
And to find a phrase containing the word "batman" you would use:
[^.!?]*?(batman)[^.!?]*[.!?]
The parentheses around "batman" form a capture group, so you can tell later where in the phrase the word was found. To do this, simply pass the number of the group you are interested in (1) to start and end:
import re

# words and text are the list of search terms and the input string from the question
for x in words:
    for m in re.finditer(r'[^.!?]*?(' + x + r')[^.!?]*[.!?]', text):
        print('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(0)))
Output:
07-13: olha o batman.
22-28: eu sou batman.
33-36: nao sei.
40-43: eu sei.
Note: If you want the start and end positions of the word relative to the phrase (rather than to the whole string), subtract the start position of the whole match:
print('%02d-%02d: %s' % (m.start(1)-m.start(), m.end(1)-m.start(), m.group(0)))
Output:
07-13: olha o batman.
08-14: eu sou batman.
04-07: nao sei.
03-06: eu sei.