How to do data mining in a txt file with re.finditer

5

This code tells me the positions of the words batman and sei anywhere in a txt file:

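import re

# Read the whole file; read-only mode is enough here
with open('C:/pah.txt', 'r') as f:
    text = f.read()

words = ['batman', 'sei']
for x in words:
    # Report where each occurrence of the word starts and ends
    for m in re.finditer(x, text):
        print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))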

How do I make the result also include the sentence in which the word was found?

The file pah.txt contains:

  olha o batman. eu sou batman.nao sei.eu sei.Mas e qual será a razão para eu saber nÃO sei

And the intended result should be:

10-16 olha o batman.
    
asked by anonymous 28.03.2014 / 21:21

2 answers

3

In this case, you need a regular expression that matches the entire sentence, not just the word you want. What is a sentence?

  • A run of characters that contains no . ? or ! and
  • that ends with a . ? or !

So the regular expression that matches any sentence is:

[^.!?]*[.!?]
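
As a quick check of that pattern, here is a minimal sketch that lists every sentence it matches. The sample string is an assumption modelled on the sentences quoted in this thread (it is not read from pah.txt), and the last fragment is a made-up piece with no terminator to show that such text is skipped.

```python
import re

# Assumed sample text, not read from pah.txt.
# The trailing fragment has no terminator, so the pattern cannot match it.
text = 'olha o batman. eu sou batman.nao sei.eu sei.sem ponto final'

# Zero or more non-terminators followed by exactly one terminator.
SENTENCE = re.compile(r'[^.!?]*[.!?]')

sentences = SENTENCE.findall(text)
for s in sentences:
    print(repr(s))
```

Note that leading whitespace is kept as part of each sentence, which is why the second one starts with a space.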

And to find a sentence containing the word "batman" you would use:

[^.!?]*?(batman)[^.!?]*[.!?]

The parentheses around "batman" form a capture group, so you can tell afterwards where in the sentence the word appeared. To do this, pass the number of the group you are interested in (1) to start and end:

for x in words:
    # Group 1 is the word; group 0 is the whole sentence
    for m in re.finditer('[^.!?]*?(' + x + ')[^.!?]*[.!?]', text):
        print('%02d-%02d: %s' % (m.start(1), m.end(1), m.group(0)))

Output:

07-13: olha o batman.
22-28:  eu sou batman.
33-36: nao sei.
40-43: eu sei.
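
For reference, the same idea can be wrapped in a self-contained function. This is only a sketch: the function name find_in_sentences is hypothetical, the sample string is an assumption built from the sentences in the output above, and re.escape is added on the assumption that the search words are literal text rather than regular expressions.

```python
import re

def find_in_sentences(words, text):
    """Yield (start, end, sentence) for each sentence containing one of the words."""
    for word in words:
        # re.escape makes the word match literally, even if it contains
        # characters that are special in a regular expression.
        pattern = re.compile(r'[^.!?]*?(' + re.escape(word) + r')[^.!?]*[.!?]')
        for m in pattern.finditer(text):
            # Group 1 is the word; group 0 is the whole sentence.
            yield m.start(1), m.end(1), m.group(0)

# Assumed sample text, not read from pah.txt.
text = 'olha o batman. eu sou batman.nao sei.eu sei.'
results = list(find_in_sentences(['batman', 'sei'], text))
for start, end, sentence in results:
    print('%02d-%02d: %s' % (start, end, sentence))
```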

Note: if you want the start and end positions of the word relative to the sentence (rather than to the whole string), subtract the start position of the whole match:

        print('%02d-%02d: %s' % (m.start(1)-m.start(), m.end(1)-m.start(), m.group(0)))

Output:

07-13: olha o batman.
08-14:  eu sou batman.
04-07: nao sei.
03-06: eu sei.
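
One way to convince yourself that the relative offsets are right is to slice the sentence with them and check that you get the word back. A small sketch, again using an assumed sample string rather than the real file:

```python
import re

# Assumed sample text, not read from pah.txt.
text = 'olha o batman. eu sou batman.'

offsets = []
for m in re.finditer(r'[^.!?]*?(batman)[^.!?]*[.!?]', text):
    sentence = m.group(0)
    # Shift the absolute positions by the start of the whole match.
    rel_start = m.start(1) - m.start()
    rel_end = m.end(1) - m.start()
    offsets.append((rel_start, rel_end))
    # Slicing the sentence with the relative offsets returns the word itself.
    assert sentence[rel_start:rel_end] == 'batman'
    print('%02d-%02d: %s' % (rel_start, rel_end, sentence))
```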
    
28.03.2014 / 22:07
5

To get the complete sentences, you can do this:

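import re

# Read the whole file; read-only mode is enough here
with open('C:/pah.txt', 'r') as f:
    text = f.read()

words = ['batman', 'sei']

for x in words:
    # Split on sentence terminators and keep the pieces that contain the word
    sentences = [sentence for sentence in re.split(r'[.?!]', text) if x in sentence]

    for sentence in sentences:
        print(sentence)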

The output looks like this:

olha o batman
 eu sou batman
nao sei
eu sei
Mas e qual será a razão para eu saber
nÃO sei
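
Two things to keep in mind with the split-based approach: re.split discards the terminators and the match positions, and `x in sentence` is a substring test, so a short word is also found inside longer words. The sketch below contrasts the substring test with a whole-word test using \b; the sample string is hypothetical and chosen only to show the difference.

```python
import re

# Hypothetical sample text: 'seita' contains 'sei' as a substring.
text = 'nao sei. a seita cresceu. eu sei!'
word = 'sei'

# Splitting on terminators drops them and leaves a trailing empty piece.
parts = re.split(r'[.?!]', text)

# Substring test, as in the code above: also hits 'seita'.
substring_hits = [p for p in parts if word in p]

# Whole-word test: \b only matches at word boundaries.
whole_word = re.compile(r'\b' + re.escape(word) + r'\b')
word_hits = [p for p in parts if whole_word.search(p)]

print(substring_hits)
print(word_hits)
```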

(I did not understand what the "10-16" positions in your example refer to.)

    
28.03.2014 / 21:58