Make different combinations of words within a sentence


As an example I have the following sentence:

texto = O gato subiu no telhado de 10m, um homem jogou-lhe uma pedra e ele caiu de uma altura de 20m.

I want to extract the following information:

(O gato subiu 10m, O gato caiu 20m)

I tried:

(gato).*(subiu|caiu).*(?=m)

But it only returned:

gato subiu 10m .

I can also use:

>>search_1=re.findall(re.compile('gato.*subiu.*(?=m)'),texto)

>>search_1=[gato subiu 10]

>>search_2=re.findall(re.compile('gato.*caiu.*(?=m)'),texto)

>>search_2=[gato caiu 20]

and then join the two lists.

But I still believe there should be a more optimized way to write this in just one line of code.

The sentences always follow this order: [gato / word / number followed by "m"]

    
asked by anonymous 12.04.2017 / 22:17

1 answer


This cannot be done with a single expression using Python's re module
(although it is possible with the regex module, created by Matthew Barnett, using \G).

  

Addendum (Guilherme Lautert):

The reason you cannot use the same regex for both cases is that a regex is used to find / replace, and that logic creates a problem here.

Take a look at the sentences you want:

  • O gato subiu 10m
  • O gato caiu 20m

You want to capture O gato twice, but it appears only once in the text; the second occurrence is implied by "ele". View on regex101. That is, O gato has already been consumed by the first match, so it is not captured again.
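
A quick way to see this is the minimal sketch below. It assumes a simplified, non-greedy variant of the pattern from the question; the name `padrao` is only illustrative and is not part of the original answer. Because "gato" occurs once in the text, `findall` can produce only one match, and the "caiu ... 20m" clause is left without a subject.

```python
import re

texto = "O gato subiu no telhado de 10m, um homem jogou-lhe uma pedra e ele caiu de uma altura de 20m."

# Hypothetical single-pattern attempt: subject, verb and height in one expression.
padrao = re.compile(r"(gato).*?(subiu|caiu).*?(\d+m)")

# The search resumes after "10m", where there is no further "gato" to anchor a second match.
encontrados = padrao.findall(texto)
print(encontrados)  # [('gato', 'subiu', '10m')]
```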

    Use two expressions: one to match the subject, another to match the verb and the number, starting each match where the previous one ended.

    Subject:

    \b(gato|coelho)\b
    

    Sentence:

    [^\n.]*?\b(subiu|caiu)\b[^\n.,]*?(\d+m\b)
    
    • [^\n.]*? - negated character class that matches any character except newlines or periods (that is, it stays within the same sentence), with a non-greedy quantifier for the shortest possible match.
    • \b(subiu|caiu)\b - group 1, capturing the verb.
    • [^\n.,]*? - more characters, except newlines, periods, or commas.
    • (\d+m\b) - group 2, capturing the number followed by "m".
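
The pieces above can be sketched with `re.VERBOSE`, which allows the same pattern to carry its explanation inline. This is only an illustration: the names `inicio`, `primeira` and `segunda` are assumptions introduced here, and the start position is computed by hand rather than by the subject pattern.

```python
import re

sentenca_re = re.compile(r"""
    [^\n.]*?            # any characters except newline or period, as few as possible
    \b(subiu|caiu)\b    # group 1: the verb
    [^\n.,]*?           # more characters, excluding newline, period and comma
    (\d+m\b)            # group 2: the number followed by "m"
""", re.IGNORECASE | re.VERBOSE)

texto = "O gato subiu no telhado de 10m, um homem jogou-lhe uma pedra e ele caiu de uma altura de 20m."

# Start right after the subject, then continue from the end of the previous match.
inicio = texto.index("gato") + len("gato")
primeira = sentenca_re.match(texto, inicio)
segunda = sentenca_re.match(texto, primeira.end())

print(primeira.groups())  # ('subiu', '10m')
print(segunda.groups())   # ('caiu', '20m')
```

Passing a start position to `match()` requires the clause to begin exactly there, which is what lets consecutive clauses be chained one after another.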


    Code

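    import re
    
    # Pattern for the subject; the alternation lists the subjects to look for.
    sujeito_re  = re.compile(r"\b(gato|coelho)\b", re.IGNORECASE)
    # Pattern for one "verb ... number + m" clause within the same sentence.
    sentenca_re = re.compile(r"[^\n.]*?\b(subiu|caiu)\b[^\n.,]*?(\d+m\b)", re.IGNORECASE)
    resultado = ()
    
    texto = "O gato subiu no telhado de 10m, um homem jogou-lhe uma pedra e ele caiu de uma altura de 20m."
    
    
    for sujeito in sujeito_re.finditer(texto):
        # Start looking for clauses right after the subject.
        pos = sujeito.end()
        while True:
            # match() with a start position succeeds only if a clause begins exactly at pos.
            sentenca = sentenca_re.match(texto, pos)
            if not sentenca:
                break
            resultado += (sujeito.group(1) + " " + sentenca.group(1) + " " + sentenca.group(2),)
            # Continue from where the last clause ended.
            pos = sentenca.end()
    
    print(resultado)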
    

    Result:

    ('gato subiu 10m', 'gato caiu 20m')
    


        
    13.04.2017 / 19:11