Search a text for a sequence of words taken from an unordered list

3

Given an unordered list of words, is there a way to search a text for a sequence of them?

Example:

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]

texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

Match -> "o dia está bonito"

I can find all the words from the list that occur in the text, but they do not come out in the order in which they appear in the text:

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

frase = []
for palavras in lista:
    if palavras in texto:
        frase.append(palavras)

print(' '.join(frase))

Output:

dia tarde é está bonito o a

Even "a" shows up, and I do not know why!

    
asked by anonymous 18.10.2017 / 14:30

1 answer

3
  

Even "a" shows up, and I do not know why!

The code as written goes through every word in lista and checks whether it exists in the text. It does not have to exist as a separate word; it only has to occur somewhere, even in the middle of another word, and that is why a appears:

texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."
# the 'a' is here--^---

When both operands are strings, Python's in operator checks whether the left-hand string occurs anywhere inside the right-hand one, that is, as a substring.
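A quick check makes the difference visible. This is only an illustrative sketch using the text from the question: on a string, in looks for a substring, while on a list it looks for an element equal to the value.

```python
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

# On a string, `in` tests for a substring, so the "a" inside "sábado" counts.
a_no_texto = "a" in texto
print(a_no_texto)     # True

# On a list, `in` tests for an element equal to the value.
# None of the whitespace-separated tokens is exactly "a".
a_nos_tokens = "a" in texto.split()
print(a_nos_tokens)   # False
```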

For your purpose, simply invert the logic of the for loop: go through the text word by word and check whether each word exists in the list. This not only solves the problem with a but also guarantees that the words come out in the order of the text:

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

frase = []
for palavras in texto.split(' '):  # now iterate over the text, split into words
    if palavras in lista:  # check whether each word of the text is in the list
        frase.append(palavras)

print(' '.join(frase))

See the Ideone example

Note that splitting on spaces leaves punctuation such as . and , attached to the words, producing tokens like bonito. or tarde. that the code then fails to find in the list.
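To see the effect, one can inspect the tokens that split(' ') actually produces. The snippet below is a small sketch; the name com_pontuacao is only illustrative and not part of the original code.

```python
lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

palavras = texto.split(' ')

# Tokens that still carry a trailing '.' or ','
com_pontuacao = [p for p in palavras if p.endswith(('.', ','))]
print(com_pontuacao)       # ['sábado,', 'bonito.', 'tarde.']

# 'bonito.' is not equal to 'bonito', so the membership test fails
print('bonito.' in lista)  # False
print('bonito' in lista)   # True
```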

You can work around this problem in several ways. One of the simplest is to remove these characters before analyzing the text:

texto2 = texto.replace('.', '').replace(',', '')

See on Ideone how the analysis looks with this change

You can also do something more generic: define the punctuation marks to remove and strip them with a custom function:

def retirar(texto, caracteres):
    # remove every occurrence of each character in caracteres
    for c in caracteres:
        texto = texto.replace(c, '')

    return texto

And now use this function over the original text:

texto2 = retirar(texto, ".,")

See this example on Ideone
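Putting the pieces together, the sketch below removes the punctuation, keeps the list words in text order, and then goes one step further to recover the sequence asked about in the question. That last step is an assumption about what "sequence" means here (the longest run of consecutive list words); it uses itertools.groupby, and the names sequencias and maior are only illustrative.

```python
from itertools import groupby

lista = ["dia", "noite", "tarde", "é", "está", "bonito", "o", "a", "muito", "feio"]
texto = "Hoje é sábado, vamos sair pois o dia está bonito. Até mais tarde."

def retirar(texto, caracteres):
    # remove every occurrence of each character in caracteres
    for c in caracteres:
        texto = texto.replace(c, '')
    return texto

palavras = retirar(texto, ".,").split()

# All list words, in the order they appear in the text
frase = [p for p in palavras if p in lista]
print(' '.join(frase))   # é o dia está bonito tarde

# Assumed extension: group consecutive words by whether they are in the list,
# keep only the groups of list words, and take the longest one
sequencias = [
    list(grupo)
    for pertence, grupo in groupby(palavras, key=lambda p: p in lista)
    if pertence
]
maior = max(sequencias, key=len, default=[])
print(' '.join(maior))   # o dia está bonito
```

The comparison is case-sensitive, so a capitalized word at the start of a sentence would only match if the list contains it in that form.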

    
18.10.2017 / 14:54