Delete part of text with Python

2

How do I delete part of a text in Python?

I have the following string:

 """ 
    texto textinho 
    outro texto

    <div dir 'ltr'><div><div> bla bla ....

 """

I want to delete the entire HTML part.

I'm using Python 2.7.

    
asked by anonymous 08.08.2014 / 22:55

3 answers

1

You can use a regular expression to erase everything between the markers "<" and ">" at once.

>>> string = """ 
...     texto textinho 
...     outro texto
... 
...     <div dir 'ltr'><div><div> bla bla ....
... 
...  """
>>> 
>>> import re
>>> print re.sub(r"<.+?>", "", string)

    texto textinho 
    outro texto

     bla bla ....

Note in particular the substitution with "" (the empty string) and the use of ? in the regular expression, which makes the match non-greedy: it stops at the first ">" it finds, that is, at the end of the first tag. Without it, the expression would consume all the text from the opening of the first tag to the closing of the last tag on that line.
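The difference between the two forms is easiest to see on a line with text between the tags. The snippet below is a minimal sketch; the sample string and the variable names `lazy` and `greedy` are made up for illustration and are not part of the question.

```python
import re

# Hypothetical sample line with text between the tags (not from the question).
text = "texto <div><b>textinho</b></div> outro"

# Non-greedy: each match ends at the nearest ">", so tags are removed one by one
# and the text between them survives.
lazy = re.sub(r"<.+?>", "", text)

# Greedy: a single match runs from the first "<" to the last ">" on the line,
# so the text between the tags is removed as well.
greedy = re.sub(r"<.+>", "", text)

print(repr(lazy))    # 'texto textinho outro'
print(repr(greedy))  # 'texto  outro'
```

The single-argument `print(...)` form is used so the snippet should behave the same on Python 2.7 and Python 3.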

    
09.08.2014 / 15:45
0

I solved it with the following regex:

import re
# [\w|\W]* matches any character, newlines included, so the match runs to the end of the text
m = re.findall(r"[<][\w|\W]*[>]*", str(corpo))

for i in m:
    corpo = corpo.replace(i, "")

This deletes EVERYTHING from the first "<" onward, including the text between and after the tags:

    <ANYTHING> THIS TOO <OOPS, THIS TOO> 
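As a rough check of that behavior, here is a small sketch that runs the same pattern on a made-up input. The sample text is hypothetical; `corpo` only mirrors the variable name used above.

```python
import re

# Hypothetical sample body; only the name "corpo" comes from the answer.
corpo = "texto textinho\n<div>bla</div> trailing text\nlast line"

# [\w|\W]* matches any character, newlines included, and is greedy,
# so the single match starts at the first "<" and runs to the end of the string.
matches = re.findall(r"[<][\w|\W]*[>]*", corpo)

for match in matches:
    corpo = corpo.replace(match, "")

print(repr(corpo))  # 'texto textinho\n'
```

Anything that appears after the first "<" is lost, so this approach only fits when the markup is known to come at the end of the text.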

Thanks for the help.

    
11.08.2014 / 19:15
-1

I did it this way:

import re

string = """ 
    texto textinho 
    outro texto

    <div dir 'ltr'><div><div> bla bla ....

 """

r = re.search(r"[<].*[>]", string)

# returns "<div dir 'ltr'><div><div>"
r.group(0)

result = string.replace(r.group(0), "")

# result will contain ' \n    texto textinho \n    outro texto\n\n     bla bla ....\n\n '
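One limitation worth noting: `re.search` returns only the first match, and `.` does not cross newlines by default, so markup on later lines is left in place. The sketch below compares this with `re.sub` using the same pattern. The two-line input and the names `once` and `everywhere` are assumptions made for illustration.

```python
import re

# Hypothetical two-line input with markup on both lines.
text = "first <div><p> line\nsecond <span> line\n"

# search() finds only the first run of tags (on the first line),
# so only that run is removed by replace().
first = re.search(r"<.*>", text)
once = text.replace(first.group(0), "")

# sub() applies the same pattern to every match in the string.
everywhere = re.sub(r"<.*>", "", text)

print(repr(once))        # 'first  line\nsecond <span> line\n'
print(repr(everywhere))  # 'first  line\nsecond  line\n'
```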
    
08.08.2014 / 23:06