How do I delete part of a text in Python?
I have the following string
"""
texto textinho
outro texto
<div dir 'ltr'><div><div> bla bla ....
"""
I want to delete the whole HTML part.
I'm using Python 2.7.
You can use a regular expression to erase everything between the markers "<" and ">" at once.
>>> string = """
... texto textinho
... outro texto
...
... <div dir 'ltr'><div><div> bla bla ....
...
... """
>>>
>>> import re
>>> print re.sub(r"<.+?>", "", string)
texto textinho
outro texto
bla bla ....
Notice in particular the substitution by "" (the empty string) and the use of ? in the regular expression, which makes the match non-greedy: it stops at the first > it finds, that is, at the end of the first tag. Otherwise the expression would take all the text from the opening of the first tag to the closing of the last one.
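To see the difference the ? makes, here is a small sketch comparing the greedy and non-greedy patterns on a single line. The sample string and the variable names are made up for illustration, shortened from the string in the question.

```python
import re

# Hypothetical sample line, shortened from the string in the question.
sample = "<div dir 'ltr'><div><div> bla bla"

# Greedy: .+ runs to the LAST ">" on the line, so all three tags match as one.
greedy = re.findall(r"<.+>", sample)

# Non-greedy: .+? stops at the FIRST ">" it can, so each tag matches separately.
lazy = re.findall(r"<.+?>", sample)

# Replacing every non-greedy match with "" leaves only the text.
cleaned = re.sub(r"<.+?>", "", sample)
```

On this sample both patterns would clean the line the same way, since the tags are adjacent. The greedy version starts eating real text as soon as something sits between two tags on the same line.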
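I solved it with the following regex:
import re
# [\w|\W] matches any character (newlines included), so the match
# runs from the first "<" to the end of the text.
m = re.findall(r"[<][\w|\W]*[>]*", str(corpo), re.IGNORECASE)
for i in m:
    corpo = corpo.replace(i, "")
Note that this deletes EVERYTHING from the first < onward, including the text between the tags:
<ANYTHING> THIS TOO <OOPS, THIS TOO>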
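To make that behaviour concrete, here is a minimal sketch on a made-up string (corpo below is a hypothetical sample, not the original data). Because the character class matches any character and the * is greedy, the single match runs from the first < to the end of the text.

```python
import re

# Hypothetical sample standing in for the real text.
corpo = "keep this\n<b>bold</b> and trailing text\n"

# One match only: it starts at the first "<" and runs to the end of the string.
matches = re.findall(r"[<][\w|\W]*[>]*", corpo)

# Removing that match drops the tags AND all the text after the first "<".
for m in matches:
    corpo = corpo.replace(m, "")
```

This is fine when the HTML is always a trailing block, as in the question, but it is not suitable if text after the tags needs to be kept.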
Thanks for the help.
I did it this way:
import re
string = """
texto textinho
outro texto
<div dir 'ltr'><div><div> bla bla ....
"""
r = re.search("[<].*[>]",string)
# returns "<div dir 'ltr'><div><div>"
r.group(0)
result = string.replace(r.group(0),"")
# result will contain ' \n texto textinho \n outro texto\n\n bla bla ....\n\n '
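One caveat with this approach: re.search returns only the first match, and . does not cross newlines, so tags on later lines are left in place. The sketch below uses a made-up multi-line string (text and the other names are hypothetical) to compare it with re.sub, which replaces every match.

```python
import re

# Hypothetical sample with tags on two different lines.
text = "first line\n<div><p> one\n<span> two\n"

# re.search finds only the first match, which ends at the last ">" on that line.
first = re.search("[<].*[>]", text).group(0)
partial = text.replace(first, "")

# re.sub applies the same pattern to every match in the string.
full = re.sub("[<].*[>]", "", text)
```

Here partial still contains the <span> tag, while full does not. The pattern is still greedy within a line, so any text sitting between two tags on the same line would be removed as well.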