You can use the expression &#8220;(\w.+)&#8221;
which matches a word character (\w: a letter, digit, or underscore) followed by one or more of any character (.+), capturing whatever sits between &#8220; and &#8221;.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>
<p>&#8220;Linha 3 &#8221;</p>
"""
# Raw string so that \w reaches the regex engine unchanged.
regex = re.compile(r"&#8220;(\w.+)&#8221;", re.MULTILINE)
matches = regex.findall(dados)
if matches:
    print(matches)
# Output: ['Linha 1', 'Linha 2', 'Linha 3 ']
As you can see, findall returns a list. To access a specific value, index into it:
print(matches[0])
# Output: Linha 1
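One caveat worth illustrating: .+ is greedy, so if two quoted phrases share a line, the capture runs to the last closing quote. The sketch below uses a made-up input line (not from the question) and assumes the text contains the HTML entities &#8220; and &#8221;. It compares the greedy pattern with a lazy variant, .+?, which stops at the first closing quote.

```python
import re

# Hypothetical input: two quoted phrases on the same line.
line = "<p>&#8220;Linha 1&#8221; e &#8220;Linha 2&#8221;</p>"

# Greedy: .+ consumes as much as possible, then backtracks to the LAST &#8221;.
greedy = re.findall(r"&#8220;(\w.+)&#8221;", line)

# Lazy: .+? consumes as little as possible, stopping at the FIRST &#8221;.
lazy = re.findall(r"&#8220;(\w.+?)&#8221;", line)

print(greedy)  # ['Linha 1&#8221; e &#8220;Linha 2']
print(lazy)    # ['Linha 1', 'Linha 2']
```

For input with one quoted phrase per line, as in the example above, both patterns give the same result.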
Note: Regular expressions are not recommended for handling HTML/XML structures. The right tool is a parser such as BeautifulSoup, which works very well for scraping.
See an example:
#!/usr/bin/env python
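# Python 3 equivalents of the original urllib2 / BeautifulSoup 3 imports.
from urllib.request import urlopen

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

url = 'http://pt.stackoverflow.com'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Print the text and target of every link found inside a list item.
for li in soup.find_all('li'):
    for a in li.find_all('a', href=True):  # skip anchors without an href
        print("%-45s: %s" % (a.text, a['href']))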
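The parser approach can also be applied to the sample paragraphs from the question. As a minimal sketch, the version below swaps BeautifulSoup for the standard library's html.parser.HTMLParser so it runs without third-party packages. The ParagraphCollector class and the linhas variable are hypothetical names introduced for illustration, and the input is assumed to use the &#8220; and &#8221; entities.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collects the text content of every <p> element."""

    def __init__(self):
        # convert_charrefs defaults to True, so &#8220; and &#8221; reach
        # handle_data already decoded to curly quote characters.
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Accumulate text only while inside a <p> element.
        if self.in_p:
            self.paragraphs[-1] += data


dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>
<p>&#8220;Linha 3 &#8221;</p>
"""

parser = ParagraphCollector()
parser.feed(dados)
parser.close()

# Strip the surrounding curly quotes that the entities decode to.
linhas = [p.strip("\u201c\u201d") for p in parser.paragraphs]
print(linhas)  # ['Linha 1', 'Linha 2', 'Linha 3 ']
```

Because the parser decodes the entities itself, no pattern has to depend on how the quotes are encoded in the markup.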