Scraping with Python


How to capture one or more sentences from a website with Python and / or regular expressions?

I want everything that starts with

<p>&#8220; and ends with &#8221;</p>

Example:

<p>&#8220;frasefrasefrasefrasefrasefrasefrasefrase.&#8221;</p>

How to proceed?

    
asked by anonymous 26.01.2015 / 23:47

1 answer


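You can use the expression #8220;(\w.+)&#8221 to capture the text between &#8220; and &#8221;. In the group, \w matches one word character (a letter, digit, or underscore) and .+ matches one or more of any character except a newline. Because .+ is greedy, this works as long as each quotation sits on its own line.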

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>

<p>&#8220;Linha 3 &#8221;</p>
"""

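# Raw string so that \w reaches the regex engine unchanged.
# re.MULTILINE is not needed: it only affects ^ and $, which the pattern does not use.
regex = re.compile(r"#8220;(\w.+)&#8221")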
matches = regex.findall(dados)

if matches:
    print(matches)
# Output: ['Linha 1', 'Linha 2', 'Linha 3 ']

As you can see, findall returns a list. To access a specific value, index it:

print(matches[0])
# Output: Linha 1
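
If several quotations can share a line, or one quotation can span more than one line, a lazy variant of the same idea may be safer. This is only a sketch: the sample input below is made up for illustration, and the pattern assumes each quotation is wrapped exactly as <p>&#8220;...&#8221;</p>.

```python
import re

# Hypothetical input: two quotations on one line, and one spanning two lines.
dados = """<p>&#8220;Frase 1&#8221;</p><p>&#8220;Frase 2&#8221;</p>
<p>&#8220;Frase 3
continua&#8221;</p>
"""

# .+? is lazy, so each match stops at the first closing entity it finds.
# re.DOTALL lets . match newlines, so a quotation may span several lines.
regex = re.compile(r"<p>&#8220;(.+?)&#8221;</p>", re.DOTALL)
matches = regex.findall(dados)

print(matches)
# Output: ['Frase 1', 'Frase 2', 'Frase 3\ncontinua']
```

With the greedy .+ the first two quotations would be returned as a single match, because the group would extend to the last &#8221; on the line.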


Note: Regular expressions are not recommended for parsing HTML or XML. The right tool is a parser such as BeautifulSoup, which works very well for scraping.

See an example:

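#!/usr/bin/env python3

# Requires the third-party package beautifulsoup4 (pip install beautifulsoup4).
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = 'http://pt.stackoverflow.com'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Print the text and target of every link found inside a list item.
for li in soup.find_all('li'):
    for a in li.find_all('a', href=True):  # skip anchors without an href
        print("%-45s: %s" % (a.text, a['href']))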
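
Applied to the question itself, the parser approach could look roughly like the sketch below. To keep it runnable without installing anything, it swaps BeautifulSoup for the standard library's html.parser. The class name QuotedParagraphParser, the helper extract_quotes, and the sample input are all hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class QuotedParagraphParser(HTMLParser):
    """Collect the text of <p> elements wrapped in curly double quotes."""

    def __init__(self):
        # convert_charrefs defaults to True, so &#8220; arrives as \u201c.
        super().__init__()
        self.in_p = False
        self.buffer = []
        self.quotes = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.buffer = []

    def handle_data(self, data):
        # Text inside nested tags such as <b> is collected as well.
        if self.in_p:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            text = "".join(self.buffer).strip()
            # Keep only paragraphs that start with \u201c and end with \u201d.
            if len(text) >= 2 and text[0] == "\u201c" and text[-1] == "\u201d":
                self.quotes.append(text[1:-1])


def extract_quotes(html):
    parser = QuotedParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.quotes


# Hypothetical input for illustration.
dados = """<p>&#8220;Linha 1&#8221;</p>
<p>Sem aspas</p>
<p>&#8220;Linha <b>2</b>&#8221;</p>
"""

print(extract_quotes(dados))
# Output: ['Linha 1', 'Linha 2']
```

Unlike the regular expression, this version does not depend on how the quotation marks are encoded in the source or on whether the paragraph contains nested tags.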
    
27.01.2015 / 01:05