Scraping with Python

Question

How can I capture one or more sentences from a website with Python and/or regular expressions?

I want everything that starts with <p>&#8220; and ends with &#8221;</p>

Example:

<p>&#8220;frasefrasefrasefrasefrasefrasefrasefrase.&#8221;</p>

How should I proceed?

python regex

asked by anonymous 26.01.2015 / 23:47

1 answer

score 4 · Accepted Answer

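You can use the expression #8220;(\w.+)&#8221 which captures everything between #8220; and &#8221 as long as it starts with a word character (a letter, digit, or underscore). The .+ part then matches one or more of any character except a newline, so each match stays on a single line.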

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>

<p>&#8220;Linha 3 &#8221;</p>
"""

# Raw string, so the backslash in \w reaches the regex engine unchanged
regex = re.compile(r"#8220;(\w.+)&#8221", re.MULTILINE)
matches = regex.findall(dados)

if matches:
    print(matches)
# Output: ['Linha 1', 'Linha 2', 'Linha 3 ']

As you can see, findall returns a list. To access a specific value, index into it:

print(matches[0])
# Output: Linha 1
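
The same idea can be anchored to the full markers given in the question. This is a minimal sketch, not part of the original answer: it assumes each quoted sentence sits inside a single <p> element, and the names QUOTE_RE and extract_quotes are hypothetical. A non-greedy capture stops at the first closing marker, and html.unescape decodes any entities left inside the captured text.

```python
import html
import re

# Non-greedy capture between the exact opening and closing markers from the question
QUOTE_RE = re.compile(r"<p>&#8220;(.*?)&#8221;</p>", re.DOTALL)


def extract_quotes(page):
    """Return the text of every <p>&#8220;...&#8221;</p> fragment in page."""
    # Decode entities such as &amp; that remain inside the captured text
    return [html.unescape(m) for m in QUOTE_RE.findall(page)]


dados = """<p>&#8220;Linha 1&#8221;</p>
<p>&#8220;Linha 2&#8221;</p>

<p>&#8220;Linha 3 &#8221;</p>
<p>sem aspas</p>
"""

print(extract_quotes(dados))
# Output: ['Linha 1', 'Linha 2', 'Linha 3 ']
```

The paragraph without the quote markers is skipped because the pattern requires both the opening and the closing entity.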

Note: Regular expressions are not recommended for parsing HTML/XML structures. The better approach is to use a parser such as BeautifulSoup, which is well suited to scraping.

See an example:

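#!/usr/bin/env python

from urllib.request import urlopen

from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

url = 'http://pt.stackoverflow.com'
html = urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')

# Print the text and target of every link found inside a list item
for li in soup.find_all('li'):
    for a in li.find_all('a', href=True):  # skip anchors without an href
        print("%-45s: %s" % (a.text, a['href']))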
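
For the markup in the question itself, a parser-based approach might look like the sketch below. It deliberately swaps BeautifulSoup for the standard library's html.parser so that it runs without any third-party install. The class name QuoteParser is hypothetical, and the sketch assumes the sentences of interest are <p> elements whose text begins with “ (&#8220;) and ends with ” (&#8221;).

```python
from html.parser import HTMLParser


class QuoteParser(HTMLParser):
    """Collect the text of <p> elements wrapped in curly double quotes."""

    def __init__(self):
        # convert_charrefs defaults to True, so &#8220; arrives already decoded
        super().__init__()
        self.quotes = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer)
            self._buffer = None
            # Keep only paragraphs that start with “ and end with ”
            if len(text) >= 2 and text[0] == "\u201c" and text[-1] == "\u201d":
                self.quotes.append(text[1:-1])


parser = QuoteParser()
parser.feed('<p>&#8220;Linha 1&#8221;</p>\n<p>sem aspas</p>\n<p>&#8220;Linha 3 &#8221;</p>')
parser.close()
print(parser.quotes)
# Output: ['Linha 1', 'Linha 3 ']
```

Because the parser decodes the entities before the comparison, the check works on the actual quote characters rather than on the raw &#8220; and &#8221; strings.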