Render a specific part of a page

5

I'm using the following code to render a webpage:

import dryscrape

# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://www.google.com.br')

# load images, since the goal is to render the doodle
sess.set_attribute('auto_load_images', True)

# visit
sess.visit("/")

sess.render("google.png")

However, I would like to render only part of the page. For example, on the Google site, I would like to render only the doodle ( <div id=dood class=cta> ).

I tried to replace the last line with:

sess.at_css('.cta').render("google.png")

But this is not allowed. Does anyone know a way to do this?

    
asked by anonymous 26.05.2015 / 16:08

2 answers

1

If dryscrape is not a requirement of the solution, you can combine HTTP requests to Google with text processing via regex.

The idea is to read the Google page, find the address of the doodle (via regex), build the final URL, download the file and save it to disk:

# -*- coding: utf-8 -*-
from urllib2 import urlopen
import re


response = urlopen('http://www.google.com.br/').read()
m = re.search(
    r'background:url\(([^)]+)\).+id="hplogo"',
    response
)

final_url = 'http://www.google.com.br{}'.format(m.group(1))
print 'Downloading {}'.format(final_url)

image = urlopen(final_url).read()
with open('google.png', 'wb') as f:
    f.write(image)

You may need to check the Content-Type of the image before saving it to disk, since the doodle may not be a PNG.
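
A minimal sketch of that check, assuming the Content-Type value has already been read from the response headers (with urllib2, the object returned by `info()` on the response holds them). The helper `extension_for` and the table `KNOWN_IMAGE_TYPES` are hypothetical names introduced only for this illustration:

```python
# Hypothetical helper: map a Content-Type header value to a file extension.
KNOWN_IMAGE_TYPES = {
    'image/png': '.png',
    'image/gif': '.gif',
    'image/jpeg': '.jpg',
    'image/webp': '.webp',
}


def extension_for(content_type, default='.bin'):
    """Return a file extension for a Content-Type value such as 'image/gif'."""
    # Drop optional parameters, e.g. 'image/png; charset=binary' -> 'image/png'.
    media_type = content_type.split(';')[0].strip().lower()
    return KNOWN_IMAGE_TYPES.get(media_type, default)


# Build the output file name from the header instead of hard-coding '.png'.
filename = 'google' + extension_for('image/gif')
print(filename)
```

The explicit table keeps the mapping predictable; unknown types fall back to a neutral extension rather than being saved as a PNG.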

This approach carries a risk: if Google changes the layout of the page, your regex will most likely stop matching, and you will have to rewrite it.
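
One way to make that failure visible is to check whether the regex matched before using the result. The sketch below reuses the pattern from the code above; the function name `find_doodle_path` and both HTML snippets are made up for illustration and are not Google's actual markup:

```python
import re

# Same pattern as in the answer above.
DOODLE_PATTERN = re.compile(r'background:url\(([^)]+)\).+id="hplogo"')


def find_doodle_path(html):
    """Return the doodle path captured by the regex, or None if absent."""
    m = DOODLE_PATTERN.search(html)
    return m.group(1) if m else None


# Hypothetical markup, for illustration only (not Google's actual HTML).
sample = '<div style="background:url(/logos/sample.png) no-repeat" id="hplogo"></div>'
changed = '<div id="some-other-logo"></div>'

for page in (sample, changed):
    path = find_doodle_path(page)
    if path is None:
        print('Doodle not found; the page layout may have changed')
    else:
        print('Doodle path: ' + path)
```

Without such a check, `m.group(1)` raises an AttributeError when `re.search` returns None, which hides the real cause of the problem.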

    
29.05.2015 / 06:19
1

@drgarcia1986's solution is what I would try, but if you really want to use dryscrape (perhaps because many Google doodles are animated and use Flash / HTML5?), one option is to edit the main page's HTML somehow so that only the doodle is left. If you find a way to make dryscrape open an HTML file that you generated, you can try something like this using BeautifulSoup:

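from bs4 import BeautifulSoup

soup = BeautifulSoup(codigo_html_do_google, 'html.parser')  # page HTML as a string
doodle = soup.find(class_='cta').extract()  # detach the doodle element
soup.body.clear()                           # drop the rest of <body>
soup.body.append(doodle)                    # put only the doodle back
codigo_html_simplificado = str(soup)        # simplified HTML of the page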

(You may need to be careful not to destroy other elements of the page, such as scripts, but this is the general idea.)

    
03.06.2015 / 16:55