Web Scraping with Python


Good evening. I want to write a simple script to extract data from a website ( link ). I've already written part of the code using the library:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.riooilgas.com.br/?_page=programacao&_menu=programacao")
res = BeautifulSoup(html.read(), "html5lib")
tags = res.findAll(text="Evento Paralelo - O&G Techweek")
print(tags)

I just want the day and time information for the "Evento Paralelo - O&G Techweek" events.

Can anyone help me?

Thank you

    
asked by anonymous 30.08.2018 / 05:48

1 answer


Unfortunately, the elements you're looking for are not in the site's HTML code; they are generated dynamically by JavaScript after the page loads in a browser. Because BeautifulSoup does not execute JavaScript, you cannot extract this data directly the way you started in your code.

One option for this kind of site is to inspect the page's JavaScript code, figure out what it does, and "simulate" it with hand-written Python code. This solution is usually more efficient, but much more complex to implement.

In the specific case of the site you mentioned, it looks like the data is embedded in the JavaScript itself, as you can see here:

>>> import requests
>>> r = requests.get('http://assets.tuut.com.br/rog-pages/public/script-programacao-main.js?v=23')
>>> data = r.text
>>> data[30:100]
'po de Evento":"Congresso",Bloco:"",Categoria:"Credenciamento","Hor\xe1rio'

As you can see, the format is similar to JSON, but it is not exactly JSON: these are JavaScript variable definitions containing the data. Python's json module would not work here because the text is not valid JSON; fortunately, the demjson module can decode JSON-like formats such as this one. Using demjson:

>>> d1 = data[data.find('['):data.find(']')+1]
>>> import demjson
>>> eventos = demjson.decode(d1)

Now we have a Python object containing the events, one per element:

>>> for evento in eventos:
...     print(evento['Nome do evento'], 'as', evento['Horário'], 'em', evento['Lugar'])
Credenciamento as 8:00 às 17:00 em Pavilhão 1
Cerimônia de Abertura as 9:30 às 11:00 em Pavilhão 5
SP 1: A nova geopolítica do petróleo e gás as 11:10 às 12:10 em Pavilhão 5
Os desafios e oportunidades do setor de Upstream num mundo em Transição Energética as 12:25 às 13:40 em Pavilhão 5
SE 01: 40 anos da Bacia de Campos: o que vem pela frente as 14:00 às 16:00 em Pavilhão 5
SE 02: Comércio irregular de combustíveis e seus impactos – programa Combustível Legal as 14:00 às 16:00 em Pavilhão 5
...
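
Putting it all together, here is a minimal sketch that answers the original question by keeping only the events whose name mentions "Evento Paralelo - O&G Techweek" (the URL and the field name 'Nome do evento' are the ones shown above; printing the whole dictionary avoids guessing any other field names):

import requests
import demjson

url = 'http://assets.tuut.com.br/rog-pages/public/script-programacao-main.js?v=23'
data = requests.get(url).text

# Slice out the first [...] block, which holds the list of events
d1 = data[data.find('['):data.find(']') + 1]
eventos = demjson.decode(d1)

for evento in eventos:
    if 'O&G Techweek' in evento.get('Nome do evento', ''):
        # Print every field of the matching event (name, time, place, etc.)
        print(evento)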

As you can see, extracting the data from this site was easy, because it already came structured in an organized way inside the JavaScript code. It is not always that easy, though: dynamic websites with increasingly jumbled and obscure JavaScript code are more and more common. That is where the other alternative for scraping this kind of site comes in: Selenium. Selenium is a library that lets you control a real browser, such as Chrome or Firefox, from Python, so the page's JavaScript is actually executed. It is far less efficient, however, because you are running an entire browser.
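
For reference, a minimal Selenium sketch, assuming a ChromeDriver is installed (the lookup itself mirrors your original BeautifulSoup code; the difference is that the page's JavaScript has actually run before the HTML is parsed):

import time
from selenium import webdriver
from bs4 import BeautifulSoup

# Assumes a matching ChromeDriver is available on the PATH
driver = webdriver.Chrome()
driver.get("http://www.riooilgas.com.br/?_page=programacao&_menu=programacao")
time.sleep(5)  # crude wait for the JavaScript to render; an explicit wait would be better

# page_source now contains the rendered HTML, which BeautifulSoup can parse as usual
soup = BeautifulSoup(driver.page_source, "html5lib")
tags = soup.findAll(text="Evento Paralelo - O&G Techweek")
print(tags)

driver.quit()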

    
30.08.2018 / 23:05