Website with hidden HTML

I need to extract the sales data for new cars on some websites.

One of the sites belongs to the company Locamerica. However, the content I need to extract does not appear in the page's HTML.

I need to extract the data for each car listed on the page, but the cars do not appear in the HTML. Not even the links to the individual car pages appear.

I downloaded the source code and opened it: it shows the same site, but without any cars. Link to the HTML I get.

I'm programming in Python, using Requests to get the page's HTML and Beautiful Soup to extract the data I need.

The code:

import requests as req
from bs4 import BeautifulSoup as bs

url = "https://seminovos.locamerica.com.br/seu-carro?combustivel=&cor=&q=&cambio=&combustiveis=&cores=&acessorios=&estado=0&loja=0&marca=0&modelo=0&anode=&anoate=&per_page={}&precode=0&precoate=0"
indice_pagina = 1

r = req.get(url.format(indice_pagina))
print(r.text)
    
asked by anonymous 19.05.2018 / 23:29

1 answer

This is because the page initially does not contain the information about the cars. It is loaded empty, and then uses JavaScript to dynamically load the data and insert it into the page.

One way to get around this is to use a webdriver such as Selenium. Basically, you run a real browser that is controlled by your Python program, so the page's JavaScript runs just as it would for a normal visitor.
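A minimal sketch of the Selenium approach, assuming Selenium and Chrome are installed. The function name fetch_rendered_html and its wait_css parameter are made up for illustration; wait_css must be a CSS selector for some element that only exists once the cars have been inserted, which you would have to find by inspecting the rendered page.

```python
def fetch_rendered_html(url, wait_css, timeout=15):
    """Load `url` in headless Chrome and return the HTML after JavaScript has run.

    `wait_css` is a CSS selector for an element that only appears once the
    dynamic content is in place. Its value is site-specific (assumption).
    """
    # Imported here so this file can be loaded even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until the dynamic element exists, or raise TimeoutException.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can then be passed to Beautiful Soup in the same way as r.text in the question.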

However, it is best to avoid this when possible: running an entire browser, which loads all the images, scripts and advertisements, is considerably slower than making plain HTTP requests.

What you can do instead is open your browser's developer tools, go to the Network tab, and watch the requests your browser makes while loading the page. Sometimes the interesting content is loaded by a simple call to an API of the website. In that case, you can send your request directly to that API.

I did this and one request stood out: a call to veiculos.json.

The other JSON requests are not interesting; they appear to be filtering options and utilities. This one seemed a bit strange to me: it did not return the car information directly, but the odd-looking format seemed to be Base64.

I copied the veiculos field and pasted it into a decoder site to confirm my suspicion, and indeed, the decoded message is HTML.
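The same check can be done offline in Python instead of on a decoder site. This is only a sketch: try_decode_base64 is a hypothetical helper, and sample_field is a made-up stand-in for the real field value, which is not reproduced here.

```python
import base64


def try_decode_base64(text):
    """Return the decoded text if `text` is valid Base64 holding UTF-8, else None."""
    try:
        raw = base64.b64decode(text, validate=True)
        return raw.decode("utf-8")
    except ValueError:
        # Covers binascii.Error (invalid Base64) and UnicodeDecodeError.
        return None


# Made-up stand-in for the real 'veiculos' value.
sample_field = base64.b64encode(b'<div class="carro">Example car</div>').decode("ascii")

decoded = try_decode_base64(sample_field)
print(decoded)

# Crude heuristic: decoded text that starts with '<' is probably markup.
looks_like_html = decoded is not None and decoded.lstrip().startswith("<")
print(looks_like_html)
```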

As proof of concept for getting this HTML with Python:

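import base64

import requests

# JSON endpoint the page calls to load the car listing (found in the Network tab)
url = 'https://seminovos.locamerica.com.br/veiculos.json?marca=&precode=&precoate=&ano_de=0&cambio=&acessorios=&current_url=https://seminovos.locamerica.com.br/seu-carro?marca=&cambio=&combustivel=&cor=&acessorios=&anode=0&precode=&precoate='

r = requests.get(url)
r.raise_for_status()  # stop early if the server returned an HTTP error

# The 'veiculos' field holds the listing HTML, Base64-encoded
info = r.json()['veiculos']
# b64decode returns bytes; decode to text (assuming the HTML is UTF-8)
info_decoded = base64.b64decode(info).decode('utf-8')

print(info_decoded)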
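Once the HTML is decoded, the per-car data and links can be pulled out of it. The sketch below uses the standard library's html.parser so that it runs without extra dependencies; the same extraction could be done with Beautiful Soup. LinkCollector is a hypothetical helper, and sample_html is invented markup, since the real structure of the decoded listing has to be inspected first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical stand-in for the decoded listing; the real markup will differ.
sample_html = """
<div class="carro"><a href="/carro/1">Example Car A</a></div>
<div class="carro"><a href="/carro/2">Example Car B</a></div>
"""

collector = LinkCollector()
collector.feed(sample_html)
for href, text in collector.links:
    print(href, text)
```

With the real data, collector.feed(info_decoded) would take the place of the sample string.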
    
20.05.2018 / 02:35