Iterating web pages using Requests and Python

Question

I am a beginner in web scraping. As a learning exercise, I am building a database from listings of cars for sale on a few websites. One of the sites is this one:

url = https://www.seminovosunidas.com.br/veiculos/page:1?utm_source=afilio&utm_medium=display&utm_campaign=maio&utm_content=ron_ambos&utm_term=120x600_promocaomaio_performance_-_-%22

I can extract the data I need from a single page without trouble. To iterate over the pages, I call url.format with an index that I increase for each page.

The complete code:

import requests as req
from bs4 import BeautifulSoup as bs

def get_unidas():
    url = "https://www.seminovosunidas.com.br/veiculos/page:{}?utm_source=afilio&utm_medium=display&utm_campaign=maio&utm_content=ron_ambos&utm_term=120x600_promocaomaio_performance_-_-"
    indice_pagina = 1
    dados = {}
    while True:
        #headers = {'User-Agent':random.choice(user_agent_list)}
        r = req.get(url.format(indice_pagina))
        if r.status_code != req.codes.ok:
            raise Exception("Página inexistente") 
        soup = bs(r.text, "lxml")
        carros = soup.find_all(class_="vehicleDescription")
        valores = soup.find_all(class_="valor")
        for carro, valor in zip(carros,valores):
            texto = list(carro.stripped_strings)
            dados["Empresa"] = "Unidas"
            dados["Modelo"] = texto[2]
            dados["Preco"] = valor.text.replace(".","").replace(",",".")
            dados["Kilometragem"] = texto[4].split(",")[1][5:]
            dados["Ano"] = texto[3][-5:-1]
            #print(dados)
            #print("#######################################")        

get_unidas()

The problem is that I do not know how to make the while loop stop. When I request a page with a nonexistent index, 200 for example, the site returns page 1 again. Usually a nonexistent page has different HTML, so I could tell it apart from a page that exists, but that is not the case here. Checking status_code does not help either: a nonexistent page still returns 200, the code that indicates the page exists.

python-3.x web-scraping http-request python-requests

asked by anonymous 19.05.2018 / 07:32

1 answer

score 1 · Accepted Answer

The pagination li element for the page you are currently on carries the class active:

<ul class="list-unstyled list-inline header-paginator pull-right">
  <li class="active number"><a>1</a></li>
  <li class="number"><a href="/veiculos/page:2?utm_source=afilio&amp;utm_medium=display&amp;utm_campaign=maio&amp;utm_content=ron_ambos&amp;utm_term=120x600_promocaomaio_performance_-_-%22">2</a></li>
  <li class="number"><a href="/veiculos/page:3?utm_source=afilio&amp;utm_medium=display&amp;utm_campaign=maio&amp;utm_content=ron_ambos&amp;utm_term=120x600_promocaomaio_performance_-_-%22">3</a></li>
  <li class="disabled"><a>...</a></li>
  <li class="number"><a href="/veiculos/page:106?utm_source=afilio&amp;utm_medium=display&amp;utm_campaign=maio&amp;utm_content=ron_ambos&amp;utm_term=120x600_promocaomaio_performance_-_-%22">106</a></li>
</ul>

So, instead of checking status_code, you can find the li.active.number element and check its text: if the current index is greater than or equal to 2 and the element's text is 1, the site has wrapped back to the first page and you can end the loop.
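As a rough illustration of that check in isolation, here is a minimal sketch. The helper names pagina_ativa and passou_do_fim are hypothetical (they are not part of the original code), and the sketch uses Python's built-in "html.parser" instead of "lxml" only so that it runs without extra dependencies. It also assumes the pager markup looks like the snippet above.

```python
from bs4 import BeautifulSoup


def pagina_ativa(html):
    """Return the page number shown in the active pager item, or None.

    Hypothetical helper: assumes the pager marks the current page with
    <li class="active number">, as in the snippet above.
    """
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: an <li> that has both classes, in any order.
    item = soup.select_one("li.active.number")
    if item is None:
        return None
    texto = item.get_text(strip=True)
    return int(texto) if texto.isdigit() else None


def passou_do_fim(indice_pagina, html):
    """True when we asked for page >= 2 but the site served page 1 again."""
    return indice_pagina >= 2 and pagina_ativa(html) == 1
```

With the pager HTML shown above, passou_do_fim(200, html) would be true (the site fell back to page 1), while passou_do_fim(1, html) would be false because page 1 was actually requested.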

Remove the lines:

if r.status_code != req.codes.ok:
  raise Exception("Página inexistente")

and below the line:

soup = bs(r.text, "lxml")

place:

pagina_atual = list(soup.find(class_="active number").stripped_strings)[0]
if indice_pagina >= 2 and pagina_atual == '1': break

Also, do not forget to increment indice_pagina, otherwise the loop will never end. After the for loop, add:

indice_pagina += 1
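Putting the pieces together, the loop could look roughly like the sketch below. This is not the exact code from the answer: the function name coletar_paginas, the buscar_html parameter (a callable that maps a page index to HTML text) and the limite safety cap are assumptions added for illustration, and stopping when no pager is found at all is also an assumption. It collects only the raw text of each vehicleDescription block and leaves the field parsing from the question aside.

```python
from bs4 import BeautifulSoup


def coletar_paginas(buscar_html, limite=1000):
    """Collect car descriptions page by page until the site wraps to page 1.

    buscar_html is a hypothetical callable: page index -> HTML text.
    limite is a safety cap that the original loop does not have.
    """
    paginas = []
    indice_pagina = 1
    while indice_pagina <= limite:
        soup = BeautifulSoup(buscar_html(indice_pagina), "html.parser")
        ativo = soup.find(class_="active number")
        pagina_atual = ativo.get_text(strip=True) if ativo else None
        # Stop when the pager is missing (assumption) or when the site
        # silently served page 1 for an index of 2 or more.
        if pagina_atual is None or (indice_pagina >= 2 and pagina_atual == "1"):
            break
        carros = soup.find_all(class_="vehicleDescription")
        paginas.append([c.get_text(" ", strip=True) for c in carros])
        indice_pagina += 1  # without this the loop would never advance
    return paginas
```

To run it against the real site, buscar_html could be something like lambda i: req.get(url.format(i)).text, using the url template from the question. Passing the fetch step in as a parameter also makes the stop condition easy to try out on canned HTML.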

See working at repl.it