I am a beginner in web scraping. I am trying to learn how to build a database from car sales data published on some websites. One of the sites is this one:
url = https://www.seminovosunidas.com.br/veiculos/page:1?utm_source=afilio&utm_medium=display&utm_campaign=maio&utm_content=ron_ambos&utm_term=120x600_promocaomaio_performance_-_-%22
I can extract the data I need from a page without problems. To iterate over the pages, I use url.format, passing as its argument a page index that increases on each iteration.
The complete code:
import requests as req
from bs4 import BeautifulSoup as bs


def get_unidas():
    url = "https://www.seminovosunidas.com.br/veiculos/page:{}?utm_source=afilio&utm_medium=display&utm_campaign=maio&utm_content=ron_ambos&utm_term=120x600_promocaomaio_performance_-_-"
    indice_pagina = 1
    dados = {}
    while True:
        #headers = {'User-Agent':random.choice(user_agent_list)}
        r = req.get(url.format(indice_pagina))
        if r.status_code != req.codes.ok:
            raise Exception("Page does not exist")
        soup = bs(r.text, "lxml")
        carros = soup.find_all(class_="vehicleDescription")
        valores = soup.find_all(class_="valor")
        for carro, valor in zip(carros, valores):
            texto = list(carro.stripped_strings)
            dados["Empresa"] = "Unidas"
            dados["Modelo"] = texto[2]
            # Convert a Brazilian-formatted price ("1.234,56") to "1234.56"
            dados["Preco"] = valor.text.replace(".", "").replace(",", ".")
            dados["Kilometragem"] = texto[4].split(",")[1][5:]
            dados["Ano"] = texto[3][-5:-1]
            #print(dados)
            #print("#######################################")
        # Move on to the next page; nothing here ever ends the loop
        indice_pagina += 1


get_unidas()
The problem is that I do not know how to make the while loop end. When I request a page with a nonexistent index, 200 for example, the site returns page 1 instead. Usually a nonexistent page has different HTML, so I could tell it apart from a page that exists, but that is not the case here. Even the status_code does not help: a request for a nonexistent page returns 200, the code that indicates an existing page.
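One idea I am considering, as a rough sketch only: since an out-of-range index returns page 1 again, the loop could stop as soon as a page's listings match those of the first page. The names collect_pages, fetch_page and fake_fetch below are hypothetical, and fake_fetch only simulates the site's behaviour so the example runs without network access; in the real scraper, fetch_page would wrap the requests and BeautifulSoup calls above. This assumes the listings on page 1 do not change while the crawl is running.

```python
from typing import Callable, List


def collect_pages(fetch_page: Callable[[int], List[str]],
                  max_pages: int = 1000) -> List[List[str]]:
    """Collect pages until the site starts serving page 1 again."""
    pages: List[List[str]] = []
    first_page = None
    # max_pages is a safety cap so the loop can never run forever
    for index in range(1, max_pages + 1):
        items = fetch_page(index)
        if not items:
            # An empty page also means there is nothing left to read
            break
        if first_page is None:
            first_page = items
        elif items == first_page:
            # Out-of-range index: the site fell back to page 1
            break
        pages.append(items)
    return pages


def fake_fetch(index: int) -> List[str]:
    """Hypothetical stand-in for the site: unknown pages return page 1."""
    site = {1: ["car A", "car B"], 2: ["car C", "car D"], 3: ["car E"]}
    return site.get(index, site[1])


print(len(collect_pages(fake_fetch)))  # prints 3
```

Comparing the extracted listings rather than the raw HTML seems safer to me, because the raw page may contain tokens or timestamps that differ between requests.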