I need to write a crawler for a book site. I can already extract the data I need from a single page, and I have also crawled the entire domain, but I wanted to do it in a more orderly and logical way.
I'd like to start by extracting from one URL and let the code step through increments of that URL.
class QuotesSpider(CrawlSpider):
    name = "adororomance2"
    start_urls = [
        'http://www.adororomances.com.br/arromances.php?cod=1',
    ]
So after collecting the data, I'd like it to move on to the URL that is the same as start_urls but ends in '.php?cod=2', call back to extract the data from that new URL, and continue like this until it reaches a page where no book title is found, at which point it should stop.
What I've tried so far, which did not work:
    def parse(self, response):
        for livro in response.xpath('//*[@id="page_livro_coluna"]'):
            yield {
                'titulo': livro.xpath(
                    '//*[@id="page_livro_coluna"]/div[1]/h1/text()').extract_first(),
                'autor(a)': livro.xpath(
                    '//*[@id="page_livro_coluna"]/div[2]/span/a/span/h2/text()').extract_first(),
                'titulo original': livro.xpath(
                    '//*[@id="page_livro_coluna"]/div[3]/text()').extract_first(),
                'coleção': livro.xpath(
                    '//*[@id="page_livro_coluna"]/div[4]/h3/a/text()').extract_first(),
                'publicação': livro.xpath(
                    '//*[@id="page_livro_coluna"]/div[4]/div[1]/span[1]/text()').extract_first(),
                'ano': livro.xpath(
                    '//*[@id="page_livro_coluna"]/div[4]/div[1]/span[2]/text()').extract_first(),
                'série': livro.xpath(
                    '//*[@id="page_livro_coluna"]/div[4]/div[2]/a/span/text()').extract_first(),
                'descrição': livro.xpath(
                    'normalize-space(//*[@id="description"]/text())').extract_first(),
            }
        i = 2
        next_page = 'http://www.adororomances.com.br/arromances.php?cod=' + str(i)
        if titulo is not '':
            i = i + 1
            yield response.follow(next_page, callback=self.parse)
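One problem with the attempt above is that `i` is a local variable, so it is reset to 2 on every call to `parse()` and the spider never advances past `cod=2` (and `titulo` is never assigned before the `if`). A way around keeping a counter at all is to derive the next page from `response.url` itself. The following is a minimal, hedged sketch of that idea; `next_page_url` is a hypothetical helper, not part of Scrapy, and it assumes the page number lives in the `cod` query parameter as in the URLs above:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def next_page_url(url):
    """Return the same URL with its 'cod' query parameter incremented by 1.

    Hypothetical helper: instead of a counter inside parse() (which is reset
    on every call), compute the next page directly from the current URL.
    """
    parts = urlparse(url)
    query = parse_qs(parts.query)
    cod = int(query['cod'][0]) + 1
    # Rebuild the URL with only the incremented 'cod' parameter.
    return urlunparse(parts._replace(query=urlencode({'cod': cod})))
```

Inside `parse()`, you would then extract `titulo` into a variable first, yield the item, and only follow the next page while a title was actually found, e.g. `if titulo: yield response.follow(next_page_url(response.url), callback=self.parse)`. That gives the stop condition you describe: the first page with no book title produces no further request.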