Error with requests in Scrapy

I have a CSV file with some URLs that need to be accessed.

http://www.icarros.com.br/Audi, Audi
http://www.icarros.com.br/Fiat, Fiat
http://www.icarros.com.br/Chevrolet, Chevrolet

I have a spider to make all the requests.

import scrapy
import csv
from scrapy.selector import Selector

class ModelSpider(scrapy.Spider):
    name = "config_brands"
    start_urls = [
        'http://www.icarros.com/'
    ]

    def parse(self, response):
        file = open("files/brands.csv")
        reader = csv.reader(file)

        for line in reader:
            yield scrapy.Request(line[0], self.success_connect, self.error_connect)

    def success_connect(self, response):
        self.log('Entrei na url: %s' %response.url)

    def error_connect(self, response):
        self.log('Nao foi possivel %s' %response.url)

When I try to run the spider, it cannot connect to any of the URLs, even though the same URLs open normally in a browser. My errback function is also never called.

Debug:

2016-09-09 10:17:00 [scrapy] DEBUG: Crawled (200) <GET http://www.icarros.com.br/principal/index.jsp> (referer: None)
2016-09-09 10:17:00 [scrapy] DEBUG: Retrying <<bound method ModelSpider.error_connect of <ModelSpider 'config_brands' at 0x7f7d18b45990>> http://www.icarros.com.br/Audi> (failed 1 times): 400 Bad Request
2016-09-09 10:17:07 [scrapy] DEBUG: Retrying <<bound method ModelSpider.error_connect of <ModelSpider 'config_brands' at 0x7f7d18b45990>> http://www.icarros.com.br/Audi> (failed 2 times): 400 Bad Request
2016-09-09 10:17:14 [scrapy] DEBUG: Gave up retrying <<bound method ModelSpider.error_connect of <ModelSpider 'config_brands' at 0x7f7d18b45990>> http://www.icarros.com.br/Audi> (failed 3 times): 400 Bad Request
2016-09-09 10:17:14 [scrapy] DEBUG: Crawled (400) <<bound method ModelSpider.error_connect of <ModelSpider 'config_brands' at 0x7f7d18b45990>> http://www.icarros.com.br/Audi> (referer: http://www.icarros.com.br/principal/index.jsp)
    
asked by anonymous 09.09.2016 / 15:19

1 answer

You have at least two ways to resolve this.

  • The first is to tell the middleware that you want to handle response codes outside the 200-300 range; do this with handle_httpstatus_list :

    class ModelSpider(scrapy.Spider):
        name = "config_brands"
        handle_httpstatus_list = [400, 403]
    

    See the documentation for more details.
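
    With handle_httpstatus_list set, those responses are delivered to your callback instead of being filtered out by the HttpError middleware, so you can inspect response.status there. A minimal sketch (assuming you only want to log the failures):

    def success_connect(self, response):
        if response.status in (400, 403):
            # Reached the callback thanks to handle_httpstatus_list
            self.logger.warning('Recebi %d de %s', response.status, response.url)
            return
        self.logger.info('Entrei na url: %s', response.url)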

    "And my errback function also does not work."

    The requests fail because self.error_connect is being passed as the third positional argument of scrapy.Request, which is method; Scrapy then tries to use the bound method as the HTTP verb, which is why your log shows <bound method ModelSpider.error_connect ...> where <GET ...> should appear, and the server answers 400 Bad Request. Pass callback and errback by keyword instead:

    yield scrapy.Request(line[0], callback=self.success_connect,
                         errback=self.error_connect)

    With these two changes, your code should work as expected; a combined sketch follows.
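
    A minimal sketch with both changes applied (keeping the files/brands.csv path from the question):

    import csv
    import scrapy

    class ModelSpider(scrapy.Spider):
        name = "config_brands"
        # Let 400/403 responses reach the callback instead of being filtered out
        handle_httpstatus_list = [400, 403]
        start_urls = [
            'http://www.icarros.com/'
        ]

        def parse(self, response):
            with open("files/brands.csv") as f:
                for line in csv.reader(f):
                    # callback/errback passed by keyword, so method stays GET
                    yield scrapy.Request(line[0],
                                         callback=self.success_connect,
                                         errback=self.error_connect)

        def success_connect(self, response):
            self.logger.info('Entrei na url: %s', response.url)

        def error_connect(self, failure):
            self.logger.error('Nao foi possivel: %s', failure.request.url)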

  • An alternative is to use the start_requests method, which is more appropriate than parse here: if you want to kick off requests for a list of URLs, start_requests is the natural place, while parse is normally used to process responses.

    You can do this:

    import csv
    import scrapy

    class ModelSpider(scrapy.Spider):
        name = "config_brands"

        def start_requests(self):
            with open('brands.csv', 'r') as f:
                reader = csv.reader(f)

                for url, modelo in reader:
                    yield scrapy.Request(url, callback=self.success_connect,
                                         errback=self.error_connect)

    In success_connect you handle the received response; for example:

    def success_connect(self, response):
        self.logger.info('Entrei na url: {}'.format(response.url))

        anuncios = response.xpath('//div[@class="dados_veiculo"]')

        for anuncio in anuncios:
            titulo = anuncio.xpath('a[@class="clearfix"]/@title').extract()[0]
            valor = anuncio.xpath('a/p/text()').extract()[0]

            # To handle accented characters (needed on Python 2;
            # on Python 3, strings are already Unicode)
            titulo = titulo.encode('utf-8')
            valor = valor.encode('utf-8')

            print("{}: {}".format(titulo, valor))

    In error_connect you handle or report the failure:

    def error_connect(self, failure):
        # The errback receives a twisted Failure, not a Response;
        # the original Request is available as failure.request
        self.logger.error('Nao foi possivel: {}'.format(failure.request.url))
  • If you prefer to handle the exceptions that occur during request processing more precisely, take a look at this example in the documentation; a condensed sketch of that pattern follows.
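
    A condensed sketch of the docs pattern, distinguishing the most common failure types:

    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError
    from twisted.internet.error import TimeoutError, TCPTimedOutError

    def error_connect(self, failure):
        # Log the full failure, including the traceback
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # A response was received, but with a non-2xx status code
            response = failure.value.response
            self.logger.error('HttpError em %s', response.url)
        elif failure.check(DNSLookupError):
            # The domain name could not be resolved
            request = failure.request
            self.logger.error('DNSLookupError em %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            # The server did not answer in time
            request = failure.request
            self.logger.error('TimeoutError em %s', request.url)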

answered 11.09.2016 / 05:16