Error with requests in Scrapy

I have a CSV file with some URLs that need to be accessed.

http://www.icarros.com.br/Audi, Audi
http://www.icarros.com.br/Fiat, Fiat
http://www.icarros.com.br/Chevrolet, Chevrolet

I have a spider to make all the requests.

import scrapy
import csv
from scrapy.selector import Selector

class ModelSpider(scrapy.Spider):
    name = "config_brands"
    start_urls = [
        'http://www.icarros.com/'
    ]

    def parse(self, response):
        file = open("files/brands.csv")
        reader = csv.reader(file)

        for line in reader:
            yield scrapy.Request(line[0], self.success_connect, self.error_connect)

    def success_connect(self, response):
        self.log('Entrei na url: %s' %response.url)

    def error_connect(self, response):
        self.log('Nao foi possivel %s' %response.url)

When I try to run the spider, it cannot connect to any of the URLs, even though the same URLs open normally in a browser. My errback function is also never called.

Debug:

2016-09-09 10:17:00 [scrapy] DEBUG: Crawled (200) <GET http://www.icarros.com.br/principal/index.jsp> (referer: None)
2016-09-09 10:17:00 [scrapy] DEBUG: Retrying <<bound method ModelSpider.error_connect of <ModelSpider 'config_brands' at 0x7f7d18b45990>> http://www.icarros.com.br/Audi> (failed 1 times): 400 Bad Request
2016-09-09 10:17:07 [scrapy] DEBUG: Retrying <<bound method ModelSpider.error_connect of <ModelSpider 'config_brands' at 0x7f7d18b45990>> http://www.icarros.com.br/Audi> (failed 2 times): 400 Bad Request
2016-09-09 10:17:14 [scrapy] DEBUG: Gave up retrying <<bound method ModelSpider.error_connect of <ModelSpider 'config_brands' at 0x7f7d18b45990>> http://www.icarros.com.br/Audi> (failed 3 times): 400 Bad Request
2016-09-09 10:17:14 [scrapy] DEBUG: Crawled (400) <<bound method ModelSpider.error_connect of <ModelSpider 'config_brands' at 0x7f7d18b45990>> http://www.icarros.com.br/Audi> (referer: http://www.icarros.com.br/principal/index.jsp)
    
asked by anonymous 09.09.2016 / 15:19

1 answer

You have at least two ways to resolve this.

  • The first is to tell the middleware that you want to handle response codes outside the 200-300 range; do this with handle_httpstatus_list :

    class ModelSpider(scrapy.Spider):
        name = "config_brands"
        handle_httpstatus_list = [400, 403]
    

    See the documentation for more details.
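
    With handle_httpstatus_list set, those responses are delivered to your callback instead of being filtered out by the HttpError middleware, so you can inspect response.status there. A minimal sketch (assuming you only want to log the failures):

    def success_connect(self, response):
        if response.status in (400, 403):
            # Reached the callback thanks to handle_httpstatus_list
            self.logger.warning('Recebi %d de %s', response.status, response.url)
            return
        self.logger.info('Entrei na url: %s', response.url)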

    "And my errback function also does not work."

    The requests fail because self.error_connect is being passed as the third positional argument of scrapy.Request, which is method; Scrapy then tries to use the bound method as the HTTP verb, which is why your log shows <bound method ModelSpider.error_connect ...> where <GET ...> should appear, and the server answers 400 Bad Request. Pass callback and errback by keyword instead:

    yield scrapy.Request(line[0], callback=self.success_connect,
                         errback=self.error_connect)

    With these two changes, your code should work as expected; a combined sketch follows.
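
    A minimal sketch with both changes applied (keeping the files/brands.csv path from the question):

    import csv
    import scrapy

    class ModelSpider(scrapy.Spider):
        name = "config_brands"
        # Let 400/403 responses reach the callback instead of being filtered out
        handle_httpstatus_list = [400, 403]
        start_urls = [
            'http://www.icarros.com/'
        ]

        def parse(self, response):
            with open("files/brands.csv") as f:
                for line in csv.reader(f):
                    # callback/errback passed by keyword, so method stays GET
                    yield scrapy.Request(line[0],
                                         callback=self.success_connect,
                                         errback=self.error_connect)

        def success_connect(self, response):
            self.logger.info('Entrei na url: %s', response.url)

        def error_connect(self, failure):
            self.logger.error('Nao foi possivel: %s', failure.request.url)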

  • An alternative is to use the start_requests method, which is more appropriate than parse here: if you want to kick off requests for a list of URLs, start_requests is the natural place, while parse is normally used to process responses.

    You can do this:

    import csv
    import scrapy

    class ModelSpider(scrapy.Spider):
        name = "config_brands"

        def start_requests(self):
            with open('brands.csv', 'r') as f:
                reader = csv.reader(f)

                for url, modelo in reader:
                    yield scrapy.Request(url, callback=self.success_connect,
                                         errback=self.error_connect)

    In success_connect you handle the received response; for example:

    def success_connect(self, response):
        self.logger.info('Entrei na url: {}'.format(response.url))

        anuncios = response.xpath('//div[@class="dados_veiculo"]')

        for anuncio in anuncios:
            titulo = anuncio.xpath('a[@class="clearfix"]/@title').extract()[0]
            valor = anuncio.xpath('a/p/text()').extract()[0]

            # To handle accented characters (needed on Python 2;
            # on Python 3, strings are already Unicode)
            titulo = titulo.encode('utf-8')
            valor = valor.encode('utf-8')

            print("{}: {}".format(titulo, valor))

    In error_connect you handle or report the failure:

    def error_connect(self, failure):
        # The errback receives a twisted Failure, not a Response;
        # the original Request is available as failure.request
        self.logger.error('Nao foi possivel: {}'.format(failure.request.url))
  • If you prefer to handle the exceptions that occur during request processing more precisely, take a look at this example in the documentation; a condensed sketch of that pattern follows.
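
    A condensed sketch of the docs pattern, distinguishing the most common failure types:

    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError
    from twisted.internet.error import TimeoutError, TCPTimedOutError

    def error_connect(self, failure):
        # Log the full failure, including the traceback
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # A response was received, but with a non-2xx status code
            response = failure.value.response
            self.logger.error('HttpError em %s', response.url)
        elif failure.check(DNSLookupError):
            # The domain name could not be resolved
            request = failure.request
            self.logger.error('DNSLookupError em %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            # The server did not answer in time
            request = failure.request
            self.logger.error('TimeoutError em %s', request.url)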

answered 11.09.2016 / 05:16