Problems with parameter restrict_xpaths in a crawler


I have no Python experience, but I decided to try something with Scrapy as a test. I'm trying to collect the existing articles on a given page, specifically those inside a div element with the ID devBody.

My goal is to get each article's title and URL, so I set a rule to crawl only the content of that element.

It turns out that, for some reason, link extraction is not limited to that element, so irrelevant links get collected and the title-URL pairs end up "shuffled" when I try to match them up. Here is the code:

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["http://dev.mysql.com/tech-resources/articles/"]


    rules = (Rule(LinkExtractor(restrict_xpaths='//div[@id="devBody"]',), callback='parse'),)


    def parse(self, response):
        entries = response.xpath('//h4')
        items = []    
        # using a counter here is certainly not the best solution, but it was
        # the only one I found to avoid receiving all the collected data in a
        # single object
        i = 0            
        for entry in entries:
            item = StackItem()
            item['title'] = entry.xpath('//a/text()').extract()[i]
            item['url'] = entry.xpath('//a/@href').extract()[i]
            yield item
            items.append(item)
            i += 1

To figure out what is going on, I turned to Chrome's Developer Tools, and there the XPath queries all look right. However, when I replicate the same logic in the code, something goes wrong. According to the logs, 57 links were collected, but quite a few of them are out of scope (i.e. outside the div with the devBody ID).
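
One way to compare the browser's view with what Scrapy actually receives is the interactive shell. A quick check like the following (the two counts shown are placeholders, not real output) would tell whether the extra links are already present in the downloaded HTML:

scrapy shell "http://dev.mysql.com/tech-resources/articles/"
>>> # links inside the target div only
>>> len(response.xpath('//div[@id="devBody"]//a'))
>>> # all links on the page, for comparison
>>> len(response.xpath('//a'))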

I have no idea what might be causing this behavior. I'm using Scrapy 1.0.5 and Python 2.7.

Thank you in advance for any help.

asked by anonymous 10.03.2016 / 20:15

1 answer


Based on this answer, I restructured the code so it works as intended. Here's the end result:

from scrapy.spiders import Spider
from stack.items import StackItem

class StackSpider(Spider):
    handle_httpstatus_list = [403, 404]
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["https://dev.mysql.com/tech-resources/articles/"]

    # with the plain Spider base class, responses from start_urls are
    # delivered to parse(), so the callback must keep that name
    def parse(self, response):
        # only the h4 headings inside the devBody div
        for row in response.xpath('//div[@id="devBody"]/h4'):
            item = StackItem()
            item['title'] = row.xpath('a/text()').extract_first()
            # build the absolute URL from the relative href
            item['url'] = response.urljoin(row.xpath('a/@href').extract_first())
            yield item
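
For the record, the rule-based approach from the question can also be made to work. The two pitfalls are that rules are only processed by CrawlSpider (the plain Spider base class silently ignores them, which is why restrict_xpaths never limited anything) and that the rule's callback must not be named parse, because CrawlSpider uses parse() internally to drive the rules. A minimal sketch along those lines (the spider name, the parse_item callback, and the //h1 title XPath are illustrative assumptions, not tested against the page):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from stack.items import StackItem


class StackCrawlSpider(CrawlSpider):
    name = "stack_crawl"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["https://dev.mysql.com/tech-resources/articles/"]

    rules = (
        Rule(
            # follow only links found inside the devBody div
            LinkExtractor(restrict_xpaths='//div[@id="devBody"]'),
            # must not be called 'parse': CrawlSpider needs parse()
            # for its own rule-driven logic
            callback='parse_item',
        ),
    )

    def parse_item(self, response):
        # each response here is one article page linked from the div
        item = StackItem()
        item['url'] = response.url
        # assumption: the article title is the page's main heading
        item['title'] = response.xpath('//h1/text()').extract_first()
        yield item

Unlike the accepted approach above, this version follows each link and yields one item per article page (one extra request per link) instead of scraping titles and URLs straight off the index page.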
answered 11.03.2016 / 13:57