I have no Python experience, but I decided to try something with Scrapy as a test. I'm trying to collect the articles listed on a given page, more precisely those inside a DIV element whose ID is devBody. My goal is to get each article's title and URL, so I set a rule to restrict crawling to the content of that element.

It turns out that, for some reason, link extraction is not limited to that element, which causes irrelevant links to be collected and the title-URL pairs to come out "shuffled" when I try to match them up. Here is the code:
from scrapy import Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["http://dev.mysql.com/tech-resources/articles/"]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@id="devBody"]'), callback='parse'),
    )

    def parse(self, response):
        entries = response.xpath('//h4')
        items = []

        # Using a counter here is surely not the best solution, but it was the
        # only one I found to avoid getting all the collected data in a single
        # object.
        i = 0
        for entry in entries:
            item = StackItem()
            item['title'] = entry.xpath('//a/text()').extract()[i]
            item['url'] = entry.xpath('//a/@href').extract()[i]
            yield item
            items.append(item)
            i += 1
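For completeness, stack/items.py looks roughly like this (I'm sketching it from the two fields the spider uses above):

from scrapy.item import Item, Field


class StackItem(Item):
    # Fields inferred from the spider code above.
    title = Field()
    url = Field()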
To figure out what was going on, I turned to Chrome's Developer Tools, and the XPath queries I run there all look correct. However, when I replicate the same logic in the code, something goes wrong: according to the logs, 57 links were collected, but quite a few of them are outside the target element (the div with the devBody ID).
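For reference, the same check can be done from scrapy shell; something along these lines (this is my intended, devBody-scoped query, not exactly what the spider runs):

scrapy shell "http://dev.mysql.com/tech-resources/articles/"
>>> response.xpath('//div[@id="devBody"]//h4/a/text()').extract()
>>> response.xpath('//div[@id="devBody"]//h4/a/@href').extract()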
I have no idea what might be causing this behavior. I'm using Scrapy 1.0.5 and Python 2.7.
Thank you in advance for any help.