I need help with a Python crawler

from scrapy.spiders import BaseSpider
from scrapy.selector import HtmlXPathSelector
from crawler.items import crawlerlistItem

class MySpider(BaseSpider):
    name = "epoca"
    allowed_domains = ["epocacosmeticos.com.br"]
    start_urls = ["http://www.epocacosmeticos.com.br/maquiagem"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.xpath("//span[@class='pl']")
        items = []
        for title in titles:
            item = crawlerlistItem()
            item["title"] = title.xpath("a/text()").extract()
            item["link"] = title.xpath("a/@href").extract()
            items.append(item)
        return items

I have this spider, but I want to get all the URLs of epocacosmeticos.com.br along with the product name, title and URL, without the information being duplicated. Can anyone help me?

    
asked by anonymous 16.02.2017 / 00:32

2 answers


If the problem is just that there is duplicate information inside your items list, you can check whether the item already exists before appending it:

...
item["title"] = titles.select("a/text()").extract()
item["link"] = titles.select("a/@href").extract()
if item not in items:
    items.append(item)

To prevent duplicates in a collection, my first instinct was to suggest a set(), but since item is a dictionary (mutable, and therefore not hashable), it is simpler to do the membership check shown above rather than jump through extra hoops.
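
If the items list grows large and the "if item not in items" check becomes slow, a workaround is to keep a separate set() of hashable keys built from the item's fields, instead of trying to put the dictionaries themselves in the set. A minimal sketch, reusing the names from the spider in the question (the key format is just an illustration):

seen_keys = set()
items = []
for title in titles:
    item = crawlerlistItem()
    item["title"] = title.xpath("a/text()").extract()
    item["link"] = title.xpath("a/@href").extract()
    # extract() returns lists, which are not hashable; convert them to
    # tuples so the key can be stored in a set
    key = (tuple(item["title"]), tuple(item["link"]))
    if key not in seen_keys:
        seen_keys.add(key)
        items.append(item)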

    
answered 16.02.2017 / 09:11

The solution proposed by Miguel works for this spider, since it makes only one request (the first, to the URL in start_urls). However, it is very common to have spiders that, after collecting data from a page in the parse() method (or another callback), make new requests for URLs found on the page itself; in that case, a list local to one callback cannot catch duplicates produced by the requests that follow, as the sketch below shows.
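
A minimal sketch of such a spider (the spider name and the link XPath are illustrative, not taken from the site), yielding items and also following links found on the page:

import scrapy


class ProductSpider(scrapy.Spider):
    name = "epoca_products"
    allowed_domains = ["epocacosmeticos.com.br"]
    start_urls = ["http://www.epocacosmeticos.com.br/maquiagem"]

    def parse(self, response):
        # Collect the products listed on the current page
        for sel in response.xpath("//span[@class='pl']"):
            yield {
                "title": sel.xpath("a/text()").extract_first(),
                "link": sel.xpath("a/@href").extract_first(),
            }
        # Follow other links found on the page; each request is handled by a
        # separate call to parse(), so a list created inside one call never
        # sees the items produced by the other calls
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)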

Moreover, in Scrapy projects it is good practice to separate data validation and transformation logic into Item Pipelines.

To do this, simply create a pipeline like the example below in the pipelines.py file inside your project's folder:

from scrapy.exceptions import DropItem


class DropDuplicatesPipeline(object):
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['link'] in self.urls_seen:
            raise DropItem('Duplicate item found: {}'.format(item['link']))
        else:
            self.urls_seen.add(item['link'])
            return item

And enable it in the settings.py file with the following snippet:

ITEM_PIPELINES = {
    'your_project.pipelines.DropDuplicatesPipeline': 300,
}

Once this is done, every item extracted by your spider will pass through the process_item method above and will be dropped if its link has already been seen before.

    
answered 01.03.2017 / 14:36