Multiple pipelines to handle different spiders in Scrapy


How do you handle pipelines.py when you have different spiders?

Example: I have one spider that scrapes blog posts from a blog, and another that saves the JPEG banner images found on each page. Both spiders work, but they currently share the same pipeline to persist their items.

asked by anonymous 09.01.2015 / 20:54

1 answer


A common pattern in pipelines (and in spider middlewares as well) is to use spider attributes to decide what to do:

class MyPipeline:
    def process_item(self, item, spider):
        # Only process items from spiders that opt in via this attribute
        if getattr(spider, 'my_pipeline_enabled', False):
            ...  # do the actual work here
        # Pipelines must return the item (or raise DropItem) so that
        # later pipelines still receive it
        return item

This way, even though the pipeline is enabled for the entire project, the my_pipeline_enabled attribute lets you turn it on only for the spiders you want.
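On the spider side, all that is needed is the attribute itself. A minimal sketch (the spider name and URL are placeholders, not from the question):

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blog'
    start_urls = ['http://example.com/blog']

    # Opt in to MyPipeline; spiders without this attribute are skipped
    my_pipeline_enabled = True

    def parse(self, response):
        ...  # extract blog posts here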

You can also extend this code to take a setting into account, if necessary.
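For example, a sketch that reads a project-wide default from the settings via from_crawler (a real Scrapy hook), with the spider attribute still taking precedence; the setting name MY_PIPELINE_ENABLED_DEFAULT is hypothetical, not a built-in Scrapy setting:

class MyPipeline:
    def __init__(self, default_enabled=False):
        self.default_enabled = default_enabled

    @classmethod
    def from_crawler(cls, crawler):
        # Read a project-wide fallback from the settings
        default = crawler.settings.getbool('MY_PIPELINE_ENABLED_DEFAULT', False)
        return cls(default)

    def process_item(self, item, spider):
        # The spider attribute wins; the setting is only the fallback
        if getattr(spider, 'my_pipeline_enabled', self.default_enabled):
            ...  # do the actual work here
        return item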

In Scrapy 0.25+ (not yet released; for now you have to install from the Git repository), there is also the alternative of defining settings on the spider itself, which take precedence over the project-wide settings.
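That feature later shipped as the custom_settings class attribute. A sketch of per-spider pipeline selection with it, assuming that attribute and a placeholder pipeline path:

import scrapy

class BannerSpider(scrapy.Spider):
    name = 'banners'

    # Per-spider settings override the project settings, so this spider
    # runs only the image pipeline (the module path is a placeholder)
    custom_settings = {
        'ITEM_PIPELINES': {
            'myproject.pipelines.ImageBannerPipeline': 300,
        },
    }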

answered 09.01.2015 / 22:03