I'm developing a module to collect information about the spiders running on the company's system. Below is the model where we record the start of each operation and its job ID. I would like to verify whether the jobs finished correctly and fill in the remaining fields (job_finished_dt and job_has_errors).
models.py
# -*- coding: utf-8 -*-
from django.db import models


class CrawlerJob(models.Model):
    job_id = models.CharField(verbose_name=u'ID da Tarefa', editable=False,
                              max_length=255, blank=True, null=True)
    job_started_dt = models.DateTimeField(
        verbose_name=u'data de início da tarefa', blank=True, null=True,
        editable=False)
    job_has_errors = models.BooleanField(
        verbose_name=u'erros?', blank=True, default=False)
    job_finished_dt = models.DateTimeField(
        verbose_name=u'data de fim da tarefa', blank=True, null=True,
        editable=False)
tasks.py
# -*- coding: utf-8 -*-
from app.models import CrawlerJob
from celery.decorators import periodic_task
from celery.task.schedules import crontab
from django.utils import timezone
from scrapyd_api import ScrapydAPI

scrapy_url = 'http://localhost:6800'
scrapyd = ScrapydAPI(scrapy_url)


@periodic_task(run_every=crontab(hour='6-19'))
def funcao_assincrona():
    # Schedules the spider on scrapyd and records which job started and when.
    crj = CrawlerJob()
    crj.job_id = scrapyd.schedule('projeto_X', 'rodar_spider')
    crj.job_started_dt = timezone.now()
    crj.save()
One idea was to read the system logs and look for the output scrapyd generates, as in the excerpt below (a sketch of that parsing approach follows the excerpt).
2015-01-09 12:40:18-0300 [spider] INFO: Closing spider (finished)
2015-01-09 12:40:18-0300 [spider] INFO: Stored jsonlines feed (11 items) in: scrapyd_build/items/projeto_X/spider/5a3bc7ca980e11e4b396600308991ea6.jl
2015-01-09 12:40:18-0300 [spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2448,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 28218,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 9, 15, 40, 18, 93445),
'item_scraped_count': 11,
'log_count/DEBUG': 19,
'log_count/INFO': 8,
'request_depth_max': 4,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2015, 1, 9, 15, 40, 13, 90020)}
2015-01-09 12:40:18-0300 [spider] INFO: Spider closed (finished)
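A minimal sketch of that log-parsing idea, assuming scrapyd's default layout where each job writes its log to logs/<project>/<spider>/<job_id>.log; the LOGS_DIR path, the regexes, and the preencher_job helper are illustrative assumptions, not a verified API:
# -*- coding: utf-8 -*-
import os
import re

from django.utils import timezone

from app.models import CrawlerJob

LOGS_DIR = '/var/lib/scrapyd/logs'  # assumed scrapyd logs_dir; adjust to your deployment

# Matches the 'Closing spider (<reason>)' line seen in the excerpt above.
FINISHED_RE = re.compile(r'INFO: Closing spider \((?P<reason>\w+)\)')
# Matches ERROR/CRITICAL lines in the same '[spider] LEVEL: message' format.
ERROR_RE = re.compile(r'\] (ERROR|CRITICAL): (?P<message>.+)')


def preencher_job(crj, project='projeto_X', spider='rodar_spider'):
    """Fills job_finished_dt / job_has_errors by scanning the job's log file."""
    log_path = os.path.join(LOGS_DIR, project, spider, '%s.log' % crj.job_id)
    if not os.path.exists(log_path):
        return  # job still running, or the log has not been written yet

    finish_reason = None
    errors = []
    with open(log_path) as log:
        for line in log:
            match = FINISHED_RE.search(line)
            if match:
                finish_reason = match.group('reason')
            match = ERROR_RE.search(line)
            if match:
                errors.append(match.group('message'))

    if finish_reason is not None:
        # Approximation: the real finish time is in the log's own timestamps.
        crj.job_finished_dt = timezone.now()
        crj.job_has_errors = bool(errors) or finish_reason != 'finished'
        crj.save()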
Is there a more practical way to get this information? And, when something goes wrong, how can I capture the messages produced by errors or exceptions?
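For context, python-scrapyd-api also exposes a job_status(project, job_id) helper that returns 'pending', 'running', 'finished' or '' for unknown jobs, so a periodic check along these lines could at least fill in job_finished_dt; as far as I can tell it says nothing about errors, which is the part I'm missing (verificar_jobs and the 5-minute schedule are just illustrative):
# -*- coding: utf-8 -*-
from app.models import CrawlerJob
from celery.decorators import periodic_task
from celery.task.schedules import crontab
from django.utils import timezone
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')


@periodic_task(run_every=crontab(minute='*/5'))
def verificar_jobs():
    # Re-check only jobs that were scheduled but never marked as finished.
    pendentes = CrawlerJob.objects.filter(job_id__isnull=False,
                                          job_finished_dt__isnull=True)
    for crj in pendentes:
        if scrapyd.job_status('projeto_X', crj.job_id) == 'finished':
            crj.job_finished_dt = timezone.now()  # approximate finish time
            crj.save()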