How to manage the execution and failures of Spiders?


I'm developing a module to collect information about the spiders running on the company's system. Below is the model where we store the start of the operation and the job ID; I would like to validate whether the jobs finished correctly and fill in the remaining fields.

models.py

# -*- coding: utf-8 -*-

from django.db import models


class CrawlerJob(models.Model):

    job_id = models.CharField(verbose_name=u'ID da Tarefa', editable=False,
                              max_length=255, blank=True, null=True,)

    job_started_dt = models.DateTimeField(
        verbose_name=u'data de início da tarefa', blank=True, null=True,
        editable=False)

    job_has_errors = models.BooleanField(
        verbose_name=u'erros?', blank=True, default=False)

    job_finished_dt = models.DateTimeField(
        verbose_name=u'data de fim da tarefa', blank=True, null=True,
        editable=False)

tasks.py

# -*- coding: utf-8 -*-

from app.models import CrawlerJob
from celery.decorators import periodic_task
from celery.task.schedules import crontab
from django.utils import timezone
from scrapyd_api import ScrapydAPI

scrapy_url = 'http://localhost:6800'
scrapyd = ScrapydAPI(scrapy_url)


@periodic_task(run_every=(crontab(hour="6-19")))
def funcao_assincrona():
    # schedule the spider on scrapyd and record when the job started
    crj = CrawlerJob()
    job_id = scrapyd.schedule('projeto_X', 'rodar_spider')
    crj.job_id = job_id
    crj.job_started_dt = timezone.now()
    crj.save()
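
I also considered polling scrapyd from a second periodic task to mark the jobs as finished. This is only a rough sketch of the idea: it assumes that python-scrapyd-api exposes a job_status method, and the task name and polling schedule are just placeholders.

# -*- coding: utf-8 -*-
# Sketch only: assumes python-scrapyd-api's job_status(); the task name and
# the polling schedule are placeholders.

from app.models import CrawlerJob
from celery.decorators import periodic_task
from celery.task.schedules import crontab
from django.utils import timezone
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')


@periodic_task(run_every=(crontab(minute='*/10', hour='6-19')))
def verificar_jobs():
    # jobs that were scheduled but not yet marked as finished
    for crj in CrawlerJob.objects.filter(job_finished_dt__isnull=True):
        if scrapyd.job_status('projeto_X', crj.job_id) == 'finished':
            crj.job_finished_dt = timezone.now()
            crj.save()

However, job_status only reports pending/running/finished; it doesn't tell me whether the run had errors.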

Another idea was to access the system logs and check the output that gets generated, as in the excerpt below (a rough parsing sketch follows it).

2015-01-09 12:40:18-0300 [spider] INFO: Closing spider (finished)
2015-01-09 12:40:18-0300 [spider] INFO: Stored jsonlines feed (11 items) in: scrapyd_build/items/projeto_X/spider/5a3bc7ca980e11e4b396600308991ea6.jl
2015-01-09 12:40:18-0300 [spider] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 2448,
     'downloader/request_count': 5,
     'downloader/request_method_count/GET': 3,
     'downloader/request_method_count/POST': 2,
     'downloader/response_bytes': 28218,
     'downloader/response_count': 5,
     'downloader/response_status_count/200': 5,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2015, 1, 9, 15, 40, 18, 93445),
     'item_scraped_count': 11,
     'log_count/DEBUG': 19,
     'log_count/INFO': 8,
     'request_depth_max': 4,
     'response_received_count': 5,
     'scheduler/dequeued': 5,
     'scheduler/dequeued/memory': 5,
     'scheduler/enqueued': 5,
     'scheduler/enqueued/memory': 5,
     'start_time': datetime.datetime(2015, 1, 9, 15, 40, 13, 90020)}
2015-01-09 12:40:18-0300 [spider] INFO: Spider closed (finished)
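
Something along these lines is what I had in mind for the log check (just a sketch: the log path is hypothetical and depends on where scrapyd writes the per-job logs):

# -*- coding: utf-8 -*-
# Sketch only: point log_path at the per-job log written by your scrapyd
# instance; the path layout here is an assumption.

import os


def job_terminou_com_sucesso(log_path):
    """Inspects a scrapyd job log and reports how the spider finished."""
    if not os.path.exists(log_path):
        return None  # log not there yet, the job is probably still running
    with open(log_path) as f:
        conteudo = f.read()
    if 'ERROR' in conteudo or 'Traceback' in conteudo:
        return False
    if 'Closing spider (finished)' in conteudo:
        return True
    return None  # inconclusive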

Is there a more practical way to get this information? And, in case of failures, how can I get the messages produced by the errors or exceptions?

asked by anonymous 12.01.2015 / 17:48

1 answer


Well, since it's Scrapy that has access to the actual stats (scrapyd only runs the jobs), I think the way to solve this is to use a spider middleware that sends the crawler statistics to your application when the spider finishes.

You'll also need a way to update the application from within a Scrapy spider, and to trigger it from the spider middleware.

Here's a rough draft:

from scrapy import signals
import os

class UpdateStatsMiddleware(object):
    def __init__(self, crawler):
        self.crawler = crawler
        # register the close_spider method as a callback for the spider_closed signal
        crawler.signals.connect(self.close_spider, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def close_spider(self, spider, reason):
        spider.log('Finishing spider with reason: %s' % reason)
        stats = self.crawler.stats.get_stats()
        jobid = self.get_jobid()
        self.update_job_stats(jobid, stats)

    def get_jobid(self):
        """Gets jobid through scrapyd's SCRAPY_JOB env variable"""
        return os.environ['SCRAPY_JOB']

    def update_job_stats(self, jobid, stats):
        # TODO: update the stats in the Django application
        pass
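
To fill in that update_job_stats stub, one option is to send the stats to the Django application over HTTP. This is just a sketch: the endpoint URL, the payload fields and the use of the requests library are assumptions, not something your project already has.

import json
import requests

# ... inside UpdateStatsMiddleware ...

    def update_job_stats(self, jobid, stats):
        """Sends the relevant stats to the Django application over HTTP."""
        payload = {
            'job_id': jobid,
            'finish_reason': stats.get('finish_reason'),
            'item_scraped_count': stats.get('item_scraped_count', 0),
            'log_count_error': stats.get('log_count/ERROR', 0),
        }
        requests.post('http://localhost:8000/crawler/job-stats/',
                      data=json.dumps(payload),
                      headers={'Content-Type': 'application/json'})

On the Django side you then only need a view that receives this payload and updates the corresponding CrawlerJob (job_finished_dt, job_has_errors, etc.).

Also remember to enable the middleware in the Scrapy project's settings.py (the module path below is just an example):

SPIDER_MIDDLEWARES = {
    'projeto_X.middlewares.UpdateStatsMiddleware': 543,
}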


answered 13.01.2015 / 03:52