I'm developing a module to collect information about the spiders running on the company's system. Below is the model where we record the start of each operation and its job ID. I would like to verify whether the jobs finished correctly and fill in the remaining fields (job_finished_dt and job_has_errors).
models.py
# -*- coding: utf-8 -*-
from django.db import models


class CrawlerJob(models.Model):
    job_id = models.CharField(verbose_name=u'ID da Tarefa', editable=False,
                              max_length=255, blank=True, null=True)
    job_started_dt = models.DateTimeField(
        verbose_name=u'data de início da tarefa', blank=True, null=True,
        editable=False)
    job_has_errors = models.BooleanField(
        verbose_name=u'erros?', blank=True, default=False)
    job_finished_dt = models.DateTimeField(
        verbose_name=u'data de fim da tarefa', blank=True, null=True,
        editable=False)
tasks.py
# -*- coding: utf-8 -*-
from app.models import CrawlerJob
from celery.decorators import periodic_task
from celery.task.schedules import crontab
from django.utils import timezone
from scrapyd_api import ScrapydAPI

scrapy_url = 'http://localhost:6800'
scrapyd = ScrapydAPI(scrapy_url)


@periodic_task(run_every=crontab(hour='6-19'))
def funcao_assincrona():
    # Schedules the spider on scrapyd and records which job started and when.
    crj = CrawlerJob()
    crj.job_id = scrapyd.schedule('projeto_X', 'rodar_spider')
    crj.job_started_dt = timezone.now()
    crj.save()
One idea was to read the system logs and look for the output scrapyd generates, as in the excerpt below (a sketch of that parsing approach follows the excerpt).
2015-01-09 12:40:18-0300 [spider] INFO: Closing spider (finished)
2015-01-09 12:40:18-0300 [spider] INFO: Stored jsonlines feed (11 items) in: scrapyd_build/items/projeto_X/spider/5a3bc7ca980e11e4b396600308991ea6.jl
2015-01-09 12:40:18-0300 [spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2448,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 3,
'downloader/request_method_count/POST': 2,
'downloader/response_bytes': 28218,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 1, 9, 15, 40, 18, 93445),
'item_scraped_count': 11,
'log_count/DEBUG': 19,
'log_count/INFO': 8,
'request_depth_max': 4,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2015, 1, 9, 15, 40, 13, 90020)}
2015-01-09 12:40:18-0300 [spider] INFO: Spider closed (finished)
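A minimal sketch of that log-parsing idea, assuming scrapyd's default layout where each job writes its log to logs/<project>/<spider>/<job_id>.log; the LOGS_DIR path, the regexes, and the preencher_job helper are illustrative assumptions, not a verified API:
# -*- coding: utf-8 -*-
import os
import re

from django.utils import timezone

from app.models import CrawlerJob

LOGS_DIR = '/var/lib/scrapyd/logs'  # assumed scrapyd logs_dir; adjust to your deployment

# Matches the 'Closing spider (<reason>)' line seen in the excerpt above.
FINISHED_RE = re.compile(r'INFO: Closing spider \((?P<reason>\w+)\)')
# Matches ERROR/CRITICAL lines in the same '[spider] LEVEL: message' format.
ERROR_RE = re.compile(r'\] (ERROR|CRITICAL): (?P<message>.+)')


def preencher_job(crj, project='projeto_X', spider='rodar_spider'):
    """Fills job_finished_dt / job_has_errors by scanning the job's log file."""
    log_path = os.path.join(LOGS_DIR, project, spider, '%s.log' % crj.job_id)
    if not os.path.exists(log_path):
        return  # job still running, or the log has not been written yet

    finish_reason = None
    errors = []
    with open(log_path) as log:
        for line in log:
            match = FINISHED_RE.search(line)
            if match:
                finish_reason = match.group('reason')
            match = ERROR_RE.search(line)
            if match:
                errors.append(match.group('message'))

    if finish_reason is not None:
        # Approximation: the real finish time is in the log's own timestamps.
        crj.job_finished_dt = timezone.now()
        crj.job_has_errors = bool(errors) or finish_reason != 'finished'
        crj.save()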
Is there a more practical way to get this information? And, when something goes wrong, how can I capture the messages produced by errors or exceptions?
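For context, python-scrapyd-api also exposes a job_status(project, job_id) helper that returns 'pending', 'running', 'finished' or '' for unknown jobs, so a periodic check along these lines could at least fill in job_finished_dt; as far as I can tell it says nothing about errors, which is the part I'm missing (verificar_jobs and the 5-minute schedule are just illustrative):
# -*- coding: utf-8 -*-
from app.models import CrawlerJob
from celery.decorators import periodic_task
from celery.task.schedules import crontab
from django.utils import timezone
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')


@periodic_task(run_every=crontab(minute='*/5'))
def verificar_jobs():
    # Re-check only jobs that were scheduled but never marked as finished.
    pendentes = CrawlerJob.objects.filter(job_id__isnull=False,
                                          job_finished_dt__isnull=True)
    for crj in pendentes:
        if scrapyd.job_status('projeto_X', crj.job_id) == 'finished':
            crj.job_finished_dt = timezone.now()  # approximate finish time
            crj.save()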