Scrapy queue and MySQL store


I've grouped a few questions because I think they're related.

I've done a test with a script where the scraped links are stored in the database along with their data.

Is this a bad practice? (High Priority)

Do I have to do something more to avoid duplicates? My pipeline has a simple check on link=%s; would it be better to use md5(link)? Would the query be faster?

I can use -s JOBDIR=crawls/somespider-1 to pause and resume the crawler, but I would like to know how to do this with a list of links to be processed kept in MySQL. (Low priority)

I need to add new items to my start_urls list, or queue, dynamically. Should I create a Request with callback parse_category? Is there any way I can append to self.queue or self.start_urls to add new URLs to be processed? (High priority)

asked by anonymous 13.07.2016 / 18:53

1 answer


Is this a bad practice?

No, as long as the handling of the MySQL connection is also done following Python best practices.
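For illustration only, a pipeline along those lines could look like the sketch below; the pymysql driver, the links table, and its link/data columns are assumptions, not the asker's actual schema.

import pymysql

class MySQLStorePipeline:
    def open_spider(self, spider):
        # One connection per crawl, not one per item.
        self.conn = pymysql.connect(host='localhost', user='user',
                                    password='pass', db='scrapy_db')
        self.cursor = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.close()

    def process_item(self, item, spider):
        # The same simple duplicate check described in the question.
        self.cursor.execute("SELECT 1 FROM links WHERE link = %s",
                            (item['link'],))
        if self.cursor.fetchone() is None:
            self.cursor.execute(
                "INSERT INTO links (link, data) VALUES (%s, %s)",
                (item['link'], item['data']))
            self.conn.commit()
        return item

On the duplicates question: instead of hashing with md5(link) in Python, a UNIQUE index on the link column combined with INSERT IGNORE is usually simpler, and it lets MySQL reject duplicates itself.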

  

I can use -s JOBDIR=crawls/somespider-1 to pause and resume the crawler, but I would like to know how to do this with a list of links to be processed kept in MySQL.

First you select the records from MySQL. Then you can loop over them, creating the requests with their respective callbacks:

for registro in registros:
    resultado = Request(url=registro['url'], callback=self.meu_callback)
    # Do any extra setup on resultado here if you need to, then yield
    # it so Scrapy actually schedules the request.
    yield resultado
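Putting it together, the selection step could live in the spider's start_requests method. The sketch below is an illustration, not the asker's code: the pymysql driver, the fila table, its processado flag, and the connection details are all assumptions, and meu_callback is the callback from the snippet above.

import pymysql
from scrapy import Request, Spider

class MeuSpider(Spider):
    name = 'meu_spider'

    def start_requests(self):
        # Pick up the links a previous run left unprocessed.
        conn = pymysql.connect(host='localhost', user='user',
                               password='pass', db='scrapy_db')
        with conn.cursor(pymysql.cursors.DictCursor) as cursor:
            cursor.execute("SELECT url FROM fila WHERE processado = 0")
            registros = cursor.fetchall()
        conn.close()
        for registro in registros:
            yield Request(url=registro['url'], callback=self.meu_callback)

    def meu_callback(self, response):
        # Parse the page here and mark the link as processed in MySQL.
        pass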
  

Should I create a Request with callback parse_category?

It's the right thing to do.

  

Is there any way I can append to self.queue or self.start_urls to add new URLs to be processed?

You would populate self.queue from your database, and self.start_urls from the URL column of each record fetched.
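Concretely, the standard Scrapy way to add URLs mid-crawl is to yield new Request objects from a callback; the scheduler then acts as the queue. A sketch, where the a.categoria selector is an assumption:

from scrapy import Request

def parse(self, response):
    # Each category link found on the page is scheduled as soon as it
    # is discovered, using the parse_category callback mentioned above.
    for href in response.css('a.categoria::attr(href)').extract():
        yield Request(response.urljoin(href), callback=self.parse_category)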

    
13.07.2016 / 20:01