I've grouped two questions together because I think they're related.
I've written a test script where the scraped links are stored in the database together with the item data. Is this bad practice? (High priority)
Do I have to do anything more to avoid duplicates? My pipeline has a simple check like `link = %s`; would it be better to use `md5(link)` instead? Would that make lookups faster?
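To make the duplicate check concrete, here is a minimal sketch of the hash-based variant I mean. It's plain Python: a set stands in for the database table, and `seen` / `is_duplicate` are names I made up for the example, not part of my actual pipeline:

```python
import hashlib

# A set stands in for the links table; in the real pipeline this would
# be a SELECT ... WHERE hash = %s against MySQL.
seen = set()

def is_duplicate(link: str) -> bool:
    """Compare fixed-length MD5 digests instead of raw URLs."""
    digest = hashlib.md5(link.encode("utf-8")).hexdigest()
    if digest in seen:
        return True
    seen.add(digest)
    return False

print(is_duplicate("http://example.com/page"))  # first time: False
print(is_duplicate("http://example.com/page"))  # second time: True
```

The digest is always 32 hex characters, so in MySQL it could live in a fixed-width indexed `CHAR(32)` column, which may compare faster than long `VARCHAR` URLs.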
I can use `-s JOBDIR=crawls/somespider-1` to pause and resume the crawler, but I'd like to know how to do this with the list of links to be processed kept in MySQL. (Low priority)
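What I have in mind is roughly the following sketch, where pending links live in a database table and the spider pulls only the unprocessed ones on startup. `sqlite3` stands in for MySQL here so the example runs standalone, and the table and column names (`pending_links`, `url`, `processed`) are hypothetical:

```python
import sqlite3

# In-memory DB stands in for MySQL; the schema is made up for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pending_links (url TEXT PRIMARY KEY, processed INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO pending_links (url) VALUES (?)",
    [("http://example.com/a",), ("http://example.com/b",)],
)

def next_batch(limit=10):
    """Fetch unprocessed links and mark them taken, so a paused crawl
    can resume exactly where it left off."""
    rows = conn.execute(
        "SELECT url FROM pending_links WHERE processed = 0 "
        "ORDER BY rowid LIMIT ?",
        (limit,),
    ).fetchall()
    urls = [r[0] for r in rows]
    conn.executemany(
        "UPDATE pending_links SET processed = 1 WHERE url = ?",
        [(u,) for u in urls],
    )
    return urls

batch = next_batch()
```

In a real spider this would feed `start_requests()`, yielding one `scrapy.Request` per URL from the batch.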
I need to add new items to my `start_urls` list, or queue, dynamically. Should I create a `Request` with `callback=parse_category`? Is there some way I can append to `self.queue` or `self.start_urls` to add new URLs to be processed? (High priority)
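To make the last question concrete, here's the pattern I'm considering. `Request` below is a named-tuple stand-in for `scrapy.Request` so the sketch runs on its own; in the real spider it would be the actual class, and `parse_category` is the callback I mentioned. The `parse` signature is simplified too (a real Scrapy callback receives a response):

```python
from collections import namedtuple

# Stand-in for scrapy.Request, just to show the shape of the pattern.
Request = namedtuple("Request", ["url", "callback"])

class MySpider:
    start_urls = ["http://example.com/"]

    def parse_category(self, response):
        pass  # category parsing would go here

    def parse(self, discovered_urls):
        # Rather than appending to self.start_urls after the crawl has
        # started, yield a new Request per discovered URL:
        for url in discovered_urls:
            yield Request(url=url, callback=self.parse_category)

spider = MySpider()
requests = list(
    spider.parse(["http://example.com/cat/1", "http://example.com/cat/2"])
)
```

Is yielding requests like this the intended way, or is there a supported way to push into the scheduler's queue directly?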