I'm starting a crawler project whose goal is to download all the content from a particular website. It already has some "usable" features. What is killing me is that it is multithreaded, but at some point the threads simply stop, and I don't know how to prevent it.
I ran some tests and noticed that the threads are still alive: they are still there, but they appear to be stuck in some kind of lock state.
It may take 5 seconds or 5 hours, but one thing is certain: it will lock up. And I'd like to trust my crawler enough to let it run 24 hours a day.
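To show what I mean by "lock state": every thread reports alive, but none makes progress. A minimal way to inspect where each thread is actually blocked, using only the standard library (sys._current_frames, available since Python 2.5), is a sketch like this:

    import sys
    import traceback

    def dump_thread_stacks():
        # Print the current stack of every live thread, to see the exact
        # call where each fetcher is sitting.
        for thread_id, frame in sys._current_frames().items():
            print 'Thread %s:' % thread_id
            traceback.print_stack(frame)

Calling this from the main thread once the crawler appears frozen shows which line each fetcher is stuck on.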
So here are my questions:
Is there a limit to the number of threads I can use?
How do I prevent my threads from locking?
    import sqlite3
    import time
    import urllib2
    from threading import Thread

    class Fetcher(Thread):
        wait_time = 7
        dispatcher = None
        work = None

        def __init__(self, dispatcher, *args, **kwargs):
            # Pop wait_time before delegating, so Thread.__init__ does not
            # receive an unexpected keyword argument.
            self.wait_time = kwargs.pop('wait_time', 7)
            Thread.__init__(self, *args, **kwargs)
            self.dispatcher = dispatcher
            self.start()

        def request_work(self):
            # Grab the next work item from the dispatcher, if there is one.
            self.work = None
            if self.dispatcher.has_work():
                self.work = self.dispatcher.get_work()

        def do(self):
            if self.work is not None:
                self.fetch_url()

        def fetch_url(self):
            request = urllib2.Request(self.work.url)
            try:
                response = urllib2.urlopen(request)
                html = buffer(response.read())
                page = Page(self.work, html)
                page.save()
            except (urllib2.URLError, sqlite3.OperationalError):
                # On failure, put the work item back so it gets retried.
                self.dispatcher.fill_pool([self.work])
            except Exception:
                self.dispatcher.fill_pool([self.work])

        def run(self):
            while True:
                self.request_work()
                if self.work:
                    self.do()
                time.sleep(self.wait_time)
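One detail that may matter here: urllib2.urlopen has no timeout by default, so a single unresponsive server can block a thread indefinitely. Since Python 2.6, urlopen accepts a timeout argument in seconds. A sketch of fetch_url with one applied (the 30-second value is an arbitrary choice of mine):

    def fetch_url(self):
        request = urllib2.Request(self.work.url)
        try:
            # Give up after 30 seconds instead of blocking forever.
            response = urllib2.urlopen(request, timeout=30)
            html = buffer(response.read())
            page = Page(self.work, html)
            page.save()
        except Exception:
            # A timeout surfaces as urllib2.URLError / socket.timeout,
            # both caught here; the item is requeued for a retry.
            self.dispatcher.fill_pool([self.work])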
Dispatcher:
    class Dispatcher:
        def __init__(self, *args, **kwargs):
            self.pool = []

        def has_work(self):
            return len(self.pool) > 0

        def get_work(self):
            return self.pool.pop(0)

        def fill_pool(self, workload):
            self.pool = self.pool + workload
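Note that pool is a plain list mutated by several threads at once with no locking. For comparison, a thread-safe variant built on the standard library's Queue module (named queue in Python 3), which does its own locking internally, could look like this sketch:

    import Queue  # named "queue" in Python 3

    class ThreadSafeDispatcher:
        # Hypothetical variant: Queue.Queue serializes access internally.
        def __init__(self):
            self.pool = Queue.Queue()

        def has_work(self):
            return not self.pool.empty()

        def get_work(self):
            try:
                return self.pool.get_nowait()
            except Queue.Empty:
                return None

        def fill_pool(self, workload):
            for work in workload:
                self.pool.put(work)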
Running Example:
dispatcher = Dispatcher()
dispatcher.fill_pool(['url1', 'url2', 'url3'])
fetcher1 = Fetcher(dispatcher)
fetcher2 = Fetcher(dispatcher)
fetcher3 = Fetcher(dispatcher)
fetcher4 = Fetcher(dispatcher)
I put this example up at user Brumazzi's request, but it will not run on its own. As I said before, the crawler I'm building depends on all of its components running without the slightest problem, and the Page class is part of the project, representing an object in the database.
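For anyone who wants to actually run the snippet, here is a hypothetical stand-in for the missing pieces (the Work name and the stub bodies are mine, not the real project classes; note that fetch_url expects work items with a .url attribute, not bare strings):

    class Work(object):
        # Hypothetical stand-in: fetch_url reads self.work.url.
        def __init__(self, url):
            self.url = url

    class Page(object):
        # Hypothetical stand-in for the real database-backed Page class.
        def __init__(self, work, html):
            self.work = work
            self.html = html

        def save(self):
            pass  # The real implementation persists the page via sqlite3.

    dispatcher.fill_pool([Work('http://example.com/')])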