Thread control to prevent locking


I'm starting a crawler project, and the idea is to download all of the content from a particular website. It already has some "usable" features. What is killing me is that I made it multithreaded, but at some point the threads stop and I don't know how to avoid it.

I ran some tests and noticed that the threads are still alive. They're still there, but they seem to be stuck in a locked state.
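The test was nothing sophisticated, roughly along these lines (a sketch; it just inspects the running threads):

import threading

# Every thread still shows up and reports alive, even though no
# progress is being made.
for t in threading.enumerate():
    print t.name, t.is_alive()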

It may take 5 seconds or 5 hours, but one thing is certain: it will lock up. And I'd like to trust my crawler enough to let it run 24 hours a day.

So here are my questions:

Is there a limit to the number of threads I can use?

How do I prevent my thread from locking?

import time
import sqlite3
import urllib2
from threading import Thread

class Fetcher(Thread):

    wait_time = 7
    dispatcher = None
    work = None

    def __init__(self, dispatcher, *args, **kwargs):
        # Pop wait_time before delegating, since Thread.__init__ does
        # not accept it as a keyword argument.
        self.wait_time = kwargs.pop('wait_time', 7)
        Thread.__init__(self, *args, **kwargs)
        self.dispatcher = dispatcher
        self.start()

    def request_work(self):
        self.work = None
        if self.dispatcher.has_work():
            self.work = self.dispatcher.get_work()

    def do(self):
        if self.work is not None:
            self.fetch_url()

    def fetch_url(self):
        request = urllib2.Request(self.work.url)

        try:
            response = urllib2.urlopen(request)
            html = buffer(response.read())
            page = Page(self.work, html)
            page.save()
        except (urllib2.URLError, sqlite3.OperationalError):
            # Put the failed work item back in the pool to retry later.
            self.dispatcher.fill_pool([self.work])
        except Exception:
            self.dispatcher.fill_pool([self.work])

    def run(self):
        while True:
            self.request_work()
            if self.work:
                self.do()
                time.sleep(self.wait_time)

Dispatcher:

class Dispatcher:        
    def __init__(self, *args, **kwargs):
        self.pool = []

    def has_work(self):
        return len(self.pool) > 0

    def get_work(self):
        return self.pool.pop(0)

    def fill_pool(self, workload):
        self.pool = self.pool + workload

Running Example:

dispatcher = Dispatcher()
dispatcher.fill_pool(['url1', 'url2', 'url3'])
fetcher1 = Fetcher(dispatcher)
fetcher2 = Fetcher(dispatcher)
fetcher3 = Fetcher(dispatcher)
fetcher4 = Fetcher(dispatcher)

I put this example in at user Brumazzi's request, but it will not run on its own. As said before, the crawler I'm creating depends on all of its components running together, and the Page class is part of the project, representing a record in the database.

asked by anonymous 30.03.2016 / 16:23

1 answer


You are using a list that is not thread-safe (dispatcher.pool) and sharing it between several workers (Fetcher). That is the likely source of your problem: with a plain list, two workers can both see has_work() return True and then race each other in get_work() for the same last item. Try switching from the plain list to a Queue.Queue, which does this locking for you.
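A minimal sketch of that change, assuming the rest of the project stays as it is (the 5-second timeout and the Queue.Empty handling are illustrative choices, not requirements):

import Queue

class Dispatcher:
    def __init__(self, *args, **kwargs):
        self.pool = Queue.Queue()

    def get_work(self, timeout=5):
        # Blocks until an item is available or the timeout expires,
        # raising Queue.Empty on timeout. Checking and removing the
        # item is a single atomic operation, so there is no separate
        # has_work() call to race against.
        return self.pool.get(True, timeout)

    def fill_pool(self, workload):
        for work in workload:
            self.pool.put(work)

The worker loop then asks for work directly instead of checking first:

    def run(self):
        while True:
            try:
                self.work = self.dispatcher.get_work()
            except Queue.Empty:
                continue  # no work right now; try again (or break to stop)
            self.do()
            time.sleep(self.wait_time)

Because get() blocks while the pool is empty, this also stops the workers from busy-spinning in their while True loop.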

    
answered 05.04.2016 / 14:55