Building a web crawler in Python: I need help adding threads


I'm trying to develop a web crawler for study purposes. It is very simple and I would like to improve it. How can I use threads to speed up / improve the process? The program could fetch several links in parallel. How do I apply the thread concept to this crawler?

import requests
import re

to_crawl = ['http://www.g1.globo.com']  # URL to crawl (the seed: starting point)
crawled = set()  # the set of what has already been visited, i.e. already crawled
# if the URL is already in crawled, skip to the next one!

# it is a good idea to send headers to pretend to be a browser
header = {"user-agent": "Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0",
          "accept": "*/*",
          "accept-language": "en-US,en;q=0.5",
          "accept-encoding": "gzip, deflate",
          "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
          }

while True:  # run forever...

    url = to_crawl[0]
    try:  # handle e.g. invalid URLs
        req = requests.get(url, headers=header)
    except Exception:  # remove the URL
        to_crawl.remove(url)
        crawled.add(url)
        continue  # move on to the next link

    # print(req.text)  # this is the page!
    html = req.text

    links = re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', html)
    print("Crawling", url)

    # after the request, remove the URL from to_crawl and add it to the crawled set:
    to_crawl.remove(url)
    crawled.add(url)

    # now push the links into to_crawl if they are not in crawled:
    for link in links:
        if link not in crawled and link not in to_crawl:  # not in either collection
            to_crawl.append(link)

    for link in links:
        print(link)
    
asked by anonymous 01.08.2016 / 02:28

2 answers


Here's a simple (Python 3.x) example; the approach is a bit different from yours:

import re
from threading import Thread
import requests

def get_links(url):
    req = get_req(url)
    if req is not None:
        html = req.text
        urls = re.findall('(?<=href=["\'])https?://.+?(?=["\'])', html)
        return urls
    return None

def get_req(url):
    try:
        req = s.get(url)
        return req
    except Exception:
        print('[-] Error fetching page: ', url)
        return None

def inject_links(data, url):
    urls = get_links(url)
    if urls is not None:
        for url in urls:
            if url not in data and len(data) < urls_max:
                data.add(url)  # add to the crawled urls
                print('[+] Total: {} [+] putting: {} '.format(len(data), url))
                return inject_links(data, url)
    return

def producer(url, threadNum):
    while len(data) < urls_max:
        inject_links(data, url)
    #print('\n', data)  # uncomment to see the stored urls; this print is very heavy
    print('[+] Terminated - killing thread {} -> Total urls stored: {}'.format(threadNum, len(data)))
    # here you could write the results to a file, for example

data = set()
urls_max = 100
threads = 10
start_urls = ['http://pt.stackoverflow.com/', 'http://www.w3schools.com/default.asp', 'http://spectrum.ieee.org/']

s = requests.Session()
for i in range(len(start_urls)):
    for j in range(threads):
        t = Thread(target=producer, args=(start_urls[i], '{}.{}'.format(i+1, j+1)))
        t.start()

Using a set() increases performance when we add or look up something stored there, and it avoids duplicate URLs. Uncomment the print inside the producer() function to see the stored URLs.
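As a small illustration of that lookup-cost difference (this micro-benchmark is not part of the original answer; the variable names and sizes are just placeholders, and the exact numbers will vary by machine):

import timeit

# Membership test in a list is O(n); in a set it is O(1) on average.
urls_list = ['http://example.com/{}'.format(i) for i in range(10000)]
urls_set = set(urls_list)

print(timeit.timeit("'http://example.com/9999' in urls_list", globals=globals(), number=1000))
print(timeit.timeit("'http://example.com/9999' in urls_set", globals=globals(), number=1000))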

In this case we start with three links and 10 threads on each, and each thread 'dies' once we have 100 links. This condition is the core of the program: if url not in data and len(data) < urls_max — if the URL does not already exist inside our set() and the total number of URLs in the set is still below urls_max, we add it.
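Note that several threads check and update the same set here. If you want to be strict about it, the check-and-add can be guarded with a threading.Lock so two threads cannot race past the limit; a minimal sketch (not part of the original answer, the helper name is hypothetical):

from threading import Lock

data_lock = Lock()  # hypothetical lock guarding the shared 'data' set

def add_url(data, url, urls_max):
    # Do the membership check and the insert atomically, so two threads
    # cannot both pass the check and add past the limit at the same time.
    with data_lock:
        if url not in data and len(data) < urls_max:
            data.add(url)
            return True
    return False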

    
answered 04.08.2016 / 14:48

I put together a very basic example of how you could use threads (using the more modern way of working with threads in Python, through the concurrent.futures module).

ATTENTION: THE EXAMPLE WAS WRITTEN USING PYTHON 3

import re
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests


HEADERS = {
    'user-agent':
        'Mozilla/5.0 (X11; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0',
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.5',
    'accept-encoding': 'gzip, deflate',
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
}

MAX_WORKERS = 4


def fetch_url(url):
    try:
        res = requests.get(url, headers=HEADERS)
    except requests.RequestException:  # invalid URL, connection error, timeout...
        return url, ''
    return url, res.text


def process_urls(urls):
    result = {}
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = [executor.submit(fetch_url, url) for url in urls]
        # process each page as soon as its future completes
        for future in as_completed(futures):
            url, html = future.result()
            result[url] = html
    return result


if __name__ == '__main__':
    urls = ['http://www.pudim.com.br/']
    crawled = set()
    while urls:
        to_process = {url for url in urls if url not in crawled}
        print('start process urls: ', to_process)
        process_result = process_urls(to_process)
        urls = []
        for url, page in process_result.items():
            crawled.add(url)
            urls += re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', page)

    print('Crawled pages: ', crawled)

The highlight of the example is the process_urls function, which is responsible for creating the ThreadPoolExecutor and "firing" the threads. Obviously the example should be adapted to your needs; the way it is, it will simply walk through every link it finds in front of it and, at the end, add the pages that have already been processed to the crawled set.
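As a side note (not part of the original answer), if the completion order does not matter you can write the same idea more compactly with executor.map, which returns the results in submission order. This sketch assumes the fetch_url function defined in the example above:

from concurrent.futures import ThreadPoolExecutor

def process_urls_map(urls, max_workers=4):
    # Same idea as process_urls, but executor.map yields (url, html) tuples
    # in the order the urls were submitted instead of completion order.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return dict(executor.map(fetch_url, urls))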

Some remarks

  • In MAX_WORKERS (which is the maximum number of threads that will be open at a time) I used a completely arbitrary number; however, good practice is to use the machine's number of CPUs * 2 (which can be obtained via os.cpu_count() * 2). See the sketch after this list.
  • If you plan to do some processing on each URL (and not only collect the links), you can do it inside the function that each thread executes (fetch_url here), because then the pages are processed as they are read and not only after they have all been read (this will give you better performance). A sketch follows this list as well.
  • Before working with threads, try to understand their side effects, i.e. read up on race conditions, locks, etc.
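A small sketch of the first two remarks (not part of the original answer; fetch_and_process is a hypothetical variant of the fetch_url function above, and the regex is the same one used elsewhere in this question):

import os
import re

# Size the thread pool from the machine's CPU count, as suggested above.
MAX_WORKERS = (os.cpu_count() or 1) * 2

def fetch_and_process(url):
    # Hypothetical worker that already extracts the links inside the thread,
    # so each page is processed as soon as it is downloaded.
    url, html = fetch_url(url)
    links = re.findall(r'(?<=href=["\'])https?://.+?(?=["\'])', html)
    return url, links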

Although I wrote the example with threads (for study/learning purposes), if you need a more robust crawler solution I recommend taking a look at Scrapy and avoiding reinventing the wheel (unless it's to figure out how the wheels work).
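For reference, a minimal Scrapy spider that follows the same kind of href links could look roughly like this (an untested sketch, not from the original answer; the spider name, seed URL and selector are just placeholders):

import scrapy

class LinksSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['http://www.g1.globo.com']

    def parse(self, response):
        # Scrapy schedules and downloads requests concurrently for us,
        # so there is no manual thread management at all.
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)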

    
answered 01.08.2016 / 13:18