Developing a WebCrawler in Python [closed]

1

Is there an open source web crawler project, written in Python, that I can study?

I have been researching this for some time, but I have not found anything ready-made. My goal is to learn enough to create an open source crawler with the following features:

  • Download the HTML of a specific link
  • Get the content of specific tags, for example: <p>, <h1>
  • Save the content to a MySQL database
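
A minimal sketch of those three steps, using only the standard library. Several things here are assumptions and not part of the question: the names TagTextExtractor, download_html, extract_tags and save_rows are hypothetical, the page is assumed to be UTF-8, and sqlite3 stands in for MySQL so the example runs without a database server.

```python
import sqlite3
import urllib.request
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside a chosen set of tags (e.g. p, h1)."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.open_tag = None   # wanted tag we are currently inside, if any
        self.buffer = []       # text pieces of the current element
        self.results = []      # collected (tag, text) tuples

    def handle_starttag(self, tag, attrs):
        if self.open_tag is None and tag in self.wanted_tags:
            self.open_tag = tag
            self.buffer = []

    def handle_data(self, data):
        if self.open_tag is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.open_tag:
            text = "".join(self.buffer).strip()
            if text:
                self.results.append((tag, text))
            self.open_tag = None


def download_html(url, timeout=10):
    """Download a page and return it as text (assumes UTF-8 content)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_tags(html, wanted_tags=("p", "h1")):
    """Return a list of (tag, text) tuples for the wanted tags."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return parser.results


def save_rows(connection, url, rows):
    """Store the extracted rows; sqlite3 is a stand-in for MySQL here."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS page_content (url TEXT, tag TEXT, content TEXT)"
    )
    connection.executemany(
        "INSERT INTO page_content (url, tag, content) VALUES (?, ?, ?)",
        [(url, tag, text) for tag, text in rows],
    )
    connection.commit()
```

To try it, call download_html on a URL, pass the result to extract_tags, and hand the rows to save_rows together with sqlite3.connect("crawler.db"). With a MySQL driver the overall shape should stay the same, but the connection setup changes and the placeholders are usually written %s instead of ?.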

So I would like a starting point for developing this in Python in a simple way. If you have an idea of how to do it (in code), please help!

Note: my Python knowledge is currently basic.

    
asked by anonymous 03.11.2015 / 06:23

2 answers

3

There are several, from my personal experience:

Installing the modules is very simple at the command line:

03.11.2015 / 12:16
0

I recommend studying how a request is made: the request headers, the response headers, the user agent, and how the data is transported.
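
To see those pieces in code, here is a small sketch with the standard urllib.request module. The user agent string and the function names build_request and describe_response are made up for illustration.

```python
import urllib.request

# Hypothetical identifier for the crawler; pick one that describes your project.
USER_AGENT = "StudyCrawler/0.1 (learning project)"


def build_request(url):
    """Create a GET request that carries explicit request headers."""
    return urllib.request.Request(
        url,
        headers={"User-Agent": USER_AGENT, "Accept": "text/html"},
    )


def describe_response(request, timeout=10):
    """Send the request and return (status code, response headers as a dict)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, dict(response.headers.items())


request = build_request("http://example.com/")
print(request.get_method())    # HTTP method that will be used
print(request.header_items())  # request headers that will be sent
```

Calling describe_response(request) performs the actual network request, so you can compare the headers you sent with the ones the server returned.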

During development, debug as thoroughly as you can and try to anticipate everything: timeouts, maximum number of requests, redirects. If something goes wrong, your script has to notice it and log it.
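
One possible way to handle that, as a sketch only: a fetch function with a timeout, a limited number of attempts, and logging of failures and redirects. The function name fetch, its parameters, and the logger name "crawler" are assumptions; the opener parameter exists only so the function can be exercised without a network.

```python
import logging
import time
import urllib.error
import urllib.request

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("crawler")


def fetch(url, opener=urllib.request.urlopen, timeout=10, max_attempts=3, pause=1.0):
    """Return the page body as bytes, or None when every attempt failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                final_url = response.geturl()
                if final_url != url:
                    # urlopen follows redirects itself; record where we ended up
                    log.info("redirected: %s -> %s", url, final_url)
                return response.read()
        except urllib.error.HTTPError as error:
            # The server answered, but with an error status such as 404 or 500
            log.warning("HTTP %s for %s (attempt %d/%d)",
                        error.code, url, attempt, max_attempts)
        except OSError as error:
            # Covers URLError, timeouts and other connection problems
            log.warning("request failed for %s: %s (attempt %d/%d)",
                        url, error, attempt, max_attempts)
        if attempt < max_attempts:
            time.sleep(pause)  # wait a little before trying again
    log.error("giving up on %s after %d attempts", url, max_attempts)
    return None
```

Returning None keeps the caller simple; raising an exception after the last attempt would be an equally reasonable choice.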

I recommend studying concurrent.futures.ThreadPoolExecutor for asynchronous requests, along with threading.Thread to create database maintenance services and to track trends, such as how long the site took to be modified, so that the interval between requests is adjusted automatically according to the probability that the site has changed.
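
A rough sketch of both ideas. crawl_many runs a fetch function over several URLs with ThreadPoolExecutor, and next_interval adjusts the revisit interval. The names, the halve/double rule, and the minimum and maximum values are assumptions chosen for illustration, not a recommended policy.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_many(urls, fetch_page, max_workers=4):
    """Run fetch_page(url) for every URL in a thread pool; return {url: result}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_page, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:
                # One failing page must not stop the others; keep the error
                results[url] = error
    return results


def next_interval(current, changed, minimum=60.0, maximum=86400.0):
    """Halve the revisit interval (seconds) if the page changed, else double it."""
    proposed = current / 2 if changed else current * 2
    return max(minimum, min(maximum, proposed))
```

For example, crawl_many(urls, fetch_page) returns a dictionary you can inspect per URL, and next_interval(600, changed=True) gives the shorter interval to use for a page that was modified since the last visit.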

I recommend link along with XPath and regex to extract the data.
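
To illustrate combining XPath with a regex, here is a sketch that uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath and only well-formed markup. It is a stand-in chosen so the example runs without installing anything; real-world HTML usually needs a more tolerant parser. The sample markup, the PRICE pattern and the function extract_prices are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup
SAMPLE = """<html><body>
<h1>Price list</h1>
<p class="item">Keyboard: 49.90 USD</p>
<p class="item">Mouse: 19.50 USD</p>
<p class="note">Prices may change.</p>
</body></html>"""

# Regex that captures a price such as 49.90 followed by the currency
PRICE = re.compile(r"(\d+\.\d{2}) USD")


def extract_prices(markup):
    """Return (item name, price) pairs from <p class="item"> elements."""
    root = ET.fromstring(markup)
    pairs = []
    # XPath selects the elements; the regex pulls the number out of the text
    for node in root.findall(".//p[@class='item']"):
        text = "".join(node.itertext())
        match = PRICE.search(text)
        if match:
            name = text.split(":", 1)[0].strip()
            pairs.append((name, float(match.group(1))))
    return pairs
```

The split of work is the point: XPath narrows the document down to the right elements, and the regex only has to deal with a short piece of text.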

I hope it helps.

    
07.11.2015 / 01:04