Developing a WebCrawler in Python [closed]

Question

Is there an open source web crawler project, written in Python, that I could study?

I've been studying and researching for some time, but I haven't found anything ready-made. My goal is to learn enough to create an open source crawler with the following features:

Download the HTML of a specific link
Get the content of specific tags, for example: <p>, <h1>
Save the content to a MySQL database

So I would like a starting point for developing this in Python in a simple way. If you have an idea of how to do it (in code), please help!

Note: my Python knowledge is currently basic.

python web-service web-crawler

asked by anonymous 03.11.2015 / 06:23

2 answers

score 3 · Answer 1

There are several. From my personal experience:

Scrapy - for web scraping
mechanize - for web crawling
Selenium WebDriver - for browser automation (when mechanize cannot handle the site, e.g. AJAX or obfuscated code)

Installing the modules is very simple at the command line:

pip install Scrapy (documentation)
pip install mechanize (tutorial)
pip install selenium (documentation)
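
To study the three steps from the question before adopting one of these libraries, a minimal standard-library sketch could look like the following. It is only an illustration under assumptions: `TagTextParser`, `download_html`, `extract_tags`, `save_rows` and the `content` table are hypothetical names, and `sqlite3` stands in for MySQL so the example runs without a database server.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.request import urlopen


class TagTextParser(HTMLParser):
    """Collect the text inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, wanted=("p", "h1")):
        super().__init__()
        self.wanted = set(wanted)
        self.stack = []    # wanted tags that are currently open
        self.results = []  # (tag, text) pairs in closing order

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.stack.append([tag, []])

    def handle_data(self, data):
        # Text inside nested, unwanted tags (e.g. <b>) is still collected.
        if self.stack:
            self.stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            name, parts = self.stack.pop()
            text = "".join(parts).strip()
            if text:
                self.results.append((name, text))


def download_html(url, timeout=10):
    """Step 1: download the HTML of a specific link."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_tags(html, wanted=("p", "h1")):
    """Step 2: get the content of specific tags."""
    parser = TagTextParser(wanted)
    parser.feed(html)
    parser.close()
    return parser.results


def save_rows(conn, url, rows):
    """Step 3: store the content (sqlite3 here, as a stand-in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS content (url TEXT, tag TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO content (url, tag, text) VALUES (?, ?, ?)",
        [(url, tag, text) for tag, text in rows],
    )
    conn.commit()
```

Calling `save_rows(conn, url, extract_tags(download_html(url)))` chains the three steps. This simple parser assumes tags are properly closed, which real pages often violate. With a MySQL driver the overall shape should be similar, but the connection setup and placeholder syntax differ, so check the driver's documentation.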

score 0 · Answer 2

I recommend studying how a request is made: request headers, response headers, the user agent. Understand how the data is transported.
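
One way to get a feel for this is to build a request by hand and look at what comes back. The sketch below uses `urllib.request` from the standard library; `build_request`, `describe_response` and the user agent string are hypothetical names chosen for illustration.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="StudyCrawler/0.1 (learning project)"):
    """Create a GET request that identifies the crawler via User-Agent."""
    return Request(url, headers={"User-Agent": user_agent, "Accept": "text/html"})


def describe_response(response):
    """Summarise the parts of a response worth inspecting while learning."""
    return {
        "status": response.status,
        "final_url": response.geturl(),  # differs from the request URL after a redirect
        "headers": dict(response.headers.items()),
    }


def inspect(url, timeout=10):
    """Send the request and return the summary (performs a real network call)."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return describe_response(response)
```

Printing the result of `inspect(...)` for a few sites shows headers such as `Content-Type` and `Last-Modified`, which is a good starting point for understanding what the server actually sends.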

During development, always debug as much as possible and try to anticipate everything: timeouts, maximum number of requests, redirects. If something goes wrong, your script has to know about it and log it.
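
As a rough sketch of that advice, the function below applies a timeout, caps the number of attempts, notices redirects and logs every failure. The name `fetch`, its parameters and the retry policy are assumptions made for illustration; the `opener` and `sleep` arguments exist only so the function can be exercised without a network.

```python
import logging
import time
from urllib.error import HTTPError
from urllib.request import urlopen

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("crawler")


def fetch(url, timeout=10, max_attempts=3, backoff=1.0, opener=urlopen, sleep=time.sleep):
    """Return the response body as bytes, or None if every attempt failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(url, timeout=timeout) as response:
                final_url = response.geturl()
                if final_url != url:
                    log.info("redirected: %s -> %s", url, final_url)
                return response.read()
        except HTTPError as exc:
            log.warning("HTTP %s for %s (attempt %d/%d)", exc.code, url, attempt, max_attempts)
            if exc.code < 500:
                return None  # client errors such as 404 are unlikely to improve on retry
        except OSError as exc:  # covers URLError, timeouts and connection resets
            log.warning("%s for %s (attempt %d/%d)", exc, url, attempt, max_attempts)
        if attempt < max_attempts:
            sleep(backoff * attempt)  # wait a little longer after each failure
    log.error("giving up on %s after %d attempts", url, max_attempts)
    return None
```

Returning `None` instead of raising is just one possible design; the point is that every failure path leaves a log entry.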

I recommend studying concurrent.futures.ThreadPoolExecutor for asynchronous requests, along with threading.Thread to create database maintenance services and to track trends, such as how long the site takes to be modified. That lets you automatically adjust the interval between requests according to how likely the site is to have changed.
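
A minimal sketch of both ideas, under assumptions: `fetch_all` runs any fetch function over many URLs with `ThreadPoolExecutor`, and `next_interval` is a deliberately naive rule (halve the wait when the page changed, double it when it did not). Both names, the limits and the rule itself are hypothetical, and the background `threading.Thread` services are not shown.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL in a thread pool and collect the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):  # yields futures as they finish
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep one bad URL from stopping the rest
                results[url] = exc
    return results


def next_interval(current, changed, minimum=60.0, maximum=86400.0):
    """Seconds to wait before revisiting a page, clamped to [minimum, maximum]."""
    proposed = current / 2 if changed else current * 2
    return max(minimum, min(maximum, proposed))
```

The `changed` flag could come, for example, from comparing a hash of the new page body with the one stored on the previous visit.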

I recommend link along with XPath and regular expressions to extract the data.
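
As a rough illustration of combining XPath with a regex, here is a sketch using `xml.etree.ElementTree` from the standard library. That choice is a substitution for the sake of a runnable example, not necessarily the library meant above; ElementTree supports only a subset of XPath and needs well-formed markup, so real pages usually call for a more tolerant parser. The sample document, `PRICE` and `extract_items` are made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup.
DOC = """<html><body>
<h1>Price list</h1>
<p class="item">Widget: 12.50 USD</p>
<p class="item">Gadget: 7 USD</p>
<p class="note">Prices exclude tax</p>
</body></html>"""

# The regex pulls structured fields out of the text that XPath selected.
PRICE = re.compile(r"(?P<name>\w+):\s*(?P<amount>\d+(?:\.\d+)?)\s*USD")


def extract_items(markup):
    """Select <p class="item"> nodes with XPath, then parse each with a regex."""
    root = ET.fromstring(markup)
    items = []
    for node in root.findall(".//p[@class='item']"):
        match = PRICE.search("".join(node.itertext()))
        if match:
            items.append((match.group("name"), float(match.group("amount"))))
    return items
```

XPath narrows the document down to the right elements, and the regex handles the free text inside them; using a regex on the whole HTML tends to be far more fragile.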

I hope it helps.