You can use the AutoThrottle extension, which attempts to optimize crawl speed based on estimates of both the load on the server being crawled and the load on Scrapy itself.
Using this extension (code here), you set an upper bound via CONCURRENT_REQUESTS_PER_IP (or CONCURRENT_REQUESTS_PER_DOMAIN), and the actual limits are adjusted dynamically according to performance measured at run time; the throttling algorithm takes download latency into account.
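A minimal sketch of what this looks like in a project's settings.py (the setting names are Scrapy's documented ones; the numeric values are illustrative placeholders, not recommendations):

    # settings.py -- enable AutoThrottle and cap concurrency per IP.
    # All numbers below are illustrative; tune them for your target sites.

    AUTOTHROTTLE_ENABLED = True             # turn the extension on
    AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay (seconds)
    AUTOTHROTTLE_MAX_DELAY = 60.0           # ceiling for the dynamic delay
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server
    AUTOTHROTTLE_DEBUG = True               # log each adjusted delay while tuning

    # Hard upper bound that AutoThrottle will never exceed:
    CONCURRENT_REQUESTS_PER_IP = 4
    # (or use CONCURRENT_REQUESTS_PER_DOMAIN if that fits your crawl better)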
Otherwise, to find a good configuration you will have to experiment with different combinations of concurrent request limits per IP/domain and download delays, while watching CPU load.
It is hard to give a general recipe for doing this manually, because it depends heavily on the kind of crawl you are running. For example, if you are crawling several different sites, you may want different settings for each. If you are crawling a single site, you have to respect that site's request limits. And so on; each situation has to be analyzed separately.
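For the several-sites case, Scrapy lets you override project settings per spider through the custom_settings class attribute. A sketch of that approach, with made-up spider names and placeholder URLs:

    import scrapy

    class GentleSiteSpider(scrapy.Spider):
        """Hypothetical spider for a site with strict rate limits."""
        name = "gentle_site"                       # made-up name
        start_urls = ["https://slow.example.com"]  # placeholder URL
        custom_settings = {
            "DOWNLOAD_DELAY": 2.0,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        }

        def parse(self, response):
            pass  # extraction logic goes here

    class FastSiteSpider(scrapy.Spider):
        """Hypothetical spider for a site that tolerates heavier traffic."""
        name = "fast_site"                         # made-up name
        start_urls = ["https://fast.example.com"]  # placeholder URL
        custom_settings = {
            "DOWNLOAD_DELAY": 0.25,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        }

        def parse(self, response):
            pass  # extraction logic goes here

Scrapy merges each spider's custom_settings over the project-wide settings, so the rest of the configuration stays shared.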
Many sites impose a maximum number of requests per IP within a given time window, so it usually makes sense to set CONCURRENT_REQUESTS_PER_IP and DOWNLOAD_DELAY so that you stay within those limits.
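For instance, suppose a site allows roughly 60 requests per minute per IP (an assumed figure for illustration). That is one request per second, so staying slightly below it might look like:

    # settings.py -- hypothetical site allowing ~60 requests/minute per IP.
    # 60 req/min == 1 req/s, so a delay just above 1 second adds headroom.

    DOWNLOAD_DELAY = 1.2            # seconds between requests to the same slot
    CONCURRENT_REQUESTS_PER_IP = 1  # one in-flight request per remote IP

    # RANDOMIZE_DOWNLOAD_DELAY (True by default) jitters the delay between
    # 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY, which looks less robotic.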