You can use the AutoThrottle extension, which attempts to optimize crawl speed based on estimates of both the load on the server being crawled and the load on Scrapy itself.
Using this extension (code here), you set an upper bound via CONCURRENT_REQUESTS_PER_IP (or CONCURRENT_REQUESTS_PER_DOMAIN), and the actual limits are adjusted dynamically according to performance measured at run time; the throttling algorithm takes download latency into account.
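A minimal sketch of what this looks like in a project's settings.py (the setting names are Scrapy's documented ones; the numeric values are illustrative placeholders, not recommendations):

    # settings.py -- enable AutoThrottle and cap concurrency per IP.
    # All numbers below are illustrative; tune them for your target sites.

    AUTOTHROTTLE_ENABLED = True             # turn the extension on
    AUTOTHROTTLE_START_DELAY = 5.0          # initial download delay (seconds)
    AUTOTHROTTLE_MAX_DELAY = 60.0           # ceiling for the dynamic delay
    AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0   # average parallel requests per server
    AUTOTHROTTLE_DEBUG = True               # log each adjusted delay while tuning

    # Hard upper bound that AutoThrottle will never exceed:
    CONCURRENT_REQUESTS_PER_IP = 4
    # (or use CONCURRENT_REQUESTS_PER_DOMAIN if that fits your crawl better)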
Otherwise, to find a good configuration you will have to experiment with different combinations of concurrent request limits per IP/domain and download delays, while watching CPU load.
It is hard to give a general recipe for doing this manually, because it depends heavily on the kind of crawl you are running. For example, if you are crawling several different sites, you may want different settings for each. If you are crawling a single site, you have to respect that site's request limits. And so on; each situation has to be analyzed separately.
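For the several-sites case, Scrapy lets you override project settings per spider through the custom_settings class attribute. A sketch of that approach, with made-up spider names and placeholder URLs:

    import scrapy

    class GentleSiteSpider(scrapy.Spider):
        """Hypothetical spider for a site with strict rate limits."""
        name = "gentle_site"                       # made-up name
        start_urls = ["https://slow.example.com"]  # placeholder URL
        custom_settings = {
            "DOWNLOAD_DELAY": 2.0,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
        }

        def parse(self, response):
            pass  # extraction logic goes here

    class FastSiteSpider(scrapy.Spider):
        """Hypothetical spider for a site that tolerates heavier traffic."""
        name = "fast_site"                         # made-up name
        start_urls = ["https://fast.example.com"]  # placeholder URL
        custom_settings = {
            "DOWNLOAD_DELAY": 0.25,
            "CONCURRENT_REQUESTS_PER_DOMAIN": 8,
        }

        def parse(self, response):
            pass  # extraction logic goes here

Scrapy merges each spider's custom_settings over the project-wide settings, so the rest of the configuration stays shared.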
Many sites impose a maximum number of requests per IP within a given time window, so it usually makes sense to set CONCURRENT_REQUESTS_PER_IP and DOWNLOAD_DELAY so that you stay within those limits.
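For instance, suppose a site allows roughly 60 requests per minute per IP (an assumed figure for illustration). That is one request per second, so staying slightly below it might look like:

    # settings.py -- hypothetical site allowing ~60 requests/minute per IP.
    # 60 req/min == 1 req/s, so a delay just above 1 second adds headroom.

    DOWNLOAD_DELAY = 1.2            # seconds between requests to the same slot
    CONCURRENT_REQUESTS_PER_IP = 1  # one in-flight request per remote IP

    # RANDOMIZE_DOWNLOAD_DELAY (True by default) jitters the delay between
    # 0.5 * DOWNLOAD_DELAY and 1.5 * DOWNLOAD_DELAY, which looks less robotic.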