Concurrent Requests in Scrapy



This tutorial explains how to send concurrent requests in Scrapy.

While undertaking massive scraping jobs that may require hundreds of thousands of requests, the total time required can increase drastically. If your Scrapy spider is taking several hours or even several days, it's time to consider ways to bring that down to a reasonable duration.

One way to accomplish this is "concurrency". In the context of Scrapy, this means sending out "concurrent" requests instead of sending them one by one. In other words, the Scrapy spider sends X requests to the web server simultaneously, rather than waiting for each response before sending the next.


Concurrent Requests

Adding concurrency into Scrapy is actually a very simple task. There is already a setting for the number of concurrent requests allowed, which you just have to modify.

You can choose to modify this in the custom settings of the spider you've made, or in the global settings, which affect all spiders.

Global

To add this globally, just add the following line to your settings file.

CONCURRENT_REQUESTS = 30

We've set the number of concurrent requests to 30. You may use any value you wish, within reason though.

Local

To add this setting locally, we use the custom settings of a Scrapy spider to set its number of concurrent requests.

custom_settings = {
     'CONCURRENT_REQUESTS': 30
}

Additional Settings

There are many additional settings that you can use instead of, or together with CONCURRENT_REQUESTS.

  • CONCURRENT_REQUESTS_PER_DOMAIN – Limits the number of concurrent requests sent to any single domain (default: 8).
  • CONCURRENT_REQUESTS_PER_IP – Limits the number of concurrent requests sent to any single IP address. If set to a nonzero value, it is used instead of the per-domain limit (default: 0, disabled).
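Putting these together, a settings.py sketch might look like the following. The values are illustrative; tune them for the site you're scraping.

```python
# settings.py (sketch) -- illustrative values
CONCURRENT_REQUESTS = 30             # global cap across all requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # cap per domain (Scrapy's default is 8)
CONCURRENT_REQUESTS_PER_IP = 0       # 0 disables the per-IP cap (Scrapy's default)
```

With this configuration, the spider can have up to 30 requests in flight overall, but no more than 8 to any one domain at a time.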

Additional Notes

Note: It's possible that the site you're trying to scrape has a built-in limit on the number of concurrent requests allowed per IP, which can render Scrapy's concurrency settings ineffective. However, there is a way around this: use rotating proxies in Scrapy to get a new IP with each request.

Note: Keep in mind that if you've set a delay between requests, it will reduce the effectiveness of concurrency in Scrapy, since the spider waits out the delay between requests. To mitigate this, set the download delay to a lower value. This will make your bot more noticeable though, so also consider using proxies.
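As a sketch, a small delay combined with concurrency might look like this in settings.py (values illustrative):

```python
# settings.py (sketch) -- a small delay keeps the bot polite without
# stalling concurrent requests entirely
DOWNLOAD_DELAY = 0.25             # seconds to wait between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True   # Scrapy's default; waits 0.5x to 1.5x DOWNLOAD_DELAY
CONCURRENT_REQUESTS = 30
```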


This marks the end of the Concurrent Requests in Scrapy tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
