This tutorial explains how to create Concurrent Requests in Scrapy.
While undertaking massive scraping jobs that require hundreds of thousands of requests, the total time needed can increase drastically. If your Scrapy spider is taking several hours, or maybe even several days, it’s time to consider ways to bring that down to a reasonable amount.
One way to accomplish this is “concurrency”. In the context of Scrapy, this means sending out “concurrent” requests instead of sending them one by one. In other words, the Scrapy spider sends X number of simultaneous requests to the web server, rather than waiting for each response before sending the next.
Concurrent Requests
Adding concurrency to Scrapy is actually a very simple task. There is already a setting for the number of concurrent requests allowed, which you just have to modify.
You can choose to modify this in the custom settings of the spider you’ve made, or in the global settings which affect all spiders.
Global Settings
To add this globally, just add the following line to your settings.py file (anywhere in the file).
CONCURRENT_REQUESTS = 30
We’ve set the number of concurrent requests to 30. You may use any value that you wish, within a reasonable limit though. For reference, the default value of this setting is 16 (the related per-domain setting defaults to 8).
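If you just want to experiment with different values, you can also override the setting for a single run with Scrapy’s -s command line option, without touching settings.py. A minimal sketch, assuming a spider named Book:

scrapy crawl Book -s CONCURRENT_REQUESTS=30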
Local Settings
To add the setting locally, we use the custom_settings attribute of our Scrapy spider. Shown below is a simple example.
import scrapy

class BookSpider(scrapy.Spider):
    name = 'Book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # Spider-level settings, which override the project-wide settings.py
    custom_settings = {
        'CONCURRENT_REQUESTS': 30
    }

    def parse(self, response):
        pass
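To actually benefit from concurrency, the parse method has to yield multiple requests, so that Scrapy has something to send in parallel. Below is a rough sketch of what that might look like; the CSS selectors are assumptions based on the books.toscrape.com markup, so verify them before relying on this.

import scrapy

class BookSpider(scrapy.Spider):
    name = 'Book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    custom_settings = {
        'CONCURRENT_REQUESTS': 30
    }

    def parse(self, response):
        # Every book link yields a request; with CONCURRENT_REQUESTS = 30,
        # up to 30 of these can be in flight at the same time.
        for href in response.css('article.product_pod h3 a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_book)

        # Follow pagination so the request queue stays full.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_book(self, response):
        # Assumed selector for the book title on the detail page.
        yield {'title': response.css('div.product_main h1::text').get()}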
Additional Settings
There are many additional settings that you can use instead of, or together with CONCURRENT_REQUESTS.
CONCURRENT_REQUESTS_PER_DOMAIN – Defines the maximum number of concurrent requests allowed for each individual domain that you wish to scrape (the default is 8).
CONCURRENT_REQUESTS_PER_IP – Sets the number of concurrent requests per IP address. Useful when you are using multiple IPs. If this is set to a non-zero value, it is used in place of the per-domain limit.
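As a rough sketch, here is how these might be combined in settings.py (the values are placeholders, not recommendations):

CONCURRENT_REQUESTS = 30             # global ceiling across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 10  # at most 10 simultaneous requests per domain
CONCURRENT_REQUESTS_PER_IP = 0       # non-zero would override the per-domain limit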
Additional Notes
Note: It’s possible that the site you’re trying to scrape has a limit built in for the number of concurrent requests allowed per IP, which negates the Scrapy concurrency settings. However, there is a way to get around this. All you have to do is use rotating proxies in Scrapy to get a new IP with each request, as sketched below.
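Here is a minimal sketch of what such a rotation could look like, using a custom downloader middleware. The proxy URLs and the myproject.middlewares module path are hypothetical placeholders; Scrapy’s built-in HttpProxyMiddleware will pick up the proxy meta key set here.

import random

class RotatingProxyMiddleware:
    # Hypothetical pool of proxy servers; replace with real ones.
    PROXIES = [
        'http://proxy1.example.com:8000',
        'http://proxy2.example.com:8000',
    ]

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware reads this meta key
        # and routes the request through the chosen proxy.
        request.meta['proxy'] = random.choice(self.PROXIES)

Enable it in settings.py with a priority below 750, so it runs before the built-in proxy middleware:

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotatingProxyMiddleware': 350,
}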
Note: Keep in mind that if you’ve created a delay between requests (with the DOWNLOAD_DELAY setting), this may reduce the effectiveness of concurrent requests in Scrapy by putting delays between them. To fix this problem, set the download delay to a lower value. This will make your bot more noticeable though, so also consider using proxies.
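For instance, a spider-level sketch of the trade-off (the values here are illustrative only):

custom_settings = {
    'CONCURRENT_REQUESTS': 30,
    # A small delay keeps some politeness without serializing the crawl;
    # a large value (e.g. 2 or more) would mostly cancel out the
    # concurrency for any single domain.
    'DOWNLOAD_DELAY': 0.25,
}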
This marks the end of the Concurrent Requests in Scrapy tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.