Creating a delay between requests in Scrapy

This tutorial explains how to create a download delay between requests in Scrapy.

Scraping a website without any regard for the load you place on it can have consequences. Even if the site owner doesn’t mind the data being scraped, the extra load from your Spider can slow the site down, possibly resulting in an IP ban for your Scrapy application, and by extension, you.

Luckily, Scrapy is a mature framework that is well equipped to deal with such scenarios. We’ll explain how to handle them in this Scrapy tutorial.


Scraping the Web politely

As we mentioned earlier, just letting your Spider loose on websites can get your IP banned. You may not experience this in your early stages, either because your Spiders were too small-scale or because you were scraping sites that were built to be scraped. However, once you begin building advanced crawlers (spiders), this issue becomes very real.

The simple solution is to create a delay or “gap” between the requests that your Scrapy spider sends to the website. This prevents the Spider from overloading the site with back-to-back requests.

The main reason websites detect and ban bots is that bots overload and slow down the site. We can easily create these delays with the DOWNLOAD_DELAY setting in Scrapy.


Requests Delay Example

The DOWNLOAD_DELAY setting accepts an integer or float value, measured in seconds. For instance, if you were to assign it a value of 2, Scrapy would wait 2 seconds between consecutive requests to the same website.

Keep in mind that just because the download delay is 2, it doesn’t mean that Scrapy will complete 30 requests in a minute. Besides the download delay, there are other factors, such as latency and the time taken to download each response. Take these factors into consideration when choosing a value for DOWNLOAD_DELAY.

Below is a short code example from one of our tutorials that uses the DOWNLOAD_DELAY setting.
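
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified model in which requests are handled one at a time, so each request costs the delay plus an average response time. The function name and the 0.5 second response time are made up for illustration, and this is not how Scrapy schedules requests internally.

```python
def estimated_requests_per_minute(download_delay, avg_response_time):
    """Rough estimate of sequential requests per minute to one website.

    Assumes each request costs the delay plus the average response time.
    """
    seconds_per_request = download_delay + avg_response_time
    return 60.0 / seconds_per_request


# With no response time at all, a delay of 2 gives the "ideal" 30 requests.
print(estimated_requests_per_minute(2, 0))    # 30.0

# With an assumed 0.5 second average response time, the rate drops.
print(estimated_requests_per_minute(2, 0.5))  # 24.0
```

Under these assumptions, even a modest response time noticeably lowers the number of requests completed per minute.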

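from scrapy.spiders import CrawlSpider


class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'

    # Settings that apply only to this spider
    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # wait 1 second between requests to the same website
    }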

All we’ve done is insert the DOWNLOAD_DELAY setting into custom_settings, allowing it to take effect for this specific spider. In other words, it’s a local setting, since it doesn’t affect other spiders in the project.

If you want to learn more about the difference between local and global settings, as well as how to apply settings globally by modifying the settings file, read this Scrapy settings tutorial.
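
For comparison, a project-wide (global) delay would go in the project’s settings.py file instead. The snippet below is a minimal sketch; the value of 2 is an arbitrary example, not a recommendation.

```python
# settings.py (the project's settings module)

# Wait 2 seconds between consecutive requests to the same website.
# This applies to every spider in the project unless a spider
# overrides it through its own custom_settings.
DOWNLOAD_DELAY = 2
```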


Other settings

DOWNLOAD_DELAY is just one of the “delay” settings for requests in Scrapy. This section covers a closely related one.

RANDOMIZE_DOWNLOAD_DELAY: If enabled (which is the default), Scrapy will wait a random amount of time between requests to the same website. The wait is a random value between 0.5 and 1.5 times DOWNLOAD_DELAY. If DOWNLOAD_DELAY is set to 0, this setting has no effect.
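
The formula can be illustrated with a few lines of plain Python. This is only a sketch of the calculation described above, not Scrapy’s own code, and the function name randomized_delay is made up for this example.

```python
import random


def randomized_delay(download_delay):
    """Pick a random wait between 0.5 and 1.5 times the download delay."""
    return random.uniform(0.5 * download_delay, 1.5 * download_delay)


# With DOWNLOAD_DELAY = 2, every wait falls between 1 and 3 seconds.
samples = [randomized_delay(2) for _ in range(1000)]
print(min(samples), max(samples))

# With DOWNLOAD_DELAY = 0, the range collapses to zero, so there is no delay.
print(randomized_delay(0))
```

Varying the gap this way makes the request pattern look less mechanical than a fixed interval would.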


Alternate Techniques

The DOWNLOAD_DELAY setting is just one of many techniques available to mask the presence of your Scrapy Spider. Below, we briefly describe other useful techniques that can be used in combination with, or as alternatives to, the DOWNLOAD_DELAY setting.

AutoThrottle: Coming up with the optimal delay between requests can be a pretty troublesome task. Luckily, the AutoThrottle extension in Scrapy automatically adjusts the delay based on factors like response latency and the load on both the website and your Scrapy server.

User Agents: By default, Scrapy identifies itself as a Scrapy spider in the User-Agent header it sends when crawling a website. You can mask the presence of your Scrapy spider by changing the USER_AGENT setting to the user agent string of your web browser. That way the website will think it’s your browser accessing it, not Scrapy.
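
As a rough sketch, both techniques can be switched on through settings. The dictionary name POLITE_SETTINGS is hypothetical, and the delay values and browser string are placeholder examples; the setting names themselves are standard Scrapy settings.

```python
# Hypothetical settings dictionary; assign it to a spider's custom_settings
# or copy the entries into settings.py as plain assignments.
POLITE_SETTINGS = {
    # Let the AutoThrottle extension adjust the delay automatically.
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 5,            # initial delay in seconds
    'AUTOTHROTTLE_MAX_DELAY': 60,             # upper limit when latency is high
    'AUTOTHROTTLE_TARGET_CONCURRENCY': 1.0,   # average parallel requests per site

    # AutoThrottle does not set a delay lower than DOWNLOAD_DELAY.
    'DOWNLOAD_DELAY': 1,

    # Placeholder browser string; replace it with your own browser's user agent.
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0',
}
```

In the spider shown earlier, this would replace the existing dictionary, for example custom_settings = POLITE_SETTINGS.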


This marks the end of the Scrapy Requests Delay tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
