How to AutoThrottle a Scrapy Spider

This tutorial explains how to implement AutoThrottle in Scrapy.

Once you’ve moved past the Scrapy Basics, the next step is to begin learning advanced Scrapy techniques to improve your Scrapy bot. One of these many techniques is the AutoThrottle feature in Scrapy.


Why use AutoThrottle?

Crawling the web without any delay between requests can put great stress on websites, especially smaller ones. Even if a website doesn’t mind you extracting data from it, its owners will get upset if your Scrapy spider is slowing the site down. In such cases, it’s almost always a good idea to keep the AutoThrottle setting on.

The goal of AutoThrottle is to automatically adjust Scrapy to the ideal crawling speed, so the user doesn’t have to keep tuning the download delay by hand to find the optimal one. This means that it may increase or decrease the delay between requests as the crawl progresses.

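AutoThrottle bases its decisions on download latency, the time between sending a request and receiving its response. Latency acts as a proxy for the website’s traffic and load: when responses slow down, AutoThrottle increases the delay so the website isn’t swarmed with requests, and when they speed up, it reduces the delay so our Spider is still able to move reasonably quickly.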
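
To make the adjustment concrete, here is a minimal sketch of the update rule as described in the Scrapy documentation. It is not Scrapy’s actual code: the function name next_delay and its parameters are made up for illustration, and the real extension may differ in its details.

```python
def next_delay(previous_delay, latency, status=200,
               target_concurrency=1.0, min_delay=0.0, max_delay=60.0):
    """Return the download delay (in seconds) to use after one response.

    Hypothetical helper for illustration only; not part of Scrapy.
    """
    # Delay that would keep roughly `target_concurrency` requests in flight.
    target_delay = latency / target_concurrency
    # Move halfway from the previous delay towards the target delay.
    new_delay = (previous_delay + target_delay) / 2.0
    # Stay between the minimum (DOWNLOAD_DELAY) and the maximum
    # (AUTOTHROTTLE_MAX_DELAY).
    new_delay = max(min_delay, min(max_delay, new_delay))
    # Non-200 responses (often fast error pages) must not lower the delay.
    if status != 200 and new_delay < previous_delay:
        return previous_delay
    return new_delay


print(next_delay(5.0, 1.0))              # 3.0
print(next_delay(5.0, 1.0, status=503))  # 5.0
```

In the first call a one-second latency pulls a five-second delay down to three seconds. In the second call the same latency comes from an error response, so the delay stays where it was.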


How to use AutoThrottle in Scrapy

There are two different ways in which you can enable the AutoThrottle setting in your Scrapy Spider(s). You can either enable AutoThrottle globally for all Spiders in a project, or enable it for each Spider individually (locally).

Global

To insert a global setting for your Scrapy spiders, go to the settings.py file and insert the following line.

AUTOTHROTTLE_ENABLED = True

Now all the spiders in your Scrapy project will have AutoThrottle enabled. Keep in mind, however, that local settings override global settings. If there is a conflict between the two, the local setting will be used.

Local

With the Scrapy custom_settings feature, you can change the settings for a specific Spider without affecting the others. This code is meant to be inserted in the body of the spider class, at the same level as start_urls.

custom_settings = {
    "AUTOTHROTTLE_ENABLED": False,
}

Remember, even if you’ve enabled AutoThrottle globally, you can use the above code to disable it for a specific Spider that needs it turned off.
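
The precedence between global and local settings can be pictured with two plain dictionaries. This is only an illustration: Scrapy resolves settings through its own priority mechanism, and the names project_settings, spider_custom_settings and effective are assumptions made up for this sketch.

```python
# Values as they might appear in settings.py (project-wide).
project_settings = {"AUTOTHROTTLE_ENABLED": True, "DOWNLOAD_DELAY": 0}

# Values as they might appear in one spider's custom_settings.
spider_custom_settings = {"AUTOTHROTTLE_ENABLED": False}

# Later entries win, so the spider's values replace the project's values.
effective = {**project_settings, **spider_custom_settings}

print(effective["AUTOTHROTTLE_ENABLED"])  # False
print(effective["DOWNLOAD_DELAY"])        # 0
```

Only the keys the spider defines are replaced. Everything else falls through from the project-wide settings.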

Scrapy AutoThrottle Example

The code below is from one of our example spiders.

from scrapy.spiders import CrawlSpider


class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'

    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_DEBUG': True,
        'DOWNLOAD_DELAY': 1,
        'DEPTH_LIMIT': 1,
    }

AUTOTHROTTLE_DEBUG is a useful setting that logs a lot of helpful information during execution, such as the latency and the current download delay for every response received. DEPTH_LIMIT prevents the Scrapy spider from following links beyond a depth of one.

The DOWNLOAD_DELAY setting here ensures that at least one second is kept between consecutive requests to the same website. AutoThrottle treats it as a lower bound: even if the website responds very quickly, the delay is never adjusted below DOWNLOAD_DELAY. Whether you add it or not depends on your situation and the website you are scraping.
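
The effect of DOWNLOAD_DELAY as a floor can be seen in a small simulation. This is a rough sketch that assumes the averaging rule described in the Scrapy documentation and only successful (200) responses. The helper simulate_delays is hypothetical and not part of Scrapy.

```python
def simulate_delays(latencies, start_delay=5.0, download_delay=1.0,
                    max_delay=60.0, target_concurrency=1.0):
    """Return the delay chosen after each response (hypothetical helper)."""
    delay = start_delay
    history = []
    for latency in latencies:
        # Delay that would keep `target_concurrency` requests in flight.
        target = latency / target_concurrency
        # Average the current delay with the target delay.
        delay = (delay + target) / 2.0
        # Never go below DOWNLOAD_DELAY or above the maximum delay.
        delay = max(download_delay, min(max_delay, delay))
        history.append(delay)
    return history


# A fast website answering every request in a quarter of a second.
print(simulate_delays([0.25, 0.25, 0.25, 0.25]))
# [2.625, 1.4375, 1.0, 1.0]
```

The delay shrinks as long as the website keeps responding quickly, but it stops at the one-second floor instead of continuing down towards 0.25.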


AutoThrottle Settings

There are quite a few different (useful) settings for AutoThrottle besides the ENABLED and DEBUG ones we discussed above.

AUTOTHROTTLE_START_DELAY sets the initial download delay in seconds, which is the delay the spider starts with before AutoThrottle has any latency measurements to work from. It accepts any non-negative number, for example 1.0, 2.5 or 5.0, and defaults to 5.0.

AUTOTHROTTLE_MAX_DELAY sets the maximum amount of time, in seconds, that AutoThrottle will wait before sending another request, no matter how slow the website becomes. Typical values are 7.5, 10 or 20, and the default is 60.0.

AUTOTHROTTLE_TARGET_CONCURRENCY sets the average number of requests to be sent in parallel to the target website. Higher values crawl more aggressively and lower values are more polite. It has a default value of 1.0.
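
Put together, a settings.py fragment using these options might look like the sketch below. Apart from AUTOTHROTTLE_ENABLED, the values shown are what we understand to be Scrapy’s defaults, so treat them as a starting point and check the documentation for your Scrapy version.

```python
# settings.py (fragment)
AUTOTHROTTLE_ENABLED = True            # AutoThrottle is off unless enabled
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound on the delay, in seconds
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per website
AUTOTHROTTLE_DEBUG = False             # set to True to log throttling stats
```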


Alternatives to Scrapy AutoThrottle

It’s possible that in very large scraping jobs, or on certain websites, AutoThrottle will significantly increase the time required for the scrape to complete. If speed really matters in such a case, you can disable AutoThrottle for that specific Spider and set a fixed delay with the DOWNLOAD_DELAY setting instead.
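
As a rough sketch, the local settings for such a Spider could look like this. The 0.5 second delay is only an example value chosen for illustration, not a recommendation for any particular website.

```python
# Place this dict in the body of the spider class, next to start_urls.
custom_settings = {
    "AUTOTHROTTLE_ENABLED": False,  # turn adaptive throttling off for this spider
    "DOWNLOAD_DELAY": 0.5,          # fixed base delay in seconds (example value)
}
```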


This marks the end of the Scrapy AutoThrottle Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
