This article explains how to create custom settings in Scrapy.
A large framework like Scrapy has hundreds of different settings that control its scraping behavior. Scrapy gives us the ability to access and change each of these settings to suit our requirements.
In this tutorial we'll be focusing on how to add and adjust these Scrapy settings in a variety of different ways.
Settings File
Before we move on to custom settings, we'll briefly explain the purpose of the settings.py file in your Scrapy project and the difference between local and global settings.
Local settings only affect the Spider in which they are placed. Global settings, once set, affect every Spider within the entire Scrapy project (one project may hold several Spiders). Any setting that you place within the settings.py file is a global setting.
If you want to add a global setting, all you have to do is add the appropriate line anywhere within the settings.py file. Some sample settings are given below.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_DEBUG = True
DOWNLOAD_DELAY = 1
Adding Custom Settings
custom_settings must be created as a class attribute within the Spider class. It's placed alongside other similar attributes like start_urls.
Here’s an example from one of our tutorials.
from scrapy.spiders import CrawlSpider  # import needed for the CrawlSpider base class

class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'

    custom_settings = {
        'AUTOTHROTTLE_ENABLED': True,
        'AUTOTHROTTLE_DEBUG': True,
        'DOWNLOAD_DELAY': 1,
        'DEPTH_LIMIT': 1,
    }
As you can see, there are many different settings that we've added to our custom_settings attribute. All of these settings will only affect this specific Spider, not the other Spiders in the project.
List of Settings in Scrapy
Now, there are obviously too many settings to list and explain here, so we'll only be covering the most important and commonly used ones.
AUTOTHROTTLE_ENABLED
is used to enable or disable the AutoThrottle feature, depending on the Boolean value assigned to it. Its default value is False.
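As a quick sketch, here's how AutoThrottle might be configured in settings.py, along with a few of its related settings (the delay values below are just illustrative choices, not recommendations):
# Example AutoThrottle configuration (values are illustrative)
AUTOTHROTTLE_ENABLED = True       # turn AutoThrottle on
AUTOTHROTTLE_START_DELAY = 5      # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60       # highest delay AutoThrottle may use
AUTOTHROTTLE_DEBUG = True         # log throttling stats for every response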
DEPTH_LIMIT
is used to set the maximum depth to which Scrapy will keep following links.
DOWNLOAD_DELAY
represents the delay (in seconds) that Scrapy waits between consecutive requests to the same website.
CONCURRENT_REQUESTS
determines the maximum number of simultaneous requests that Scrapy will send out.
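For example, a "polite" crawl might combine DOWNLOAD_DELAY and CONCURRENT_REQUESTS in settings.py like this (the exact values are just an assumption; tune them for the site you're scraping):
# Wait 2 seconds between requests to the same website
DOWNLOAD_DELAY = 2
# Allow at most 8 requests to be in progress at the same time
CONCURRENT_REQUESTS = 8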
DOWNLOAD_MAXSIZE
determines the maximum allowed size (in bytes) of a response that Scrapy will download.
ITEM_PIPELINES
is a dictionary of the item pipelines that scraped items (like files or images) are sent through for processing and storage.
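As a rough sketch, ITEM_PIPELINES maps pipeline class paths to priority numbers (lower numbers run first). The project and pipeline names below are made up purely for illustration:
ITEM_PIPELINES = {
    'myproject.pipelines.CleanItemPipeline': 300,   # hypothetical pipeline that cleans item fields
    'myproject.pipelines.SaveItemPipeline': 800,    # hypothetical pipeline that stores the items
}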
USER_AGENT
is a string that identifies the client (typically a browser) used to access a site. You can use this setting to mask the presence of your spider from websites.
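For instance, you could set USER_AGENT in settings.py to a typical browser string (the value below is just an example of a desktop Chrome user agent):
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'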
IMAGES_STORE
is used to determine the directory where Scrapy will store the images it scrapes off the internet. It requires the images pipeline to be enabled in ITEM_PIPELINES as well.
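Here's a minimal sketch of the two settings working together, using Scrapy's built-in ImagesPipeline (the 'images' folder name is arbitrary):
# Enable the built-in images pipeline...
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
# ...and tell Scrapy which folder to save the downloaded images in
IMAGES_STORE = 'images'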
REDIRECT_MAX_TIMES
determines the maximum number of times a single request may be redirected. Once the limit has been reached, the response from the last redirect is returned as is.
This marks the end of the Scrapy (Custom) Settings Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article can be asked in the comments section below.