This tutorial explains how to use rules in Scrapy.
The Web is a large place with all kinds of different components, sections and subsections. Because of its size and complexity, we often need to create a set of rules for our Scrapy spider to follow while scraping.
Utilizing the Rule class in Scrapy has a wide range of benefits. It allows you to add extra functionality to your spider, enhance existing features (like requests) and create new possibilities. We’ll explain this in more detail in the sections below.
Scrapy Rules – Parameters
The Rule class can take many different parameters, each with its own special effect. We’ll explain each one of them here individually.
You are not required to pass all of them when creating a Rule, only the ones you wish to use. The rest will use their default values.
link_extractor
This parameter takes a LinkExtractor object as its value. The LinkExtractor class controls how links are extracted from a page. Using regex or similar patterns, you can allow only links which contain certain words or parts. By default, all links are allowed.
You can learn more about the Link extractor class in a separate tutorial dedicated solely to explaining it.
callback
The value of callback is a callable function that is called upon every response generated from the links extracted by the link extractor. You can pass either the function itself or its name as a string. (Avoid naming it parse, as CrawlSpider uses the parse method internally to implement its own logic.)
cb_kwargs
This parameter takes a dict containing keyword arguments that are to be passed to the callback function. (cb stands for callback, kwargs for keyword arguments.)
follow
This parameter takes a Boolean value of either True or False. Setting it to True will cause each link found in the response to be followed. You can use the DEPTH_LIMIT setting to limit the depth to which these links are followed.
process_links
process_links takes a callable function as its value. This function will be called for every list of links extracted from each response. What you place in the function is up to you, though it is used mostly for filtering purposes.
process_request
Similar to the last parameter, this too takes a callable function as its value. The callable must take the request as the first argument and the response as the second, and it should return a Request object (or None to filter the request out).
The callable will be called on every single request, allowing you to process and customize requests. Examples of such customizations are cookies and user agents.
errback
Another parameter that takes a callable function as its value. The function will be called if an error (exception) is raised while processing a request.
For more information and practical examples, check out our video tutorial on creating rules in Scrapy.
These are a few example Rules picked from our other tutorials, along with a short explanation of each.
class SuperSpider(CrawlSpider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']
    rules = [Rule(LinkExtractor(allow="chapter"), callback='parse_func', follow=True)]
The allow parameter is set to “chapter”, which means that the spider will only follow links that contain the string “chapter” in their URLs.
class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  process_request='request_filter_book', follow=True)]
allowed_domains is a handy setting to ensure that your Scrapy spider doesn’t go scraping domains other than the domain(s) you’re targeting. Without this setting, your spider will follow external links (links which point to other websites) to other domains.
This marks the end of the Scrapy Rules tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial can be asked in the comments section below.