Using Rules in Scrapy



This tutorial explains how to use rules in Scrapy.

The Web is a large place with all kinds of different components, sections and subsections. Because of its size and complexity, we often need to create a set of Rules for our Scrapy spider to follow while scraping.

Utilizing the Rule class in Scrapy has a wide range of benefits. It allows you to add extra functionality to your Spider, fine-tune how requests are made and followed, and filter which links get crawled. We’ll be explaining this in more detail in the sections below.


Scrapy Rules – Parameters

The Rule class can take many different parameters, each with its own special effect. We’ll be explaining each one of them here individually.

You are not required to pass all of them when using the Rule class, only the ones you wish to use. The rest will fall back to their default values.
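The examples throughout this tutorial assume the standard imports: Rule and CrawlSpider live in scrapy.spiders, and LinkExtractor in scrapy.linkextractors.

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
```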

link_extractor

This parameter takes a LinkExtractor object as its value. The LinkExtractor class controls how links are extracted from a page: using regular expressions, you can allow or deny links that contain certain words or path segments. By default, all links are allowed.

You can learn more about the LinkExtractor class in a separate tutorial dedicated solely to explaining it.
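To make the allow/deny semantics concrete, here is a small illustration of the kind of check LinkExtractor performs (this sketch mimics the behavior for demonstration, it is not Scrapy's actual implementation):

```python
import re

# Illustration only: allow/deny are regular expressions matched against
# each extracted URL. A URL is kept if it matches allow (when given)
# and does not match deny (when given).
def url_passes(url, allow=None, deny=None):
    if deny and re.search(deny, url):
        return False
    if allow and not re.search(allow, url):
        return False
    return True
```

For example, `allow="page/"` keeps pagination links while `deny="tag/"` drops tag pages.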

callback

The value of callback is a callable that is called on the response of every link extracted by the link extractor. Pass the name of your spider method as a string (e.g. callback='parse_func'). Note that CrawlSpider uses the parse method internally to implement its own logic, so avoid naming your callback parse.
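A minimal sketch of a custom callback (the names here are hypothetical). In the spider you would wire it up with something like `rules = [Rule(LinkExtractor(allow="chapter"), callback='parse_func')]`, and Scrapy calls the method with the response for each extracted link:

```python
# Hypothetical callback: Scrapy passes in the response object for each
# link the rule extracted; build and return/yield items from it.
def parse_func(self, response):
    return {"url": response.url}
```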

cb_kwargs

This parameter takes a dict that contains the keyword arguments that are to be passed to the callback function. (cb stands for callback, kwargs for keyword arguments)
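A hedged sketch of cb_kwargs in action (names hypothetical): the dict passed as cb_kwargs, e.g. `Rule(LinkExtractor(allow='page/'), callback='parse_page', cb_kwargs={"section": "quotes"})`, is forwarded to the callback as extra keyword arguments alongside the response.

```python
# Hypothetical callback: "section" arrives via cb_kwargs, not from the
# response itself, letting one callback serve several rules.
def parse_page(self, response, section):
    return {"url": response.url, "section": section}
```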

follow

This is a parameter that takes a Boolean value of either True or False. Setting it to True causes every link found in the response to be followed. If callback is None, follow defaults to True; otherwise it defaults to False. You can use the DEPTH_LIMIT setting to limit the depth to which these links are followed.
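A sketch of follow combined with DEPTH_LIMIT (the spider name is hypothetical): follow=True keeps extracting links from the pages it reaches, while DEPTH_LIMIT caps how many hops from start_urls the crawl may go.

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DepthCappedSpider(CrawlSpider):
    name = 'depth_capped'  # hypothetical name
    start_urls = ['http://quotes.toscrape.com/']
    # Stop following links more than two hops away from start_urls
    custom_settings = {"DEPTH_LIMIT": 2}
    rules = [Rule(LinkExtractor(allow='page/'), follow=True)]
```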

process_links

process_links takes a callable as its value. This callable is called once for the list of links extracted from each response. What you place in the function is up to you, though it is mostly used for filtering purposes.
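A minimal filtering sketch (the function name is our own, hypothetical): the hook receives the full list of Link objects from one response and returns the subset to keep. Wiring: `Rule(LinkExtractor(), process_links='drop_logout_links', follow=True)`.

```python
# Hypothetical process_links hook: each Link has a .url attribute;
# drop anything the crawler should never touch.
def drop_logout_links(self, links):
    return [link for link in links if "logout" not in link.url]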

process_request

Like the previous parameter, this too takes a callable as its value. The callable must take the request as its first argument and the response as its second, and it should return either a Request object or None (returning None drops the request).

The callable is invoked for every single request, allowing you to process and customize requests before they are sent. Examples of such customizations are setting cookies and user agents.
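A sketch of such a hook (name and User-Agent string are hypothetical), wired as `Rule(LinkExtractor(), process_request='tag_request', follow=True)`:

```python
# Hypothetical process_request hook: attach a custom User-Agent header
# to every request the rule generates, then let it through.
def tag_request(self, request, response):
    request.headers["User-Agent"] = "my-crawler/1.0"
    return request
```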

errback

Another parameter that takes a callable as its value. The function will be called if an error (exception) is raised while processing a request.
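A hedged sketch of an errback (name hypothetical): the function receives the Failure for a request that raised an exception, which is handy for recording dead links. Wiring: `Rule(LinkExtractor(), callback='parse_func', errback='handle_error', follow=True)`.

```python
# Hypothetical errback: failure.request is the Request that failed,
# failure.value the exception that was raised.
def handle_error(self, failure):
    return {"failed_url": failure.request.url, "error": repr(failure.value)}
```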


Rules Examples

These are a few example Rules taken from our other tutorials, along with a short explanation.

#1

class SuperSpider(CrawlSpider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']
    rules = [Rule(LinkExtractor(allow="chapter"), callback='parse_func', follow=True)]

#2

class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  process_request='request_filter_book', follow=True)]
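Example #2 names request_filter_book as its process_request hook without showing its body. One hypothetical implementation, deduplicating requests by URL (this logic is our assumption, not necessarily what the original tutorial used):

```python
# Hypothetical body for the request_filter_book hook: drop requests
# whose URL has already been seen by this spider instance.
def request_filter_book(self, request, response):
    seen = getattr(self, "seen_urls", set())
    if request.url in seen:
        return None  # returning None drops the request entirely
    seen.add(request.url)
    self.seen_urls = seen
    return request
```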

allowed_domains is a handy setting to ensure that your Scrapy spider doesn’t go scraping domains other than the domain(s) you’re targeting. Without this setting, your Spider will follow external links (links which point to other websites) to other domains.


This marks the end of the Scrapy Rules tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial can be asked in the comments section below.
