Scrapy Item Pipelines and Processing

One of the most powerful features of Scrapy is the ability to use item pipelines to process scraped data. In this tutorial, we will explore how to create and enable item pipelines in Scrapy.


Introduction to Scrapy Item Pipelines

Item pipelines are a series of Python classes that are used to process data after it has been scraped by a spider. Each item pipeline is responsible for performing a specific action on the scraped data. For example, an item pipeline could clean and validate data, store it in a database, or send it to an API.

When a spider scrapes data, it generates Scrapy items that contain the scraped data. These items are then passed through the item pipelines. Each pipeline can modify or filter an item, and then pass it on to the next pipeline in the chain. Once an item has passed through every enabled pipeline, it is considered complete.
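Besides modifying items, a pipeline can also filter them out entirely by raising Scrapy’s DropItem exception. Here is a minimal sketch, assuming (purely for illustration) that we want to discard any item with a missing price:

from scrapy.exceptions import DropItem

class PriceFilterPipeline:
    def process_item(self, item, spider):
        # A dropped item never reaches the pipelines that follow,
        # and is excluded from the final output.
        if not item.get("price"):
            raise DropItem(f"Missing price in {item}")
        return item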


Creating a Scrapy Item Pipeline

To create a Scrapy item pipeline, we need to create a Python class that implements the process_item() method. This method is called for each item that is scraped by the spider. It takes two arguments: the item and the spider that scraped the item.
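In its most minimal form, a pipeline is just a class with that one method (the class name here is arbitrary):

class ExamplePipeline:
    def process_item(self, item, spider):
        # Inspect or modify the item here, then hand it to the next pipeline.
        return item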

Let’s take a look at an example item pipeline that processes book items scraped from a website (the code for the spider is included near the end of this article).

Here are three records we have scraped. As you can see, the rating field is a string of the form “star-rating” followed by the actual rating in words. This isn’t ideal, so we will create a pipeline that processes this data and converts the rating into numeric form.

{"title": "A Light in the Attic", "price": "\u00a351.77", "rating": "star-rating Three"},
{"title": "Tipping the Velvet", "price": "\u00a353.74", "rating": "star-rating One"},
{"title": "Soumission", "price": "\u00a350.10", "rating": "star-rating One"},

Here is the pipeline.

class TutorialPipeline:
    # Word-to-number mapping for the five possible star ratings.
    RATINGS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

    def process_item(self, item, spider):
        # "star-rating Three" -> "Three"
        word = item["rating"].split(" ")[1]

        # Swap in the numeric value, leaving the field untouched
        # if the word is not one we recognize.
        item["rating"] = self.RATINGS.get(word, item["rating"])

        return item

This pipeline processes the “rating” field of the book items scraped by the spider. It extracts the rating word from the “star-rating” class string and looks it up in a small mapping to convert it to an integer.
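Since process_item() only needs a dict-like item (and the spider argument is unused here), we can sanity-check the logic outside of Scrapy with a plain dictionary:

pipeline = TutorialPipeline()
item = {"title": "Soumission", "price": "\u00a350.10", "rating": "star-rating One"}

print(pipeline.process_item(item, spider=None))
# {'title': 'Soumission', 'price': '£50.10', 'rating': 1}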


Enabling a Scrapy Item Pipeline

To use a Scrapy item pipeline, we need to add it to the ITEM_PIPELINES setting in the project’s settings.py file. This setting is a dictionary that maps the import path of each pipeline class to an integer controlling its execution order.

Here’s an example ITEM_PIPELINES setting that uses the TutorialPipeline class we created above:

ITEM_PIPELINES = {
   'tutorial.pipelines.TutorialPipeline': 300,
}

The keys in the ITEM_PIPELINES dictionary are the import paths of the pipeline classes, and the values are integers that determine the order in which the pipelines are executed. The lower the number, the earlier the pipeline runs; it’s customary to keep these values in the 0-1000 range.

In this example, we’ve given TutorialPipeline a value of 300, which means it will be executed after any pipeline with a lower number, and before any pipeline with a higher one.
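For example, if we later added a second pipeline, say a hypothetical DatabasePipeline that stores the finished items, we would give it a higher number so that it runs after the rating has been converted:

ITEM_PIPELINES = {
   'tutorial.pipelines.TutorialPipeline': 300,
   # Hypothetical pipeline; runs second because 800 > 300.
   'tutorial.pipelines.DatabasePipeline': 800,
}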


Re-scraping the Target Website with the Item Pipeline

Now we will execute our code again (without making any changes to the spider). Here is the spider code, to give you a better idea of what is going on. Try it out for yourself if you want.

import scrapy
from tutorial.items import BookItem
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BookSpider(CrawlSpider):
    name = 'Book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # Follow every link on the site and scrape each page with parse_item().
    # (Rules only take effect on a CrawlSpider, and a CrawlSpider must not
    # override parse(), so the callback gets its own name.)
    rules = [Rule(LinkExtractor(allow=""), callback="parse_item", follow=True)]

    def parse_item(self, response):
        books = response.xpath("//article[@class='product_pod']")

        for book in books:
            title = book.xpath("./h3/a/@title").get()
            price = book.xpath("./div[@class='product_price']/p[@class='price_color']/text()").get()
            review = book.xpath("./p[contains(@class, 'star-rating')]/@class").get()

            bookItem = BookItem()
            bookItem["title"] = title
            bookItem["price"] = price
            bookItem["rating"] = review

            yield bookItem
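For completeness, the BookItem class imported above lives in tutorial/items.py. Its definition isn’t shown in this article, but given the three fields the spider fills in, it would look along these lines:

import scrapy

class BookItem(scrapy.Item):
    # One Field per attribute the spider assigns.
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()

The spider is run from the project directory as usual, for example with scrapy crawl Book -o books.json to export the items as JSON.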

Here are the updated records from running the spider again. As you can see, the “rating” field now holds numeric values.

{"title": "A Light in the Attic", "price": "\u00a351.77", "rating": 3},
{"title": "Tipping the Velvet", "price": "\u00a353.74", "rating": 1},
{"title": "Soumission", "price": "\u00a350.10", "rating": 1},

Mission accomplished.

If you are interested in learning how the above spider works, follow this link to learn about “Link following in Scrapy”.



This marks the end of the Scrapy Item Pipelines and Processing tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
