Scrapy is a powerful Python framework for web scraping. It provides a convenient way to extract data from websites, and one of its most useful features is Scrapy Items. In this tutorial, we will explore Scrapy Items and their benefits.
What are Scrapy Items?
Scrapy Items are containers for the data that you want to scrape; beyond simply holding values, they let you group related fields together into a single, structured object.
Let’s say you are scraping a website for information about books. Each book has a title, author, price, and rating. Instead of creating separate variables for each piece of information, you can define a Scrapy Item with these fields and populate them as you extract the data. This gives you a clear, structured representation of the data that is easier to work with later on.
Benefits of Using Scrapy Items
Scrapy Items provide several advantages when it comes to web scraping. Here are the key ones:
- Structured Data: Scrapy Items allow you to define the structure of the data that you want to extract from a website. This makes it easier to work with the data later on, and ensures that your data is well-organized and easy to understand.
- Consistency: When you define the structure of your Scrapy Item, every scraped page produces a record with the same fields. This simplifies your code and keeps your data consistent.
- Reusability: Once you have defined a Scrapy Item, you can reuse it in multiple spiders. This saves time and ensures that your code is more maintainable.
How to Define a Scrapy Item
Defining a Scrapy Item is easy. First, you need to create a new Python file and import the scrapy module. Then, create a new class that inherits from scrapy.Item. Finally, define the fields of the Scrapy Item using scrapy.Field.
Here’s an example of a Scrapy Item that defines the fields for a book:
```python
import scrapy

class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
```
In this example, we define a Scrapy Item with four fields: title, author, price, and rating.
How to Use Scrapy Items in a Spider
Once you have defined a Scrapy Item, you can use it in your spider to extract data from a website. In your spider, you create an instance of your Scrapy Item and populate its fields with data from the website.
Here’s an example of how to use a Scrapy Item in a spider:
```python
import scrapy
from myproject.items import BookItem

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        books = response.xpath("//article[@class='product_pod']")
        for book in books:
            title = book.xpath("./h3/a/@title").get()
            author = book.xpath("./div/p[@class='author']/a/@title").get()
            price = book.xpath("./div[@class='product_price']/p[@class='price_color']/text()").get()
            rating = book.xpath("./p[contains(@class, 'star-rating')]/@class").get()

            book_item = BookItem()
            book_item['title'] = title
            book_item['author'] = author
            book_item['price'] = price
            book_item['rating'] = rating
            yield book_item
```
In this example, we create an instance of BookItem and populate its fields with data extracted from the website. Although this does not add any functionality by itself, it makes our code more scalable and compatible with other features of Scrapy.
For example, Items are popularly used with Scrapy’s Item Pipelines, where each yielded item is passed to a pipeline for further data processing, cleaning, and filtering. Our Scrapy Pipelines tutorial covers this in detail.
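As a rough sketch of how that hand-off looks, here is a hypothetical pipeline (the class name and price format are assumptions for illustration, not part of this tutorial’s project) that converts the scraped price string into a number:

```python
class PriceCleaningPipeline:
    """Hypothetical pipeline: normalises the 'price' field of each item.

    Scrapy calls process_item() once for every item yielded by a spider.
    """

    def process_item(self, item, spider):
        price = item.get('price')
        if price:
            # Strip the currency symbol, e.g. '£51.77' -> 51.77
            item['price'] = float(price.lstrip('£'))
        return item

# Standalone demonstration with a plain dict (pipelines accept dicts too):
pipeline = PriceCleaningPipeline()
cleaned = pipeline.process_item({'title': 'Sharp Objects', 'price': '£47.82'}, spider=None)
print(cleaned['price'])  # 47.82
```

In a real project you would enable the pipeline by adding it to the ITEM_PIPELINES setting in settings.py.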
Don’t forget to call the get() function on each XPath query; without it, a SelectorList object (rather than the extracted string) will be saved to the item!
This marks the end of the Scrapy Items tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.