Scrapy Yield - Returning Data

This tutorial explains how to use yield in Scrapy.

You can save the data collected by a Scrapy Spider with ordinary Python techniques such as printing, logging, or regular file handling. However, Scrapy also has a built-in way of saving and storing data: you yield the data from the Spider's callback with Python's yield keyword, and Scrapy's feed exports write whatever was yielded to a file for you.

In this tutorial we’ll quickly go through how the yield keyword is used in Scrapy.

Yield Keyword

Inside a Spider callback, each value you yield must be one of the following types:

Request (Scrapy object)
Item (Scrapy object, called BaseItem in older Scrapy versions)
Dict
None

This means that you can’t yield a bare string or integer. If you do, Scrapy rejects the value and logs an error.

The most common approach in Scrapy is to create a dictionary and pass it to yield. The key-value pairs of this dictionary hold the data that we’ve retrieved using XPath or CSS selectors.

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'URL': quote.xpath('.//a/@href').get(),
                'Quote': quote.xpath('.//p/text()').get(),
            }

As long as you keep the above format, whatever data you retrieve with XPath or CSS selectors will be compatible. A Python dictionary can store strings, numbers, lists, nested dictionaries and other objects, although for a JSON export the values need to be JSON-serializable.
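As a rough illustration of what a dictionary item can hold, the sketch below builds an item whose values are a string, a list and a nested dictionary, then serializes it with Python's standard json module. The json module is only a stand-in for Scrapy's own JSON exporter, and the field names and values are made up for the example.

```python
import json

# A made-up item: values may be strings, lists, or nested dictionaries.
item = {
    "Quote": "An example quote",
    "Tags": ["life", "choices"],
    "Author": {"name": "Example Author", "URL": "/author/example"},
}

# Plain Python containers and strings serialize without extra work.
line = json.dumps(item)
print(line)

# Reading the text back gives an equal dictionary.
restored = json.loads(line)
assert restored == item
```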

Next we’ll run the following command in the terminal of our IDE. (spider is the name of the Spider we created in the above example.)

scrapy crawl spider -o output.json

Now anything that you yield will automatically be written to the file output.json when the Spider runs. Scrapy picks the export format from the file extension, so the output will be in JSON format.
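The same export can also be declared in the project settings instead of on the command line. The fragment below is a sketch for settings.py, assuming a Scrapy version that supports the FEEDS setting and its overwrite option; the file name matches the command above, and the encoding and overwrite choices are assumptions about what you might want rather than required values.

```python
# settings.py (fragment)
FEEDS = {
    "output.json": {
        "format": "json",    # export items as a single JSON array
        "encoding": "utf8",  # write real characters instead of \uXXXX escapes
        "overwrite": True,   # replace the file on each run instead of appending
    },
}
```

With this in place, running scrapy crawl spider without the -o option should produce the same output.json file.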

Below are the top two results from the JSON file. We deleted the rest to conserve space. Notice the square brackets at the start and end, which show that the items are stored as a JSON array.

[
{"URL": "/author/Albert-Einstein", "Quote": "\u201cThe world as we have created......"},
{"URL": "/author/J-K-Rowling", "Quote": "\u201cIt is our choices......\u201d"}
]

Remember not to run the same command twice with the same output file. With the -o option, Scrapy appends the new JSON data to the existing file, which leaves the file as invalid JSON. Always delete the previous file, use a different file name, or, in recent Scrapy versions, use -O (capital O) to overwrite the file instead.
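The problem described above can be reproduced without Scrapy. The sketch below uses only the standard json module: it joins two JSON arrays the way a second run appended to the same file would, checks whether the result still parses, and then does the same with one object per line (the JSON Lines layout). The sample records are invented for the example.

```python
import json

first_run = json.dumps([{"Quote": "first"}])
second_run = json.dumps([{"Quote": "second"}])

# Appending a second JSON array to the first leaves two arrays back to back.
appended = first_run + "\n" + second_run
try:
    json.loads(appended)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False  # the combined text is no longer one valid JSON document

# With JSON Lines, every line is its own document, so appending is safe.
lines = [json.dumps({"Quote": "first"}), json.dumps({"Quote": "second"})]
items = [json.loads(line) for line in lines]

print(parsed_ok)
print(items)
```

If you do need to append across runs, Scrapy can write JSON Lines when the output file ends in .jl, for example scrapy crawl spider -o output.jl.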

You can also build the dictionary first and yield the variable, rather than writing the curly brackets directly in the yield statement. What matters is that the value you yield has an appropriate type. This is demonstrated in the example below.

    def parse(self, response):
        for resp in response.xpath('//a/text()'):
            data = {"data": resp.get()}
            yield data

Also remember to call the get() or extract() functions on the selectors before yielding the data. If you don’t, the dictionary will hold “selector” objects (the results of XPath and CSS expressions) rather than text, and the export will typically fail because these objects cannot be serialized.

This marks the end of the Scrapy Yield Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.