Scrapy – get() and extract() functions

This tutorial explains the use of the get and extract methods in Scrapy.

Scrapy has two main methods used to “extract” or “get” data from the elements that it pulls from websites. They are called extract and get. extract is the older method, while get was introduced later as its successor.

Since the introduction of the get method, the Scrapy usage docs have been written using the .get() and .getall() methods, because these newer methods result in more concise and readable code.


The get method

Both methods are used in exactly the same way. To start off, we’ll explain the usage with the more popular (and recommended) version, the get method.

import scrapy


class SuperSpider(scrapy.Spider):
    name = 'tester'
    # Page listing the book titles shown in the outputs below
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Filled in by each of the examples below
        ...

We’ll be using the above class and URL for our Scrapy bot in the below examples. To keep the number of repeated lines down, we’ll only be writing the parse function.

We’ll be using XPath, since it offers more functionality and is recommended for complex expressions. If you prefer, the same results can be obtained with CSS selectors.

get method

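The get() method is called on the list of selectors returned by response.xpath() to “get” the first result as a string. The yield keyword hands the resulting dictionary to Scrapy as an item. When the spider is run with a feed export (for example with the -o output.json option), each item is written to a JSON file.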

def parse(self, response):
    resp = response.xpath('//h3/a/text()')
    yield {
        "test" : resp.get()
    }

As you can see below, only one value has been returned.

[
{"test": "A Light in the ..."}
]

As the next example shows, the web page contains more than one book title. However, get returned only the first one.

getall method

The getall() method is called on the list of selectors returned by response.xpath() to “get” all of the matched results as a list of strings.

def parse(self, response):
    resp = response.xpath('//h3/a/text()')
    yield {
        "test" : resp.getall()
    }

As expected, all 20 book titles from the web page have been retrieved and written to our JSON file. The second half of the results is omitted from the output below.

[
{"test": ["A Light in the ...",
          "Tipping the Velvet",
          "Soumission",
          "Sharp Objects",
          "Sapiens: A Brief History ...",
          "The Requiem Red",
          "The Dirty Little Secrets ...",
          "The Coming Woman: A ...",
          "The Boys in the ...",
          "The Black Maria",
          ...]}
]

Get and Extract Comparison

This section gives a short comparison between the extract and get methods, to show that both are used in exactly the same way.

In the examples below, we’ll call the Scrapy get and extract methods on the same selection and compare the output.

First we’ll try the get and extract_first methods.

def parse(self, response):
    resp = response.xpath('//h3/a/text()')
    yield {
        "get" : resp.get(),
        "extract": resp.extract_first()
    }

Comparing the two outputs below, we can see that they are identical.

[
{"get": "A Light in the ...", 
"extract": "A Light in the ..."}
]

Next are the getall and extract methods.

def parse(self, response):
    resp = response.xpath('//h3/a/text()')
    yield {
        "get" : resp.getall(),
        "extract": resp.extract()
    }

As expected, the outputs of both methods are the same in every respect.

"get": ["A Light in the ...", "Tipping the Velvet", "Soumission", ...
"extract": ["A Light in the ...", "Tipping the Velvet", "Soumission", ...

In short, getall() is the same as extract(), whereas get() is the same as extract_first().


This marks the end of the Scrapy “get” and “extract” methods tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article can be asked in the comments section below.
