This tutorial explains the use of the get and extract methods in Scrapy.
Scrapy has two main methods used to “extract” or “get” data from the elements that it pulls off a web page. They are called extract and get. extract is the older method, while get was released as its successor.
Since the introduction of the get method, the Scrapy usage docs have been written using the .get() and .getall() methods, because these newer methods result in more concise and readable code.
The get method
Both methods are used in exactly the same way. To start off, we’ll explain the usage with the more popular (and recommended) version, the get method.
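    import scrapy

    # CrawlSpider uses parse() internally, so a plain Spider is the safer base class here
    class SuperSpider(scrapy.Spider):
        name = 'tester'
        # Book listing page that the outputs in this tutorial come from
        start_urls = ['http://books.toscrape.com/']

        def parse(self, response):
            ...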
We’ll be using the above class and URL for our Scrapy bot in the examples below. To keep the number of repeated lines down, we’ll only be writing the parse function.
We’ll be using XPath, since it has more functionality and is recommended for complex expressions. If you wish, however, the same methods work with CSS selectors too.
get method
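The get() method is called on the result of response.xpath() to “get” the first match as a string. Each dictionary yielded from parse is collected by Scrapy as an item, and is written to a JSON file when the spider is run with a JSON feed export (for example, with the -o output.json option).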
    def parse(self, response):
        resp = response.xpath('//h3/a/text()')
        yield {
            "test": resp.get()
        }
As you can see below, only one value has been returned.
[
{"test": "A Light in the ..."}
]
As you’ll see in the next example, there is more than one book title on the web page. However, get only returned the first one.
getall method
The getall() method is called on the result of response.xpath() to “get” all of the matches as a list of strings.
    def parse(self, response):
        # Same expression as before, so the book titles are matched again
        resp = response.xpath('//h3/a/text()')
        yield {
            "test": resp.getall()
        }
As expected, all 20 book titles from the web page have been retrieved and written to our JSON file. We omitted the second half of the results in the output below.
[
{"test": ["A Light in the ...",
"Tipping the Velvet",
"Soumission",
"Sharp Objects",
"Sapiens: A Brief History ...",
"The Requiem Red",
"The Dirty Little Secrets ...",
"The Coming Woman: A ...",
"The Boys in the ...",
"The Black Maria",
...]}
]
Get and Extract Comparison
This section covers a short comparison between the extract and get methods, just to show that both are used in exactly the same way.
In the examples below, we’ll be using the Scrapy get and extract methods on the same selector result and comparing the output.
First, we’ll try the get and extract_first methods.
    def parse(self, response):
        resp = response.xpath('//h3/a/text()')
        yield {
            "get": resp.get(),
            "extract": resp.extract_first()
        }
Comparing the two outputs below, we can see that they are identical.
[
{"get": "A Light in the ...",
"extract": "A Light in the ..."}
]
Next are the getall and extract methods.
    def parse(self, response):
        resp = response.xpath('//h3/a/text()')
        yield {
            "get": resp.getall(),
            "extract": resp.extract()
        }
As expected, the outputs of both methods are identical.
"get": ["A Light in the...", "Tipping the Velvet", "Soumission",.....
"extract": ["A Light in the ...", "Tipping the Velvet", "Soumission"......
In short, getall() is the same as extract(), whereas get() is the same as extract_first().
This marks the end of the Scrapy “get” and “extract” methods tutorial. Any suggestions or contributions for CodersLegacy are more than welcome.