This is a tutorial on using the Scrapy Shell.
If you’ve been following our Scrapy tutorial up until now, you’ll have noticed that we send the Scrapy output to a JSON file created in the Scrapy project directory. While this is a popular way of returning and saving your scraped data, there are other techniques as well.
In this tutorial we’ll be explaining how to use the Scrapy “shell” to directly input commands and have the result returned to us on the spot. Using the Scrapy shell is especially useful during debugging or testing phases where you can keep typing in commands instead of having to re-run the whole spider. Think of it like a quick way of testing new selector expressions before including them in the main spider.
Getting Started with the Scrapy Shell
Before we dive into the commands, let’s first set up our environment to use the Scrapy shell. Make sure you have Scrapy installed on your computer before proceeding.
To begin, open up your terminal (or command prompt on Windows) and navigate to the directory where your Scrapy project is located.
Once you are in the project directory, enter the following command:
scrapy shell
This will open up the Scrapy shell within your terminal, where you can begin typing commands. The very first thing we need to do is call the fetch command.
The fetch command fetches a URL and returns the response object. This is useful when testing selectors and inspecting the structure of the HTML page you are working with (for which we need the response object).
Here’s an example:
fetch("https://www.example.com")
This command will fetch the page at the specified URL and return the response object. You can then use selectors (e.g. XPath or CSS Selectors) to extract data from the response object.
Other Shell Commands
Here are some other shell commands you can use:
1. response
The response object represents the HTML page returned by the fetch() command. You can access various properties of the response object, such as the URL, the HTTP status code, and the page content. Here’s an example:
response.url
response.status
The first command returns the URL of the page, and the second returns the HTTP status code of the request sent to the target website (note that Scrapy calls this attribute status, not status_code).
2. xpath()
The xpath() command allows you to select elements from the HTML page using XPath expressions. Here’s an example:
response.xpath("//h1/text()")
This command will select the text inside every H1 tag on the page (not just the first one).
3. css()
The css() command is similar to xpath() but uses CSS selectors instead. Here’s an example:
response.css("p::text")
This command will select the text inside every P tag on the page.
Using the Scrapy Shell for Local Files
Scrapy shell is not just limited to running commands on live websites. It can also be used to test selectors and XPath expressions on local HTML files.
To use the Scrapy shell with a local HTML file, you can pass the file path to the shell command, either as a relative path or using the file:// protocol (which requires an absolute path). For example, if you have an HTML file named example.html, you can use the following command to start the shell:
scrapy shell 'file:///path/to/example.html'
Replace /path/to/example.html with the actual absolute path of the HTML file on your machine. Alternatively, a relative path works without the protocol prefix, e.g. scrapy shell ./example.html.
This marks the end of the Scrapy Shell Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.