This tutorial explains the fundamental concepts behind Scrapy.
If you've come through our Scrapy Tutorial and Installation Guide, you'll know what Scrapy is and what it's used for. This tutorial focuses on building a few core concepts and explains certain things you'll need to know before beginning your very own Scrapy project.
Scrapy is a large and complex library, which makes it hard to jump directly in with no prior knowledge. That's why we begin with the basics of Scrapy first.
Make sure you’ve read our Project creation guide for Scrapy before continuing. The below tutorial assumes you have Scrapy set up and ready.
Crawl Spider Class
Two things remain constant in almost every Scrapy program: the CrawlSpider class and the parse function. The name of your spider class isn't very important, but it must inherit from CrawlSpider as shown below.
You can import CrawlSpider from the scrapy.spiders module:

from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    def parse(self, response):
        ...
The parse function is the function called to "parse" the response objects returned from the URLs being scraped. You don't need to call it yourself; Scrapy will do so automatically. For the sake of keeping things simple, we'll leave it to Scrapy.
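To make the flow concrete, here is a minimal sketch of the parse pattern in plain Python. The FakeResponse class below is a hypothetical stand-in for the response object Scrapy would pass in, so the sketch runs without Scrapy installed; it is not Scrapy's actual Response class.

```python
# Hypothetical stand-in for the response object Scrapy builds per URL.
class FakeResponse:
    def __init__(self, url, text):
        self.url = url    # the page's URL
        self.text = text  # the raw HTML of the page

def parse(response):
    # A real parse() would use response.css() or response.xpath() to
    # extract data; here we just yield a dict to show the generator
    # pattern Scrapy expects from parse().
    yield {"url": response.url, "length": len(response.text)}

items = list(parse(FakeResponse("http://quotes.toscrape.com/", "<html>...</html>")))
print(items)
```

In a real spider, Scrapy constructs the response object for you and collects whatever parse yields as scraped items.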
Scrapy Crawler Name
Every crawler in Scrapy must have a unique name. This name is used when calling it from the terminal. You cannot execute a Scrapy bot without using its name.
All you have to do is create a variable called name inside the class you defined. Scrapy will automatically recognize the name variable and assign it accordingly.

class SuperSpider(CrawlSpider):
    name = 'extractor'

    def parse(self, response):
        ...
In the above code, we’ve named our Scrapy bot “extractor”.
Run a Scrapy Spider
In order to run (execute) a Scrapy bot, you need to use the command terminal. If you're using an IDE like PyCharm, you can find the terminal along the bottom of the window. It resembles the command prompt that you can use on your desktop.
Before you begin using any commands, you need to make sure you are in the right directory. If you created a project called “tutorial” then your command terminal must be executing within the tutorial project directory.
This is what our project directory looks like. Sometimes your directory may be set to the PyCharm project, Scrapy_proj, instead of the Scrapy project, tutorial. Be sure to change into the Scrapy project directory using the cd ("change directory") command:

C:\Users\.....\Scrapy_proj> cd tutorial
For most use cases, the crawl command is more than enough. (It will only work if run from the right directory.)
C:\Users\.....\tutorial> scrapy crawl extractor -o extractor_output.json
The above command runs the Scrapy spider named extractor and outputs (-o) the scraped data into a file called extractor_output.json. This JSON file is stored in the tutorial folder.
URLs and Domains
Scrapy, as mentioned earlier, has a large number of features. Besides the name variable, there are other variables that carry great importance in Scrapy. See the code below.
class SuperSpider(CrawlSpider):
    name = 'tester'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
You can see three new variables: allowed_domains, start_urls, and base_url. Do not try to change the names of allowed_domains or start_urls; Scrapy looks for these exact names, and they will lose all meaning and purpose if you rename them. (base_url is just a helper variable of our own, so its name is up to you.)
allowed_domains is a variable that Scrapy checks when following links. Let's say you are trying to scrape a website completely. Naturally, your Spider is going to keep following links to get from one web page to another. But how do you stop it from following links to other sites?
All you have to do is create a list with the names of the domains you wish to be crawled and assign it to allowed_domains. Scrapy will ignore any links that aren't from the domains in allowed_domains.
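The check works roughly like the sketch below. This is a simplified, plain-Python version of the kind of filtering Scrapy's offsite handling performs, not Scrapy's actual implementation: a link's hostname must be an allowed domain, or a subdomain of one.

```python
from urllib.parse import urlparse

allowed_domains = ['quotes.toscrape.com']

def is_allowed(url, allowed=allowed_domains):
    # Simplified sketch: the link's hostname must equal an allowed
    # domain, or end with '.<domain>' (i.e. be a subdomain of it).
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in allowed)

print(is_allowed('http://quotes.toscrape.com/page/2/'))  # followed
print(is_allowed('https://example.com/other-site'))      # ignored
```

Links failing this check are simply never requested, which keeps the Spider from wandering off across the web.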
start_urls is the most commonly used variable, present in almost all Scrapy bots. This variable holds the URL or list of URLs which are to be scraped.
This is useful when you haven’t added any link following capabilities to your Scrapy bot or just want a few specific pages scraped.
Every URL in this variable is automatically requested, and a response object is generated for each. These response objects are then passed into the parse function automatically.
base_url is not very important, but handy in certain situations. It holds the "base URL" of the website you're aiming to scrape, and helps in completing relative internal URLs into full, proper URLs.
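To see how a base URL completes a relative link, the standard library's urljoin does the job. (Inside a spider, Scrapy also provides response.urljoin for the same purpose; the author-page path below is just an illustrative example.)

```python
from urllib.parse import urljoin

base_url = 'http://quotes.toscrape.com'

# A relative link scraped from a page, e.g. href="/author/albert-einstein",
# is completed into a full URL against base_url:
full_url = urljoin(base_url, '/author/albert-einstein')
print(full_url)  # http://quotes.toscrape.com/author/albert-einstein
```

This is typically how you turn the partial hrefs found in a page's HTML into URLs that Scrapy can actually request.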
This marks the end of the Scrapy Basics Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.