Scrapy Basics - CodersLegacy

This tutorial explains the fundamentals of Scrapy.

If you’ve come through our Scrapy Tutorial and Installation Guide, you’ll know what Scrapy is and what it’s used for. This tutorial focuses on a few core concepts and explains certain things you’ll need to know before beginning your very own Scrapy project.

Scrapy is a large and complex library, which makes it hard to jump straight in with no prior knowledge. That is why we begin with the basics of Scrapy first.

Make sure you’ve read our Project creation guide for Scrapy before continuing. The below tutorial assumes you have Scrapy set up and ready.

Crawl Spider Class

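One of the things that remains constant in almost every Scrapy program is the spider class and its parse function. The name of the class isn’t very important, but it must inherit from one of Scrapy’s spider classes, such as CrawlSpider, as shown below. Keep in mind that the Scrapy documentation advises against overriding parse once you start using CrawlSpider rules, because CrawlSpider relies on parse internally; for a simple spider without rules, scrapy.Spider is the usual base class.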

You can import CrawlSpider from the scrapy.spiders module.

from scrapy.spiders import CrawlSpider


class SuperSpider(CrawlSpider):

    def parse(self, response):
        ...  # parsing logic goes here

The parse function is the default callback that Scrapy calls with the response object downloaded for each URL. Even if you never call it yourself, Scrapy will do so automatically. For the sake of keeping things simple, we’ll leave it to Scrapy.

Scrapy Crawler Name

Every crawler in a Scrapy project must have a unique name. This name is used when the crawler is run from the terminal. You cannot execute a Scrapy bot with the crawl command without using its name.

All you have to do is create a variable called name inside the class you defined. Scrapy will automatically recognize this variable and use it as the name of your bot.

class SuperSpider(CrawlSpider):
    name = 'extractor'

    def parse(self, response):
        ...  # parsing logic goes here

In the above code, we’ve named our Scrapy bot “extractor”.

Run a Scrapy Spider

To run (execute) a Scrapy bot, you need to use the command terminal. If you’re using an IDE like PyCharm, you can open its built-in terminal from the tool window bar at the bottom of the window. It resembles the command prompt that you can use on your desktop.

Before you begin using any commands, you need to make sure you are in the right directory. If you created a project called “tutorial” then your command terminal must be executing within the tutorial project directory.

C:\Users\CodersLegacy\PycharmProjects\Scrapy_proj\tutorial>

This is what our project directory looks like. Sometimes your directory may be set to the PyCharm project, Scrapy_proj, instead of the Scrapy project, tutorial. Be sure to switch to the Scrapy project directory using the “change directory” command, cd.

For most use cases, the crawl command is more than enough. (The crawl command only works when you run it from the right directory.)

C:\Users\.....\tutorial> scrapy crawl extractor -o extractor_output.json

The above command runs the Scrapy bot named extractor and writes (-o) the scraped data to a file called extractor_output.json. This JSON file is stored in the tutorial folder.
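Once the crawl has finished, the JSON feed can be read back with Python’s standard json module. The sketch below is only an illustration: the helper load_items and the placeholder item are assumptions, and a temporary stand-in file is written first so the example runs without a real crawl.

```python
import json
import tempfile
from pathlib import Path


def load_items(path):
    """Return the list of scraped items stored in a JSON feed file."""
    with Path(path).open(encoding="utf-8") as feed:
        return json.load(feed)


# Stand-in data so the sketch runs without a real crawl.
# A real feed contains whatever your parse function yielded.
sample = [{"quote": "placeholder text"}]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "extractor_output.json"
    demo_path.write_text(json.dumps(sample), encoding="utf-8")
    items = load_items(demo_path)

print(len(items), "item(s) loaded")
```

Be aware that -o appends to an existing file. Running the same command twice can leave a .json feed that is no longer valid JSON, so it is safest to delete the old output file before crawling again.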

URLs and Domains

Scrapy, as mentioned earlier, has a large number of features. Besides the name variable, there are other variables which carry great importance in Scrapy. See the code below.

class SuperSpider(CrawlSpider):
    name = 'tester'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'

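You can see three new variables: allowed_domains, start_urls and base_url. Remember, do not try to change the names of allowed_domains and start_urls. Scrapy looks them up by these exact names, so they will lose all meaning and purpose if you do. base_url, on the other hand, is not a built-in Scrapy attribute; it is simply a helper variable for your own code.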

Allowed Domains

allowed_domains is a variable that Scrapy checks when following links. Let’s say you are trying to scrape a website completely. Naturally, your spider is going to keep following links to get from one web page to another. But how do you stop it from following links to other sites?

All you have to do is create a list with the names of the domains that you wish to be crawled and assign it to allowed_domains. Scrapy will ignore any links that aren’t from the domains in allowed_domains.
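The idea behind this filtering can be sketched with the standard library alone. This is a simplified illustration and not Scrapy’s actual implementation; the function name is_allowed is made up for the example. A URL passes if its host is one of the allowed domains or a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


allowed_domains = ["quotes.toscrape.com"]

# A link inside the site is followed.
print(is_allowed("http://quotes.toscrape.com/page/2/", allowed_domains))
# A link to another site is ignored.
print(is_allowed("http://example.com/", allowed_domains))
```

Note that allowed_domains holds bare domain names such as quotes.toscrape.com, not full URLs with http:// in front.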

Start URLs

start_urls is the most commonly used variable and is present in almost all Scrapy bots. It holds the list of one or more URLs which are to be scraped.

This is useful when you haven’t added any link following capabilities to your Scrapy bot or just want a few specific pages scraped.

Scrapy automatically creates a request for every URL in this variable, downloads the page, and generates a response object. This response object is then passed into the parse function automatically.

Base URL

Not very important, but handy in certain situations. This variable holds the “base URL” of the website you’re aiming to scrape, which helps in completing internal (relative) URLs into full, proper URLs.
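As a small sketch of that completion step, the standard library’s urljoin can combine a base URL with an internal link. The relative paths below are assumed examples of links you might find on a page.

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com"

# Hypothetical internal links as they might appear in a page's HTML.
relative_links = ["/page/2/", "/author/example/"]

# Complete each internal link into a full URL.
full_urls = [urljoin(base_url, link) for link in relative_links]

for url in full_urls:
    print(url)
```

Inside a parse function you can also call response.urljoin(link), which resolves the link against the URL of the page being parsed, so a separate base_url variable is often not needed.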

This marks the end of the Scrapy Basics Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.