Scrapy Link Extractors | Extracting Data

This is a tutorial on link extractors in Python Scrapy.

In this Scrapy tutorial we’ll focus on creating a Scrapy bot that can extract all the links from a website. The program we’ll be creating is more than just a link extractor; it’s also a link follower.

It’s easy enough to extract all the links from a single page, but it’s much harder to scrape links from an entire website. That’s what we’ll be focusing on in this article.

This tutorial also features the LinkExtractor and Rule classes, which add extra functionality to your Scrapy bot.



Selecting a Website for Scraping

It’s important to scope out the websites you’re going to scrape; you can’t just go in blindly. You need to know the HTML layout so you can extract data from the right elements.

In our case, the website that we’re going to scrape is called Quotes to Scrape, a site designed specifically to be scraped by Scrapy practitioners.

You can examine the HTML layout of the website either by using the inbuilt Inspect tool in your browser (right click -> Inspect) or by going directly to the page source (right click -> Page source).

The reason we know exactly what to put in each expression is that we individually examined the HTML code for each piece of information we wanted to extract. If you want to learn more about how to use XPath or CSS selectors to create these expressions, follow the links to their tutorials.
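
If you want to try these expressions out before writing a full spider, the Scrapy shell is a quick way to do so. Below is a minimal sketch (assuming the site is up and its quote markup hasn’t changed):

# Start an interactive shell with:  scrapy shell "http://quotes.toscrape.com/"
# Inside the shell, a 'response' object for the fetched page is already available.
quote = response.css('div.quote')[0]                  # first quote block on the page
quote.xpath('.//span[@class="text"]/text()').get()    # the quote text itself
quote.xpath('.//span/a/@href').get()                  # relative link to the author page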


Scrapy – Extracting Data Example

The name variable declared below is the name by which our spider will be called when executing the program. This name should be unique, especially when the project contains multiple spiders.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    # A Rule with no LinkExtractor argument falls back to a default LinkExtractor
    # (in recent Scrapy versions), so every link on every page gets followed.
    rules = [Rule(callback='parse_func', follow=True)]

    def parse_func(self, response):
        # Each quote on the page lives in a <div class="quote"> element.
        for quote in response.css('div.quote'):
            yield {
                'Author': quote.xpath('.//span/a/@href').get(),
                'Quote': quote.xpath('.//span[@class="text"]/text()').get(),
            }
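
To produce output like the records shown below, run the spider with Scrapy’s feed export from inside the project, for example with scrapy crawl spider -o quotes.json (the -o flag appends the scraped items to the given file).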

If you take a close look at the output of the above code, you’ll notice quite a few duplicated records. This is because our spider crawled the entire site indiscriminately, following every link it found.

Below are three records we picked at random to show the duplication effect. In total, 400+ quotes were returned, four times the number there is supposed to be (100).

{"Author": "/author/George-R-R-Martin", "Quote": "\u201cA reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.\u201d"},
{"Author": "/author/George-R-R-Martin", "Quote": "\u201cA reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.\u201d"},
{"Author": "/author/George-R-R-Martin", "Quote": "\u201cA reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.\u201d"},

Scrapy – Link Extractor

Upon closer inspection of the site, you will realize that the quotes are categorized (sorted) in two ways: first by pages, and second through the use of tags. The current problem is that the spider is scraping both, resulting in many duplicated records, especially since some quotes appear under multiple tags.

Since we want clean, unique records, we’re going to make some changes. You could always try filtering the output yourself later using Python code, but Scrapy can do it better and faster using Rules and the LinkExtractor.

class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'
    # Follow only links whose URLs match 'page/' (the pagination links)
    # and skip anything matching 'tag/' (the tag listings).
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  callback='parse_filter_book', follow=True)]

    def parse_filter_book(self, response):
        for quote in response.css('div.quote'):
            yield {
                'Author': quote.xpath('.//span/a/@href').get(),
                'Quote': quote.xpath('.//span[@class="text"]/text()').get(),
            }
Here are three records picked at random from the first ten retrieved. This time the total number of quotes was exactly 100, as expected.

{"Author": "/author/Dr-Seuss", "Quote": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d"},
{"Author": "/author/Douglas-Adams", "Quote": "\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d"},
{"Author": "/author/Elie-Wiesel", "Quote": "\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d"},

Start URLs

Continuing this line of thought, we could have used an alternate method where we simply list the URLs we want to scrape. This technique works on this site because it only has ten pages with predictable names.

We’ve removed a lot of code, keeping just start_urls and listing the URLs of the first five pages in it.

import scrapy

# With no Rules to apply there is no need for CrawlSpider (which reserves the
# parse() method for its own logic), so we subclass the plain scrapy.Spider.
class SuperSpider(scrapy.Spider):
    name = 'spider'
    start_urls = ['http://quotes.toscrape.com/page/1/',
                  'http://quotes.toscrape.com/page/2/',
                  'http://quotes.toscrape.com/page/3/',
                  'http://quotes.toscrape.com/page/4/',
                  'http://quotes.toscrape.com/page/5/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'Author': quote.xpath('.//span/a/@href').get(),
                'Quote': quote.xpath('.//span[@class="text"]/text()').get(),
            }

Since we removed the Rules, we switched back to the plain scrapy.Spider class and changed the function name back to parse, so that Scrapy calls it automatically for each of the five URLs.
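
Incidentally, because these page URLs follow a predictable /page/<n>/ pattern, the same list could also be generated rather than written out by hand:

# Equivalent to the hand-written start_urls list above (first five pages only).
start_urls = [f'http://quotes.toscrape.com/page/{n}/' for n in range(1, 6)]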

However, this technique becomes almost useless on large sites with hundreds of pages to scrape and vastly different URLs. Hence, on such sites we instead create a set of Rules for the Scrapy spider, which determine which links it should follow.

The benefit of this technique is that if there are only a few specific pages you want scraped, you don’t have to worry about any other pages and the problems that come with them.

We also have another tutorial that explains how to get all the URLs/links from a web page; it further expands on the Scrapy LinkExtractor class. Be sure to read it to deepen your understanding.


This marks the end of the Scrapy Link Extractors tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
