Scrapy – Extract links from Web Pages

This tutorial explains how to extract links from web pages using Scrapy.

There are many things one might want to extract from a web page: text, images, HTML elements and, most importantly, URLs (Uniform Resource Locators).

In this Scrapy tutorial we’ll explain how to scrape links from websites and save them to a JSON file. We’ll be experimenting on two different sites, Wikipedia and BooksToScrape.


Scraping Wikipedia

We’re going to scrape the Python Wikipedia page and return all of its content links in a JSON file.

Why did we write the XPath in the code below the way we did? A Wikipedia page contains many different kinds of URLs, such as login URLs, external links and citations. Searching for links only within paragraphs ensures that we return just the links that appear in the article content.

from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'extractor'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Python_(programming_language)']
    base_url = 'https://en.wikipedia.org'

    def parse(self, response):
        # Only look at anchors inside paragraphs, so navigation, login and
        # citation links are skipped.
        for link in response.xpath('//div/p/a'):
            yield {
                # Internal hrefs are relative (e.g. /wiki/Interpreted_language),
                # so prepend the domain to form a complete URL.
                "link": self.base_url + link.xpath('.//@href').get()
            }
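
If the spider is saved inside a Scrapy project, you can run it and write the yielded items to a JSON file using Scrapy’s -o option, for example:

scrapy crawl extractor -o links.json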

We prepend base_url to each returned URL to make it complete. The code above returns only internal links, and internal hrefs don’t include the domain name, so they look like this: /wiki/Interpreted_language. Concatenating the base URL onto them produces a full, usable URL.
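
If you’d rather not hard-code the domain, Scrapy responses also provide a urljoin method that resolves relative hrefs against the page’s own URL. A minimal variation of the parse method above (a sketch, not the code we used) could look like this:

    def parse(self, response):
        for link in response.xpath('//div/p/a'):
            href = link.xpath('.//@href').get()
            if href:
                # urljoin resolves a relative path such as /wiki/Interpreted_language
                # against the URL of the page that was scraped.
                yield {"link": response.urljoin(href)}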

Here are a few records copied from the output:

{"link": "https://en.wikipedia.org/wiki/Interpreted_language"},
{"link": "https://en.wikipedia.org/wiki/High-level_programming_language"},
{"link": "https://en.wikipedia.org/wiki/General-purpose_programming_language"},

Scraping External Links

Upon closer inspection, you’ll notice that there are all kinds of URLs on a Wikipedia page. The ones we scraped above are just one example.

Another type of URL found on Wikipedia pages is the external link, which points to a site other than Wikipedia. These anchors are marked with the class "external text".

The code below shows how to extract only the external links.

from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'extractor'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Python_(programming_language)']
    base_url = 'https://en.wikipedia.org'

    def parse(self, response):
        # Anchors with the class "external text" are the links that point
        # away from Wikipedia, so we match on that class directly.
        for link in response.xpath('//a[@class = "external text"]'):
            yield {
                # External hrefs are already absolute, so no base_url is needed.
                "link": link.xpath('.//@href').get()
            }

Here are three records picked at random from the output. Notice the different domain names:

{"link": "https://www.python.org/"},
{"link": "https://www.python.org/downloads/release/python-385/"},
{"link": "https://bugs.python.org"},

Scraping BooksToScrape

After careful inspection of the site, we noticed that there are 1000 books in total, listed in two different ways: page by page and by genre. If we scrape the entire site without any restriction, we will end up with many duplicate URLs, since the URL for a specific book is repeated many times throughout the site.

If you’ve read our Link Extractor tutorial, you’ll remember that we faced a similar problem there, which we solved using the Link Extractor and Rules.

The code below has a rule that only allows the Scrapy bot to follow URLs from the main category, books_1, where all 1000 books are listed, divided across 50 pages.

How did we know to write books_1? We carefully inspected the URLs of the different categories, looking for something we could use to ensure that only the main books category was scraped. books_1 turned out to be the unique string that appears only in the URLs of those 50 pages.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SuperSpider(CrawlSpider):
    name = 'extractor'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
    base_url = 'http://books.toscrape.com/catalogue'
    # Only follow URLs containing "books_1/", i.e. the 50 pages of the
    # main "Books" category, so every book is visited exactly once.
    rules = [Rule(LinkExtractor(allow='books_1/'),
                  callback='parse_func', follow=True)]

    def parse_func(self, response):
        # Book links sit inside <h3> heading tags on the listing pages.
        for link in response.xpath('//h3/a'):
            url = link.xpath('.//@href').get()
            # hrefs are relative and start with ../.., so strip that segment
            # and prepend the catalogue base URL to get a complete link.
            final_url = self.base_url + url.replace('../..', '')
            yield {
                "link": final_url
            }

All of the book URLs are located within heading tags, so we wrote the XPath expression '//h3/a' to avoid picking up any non-book URLs. This expression only matches links inside h3 headings.
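
If you want to check the selector before running the full spider, you can experiment in Scrapy’s interactive shell. The category URL in the comment below is an assumption based on the books_1 pattern that the rule matches:

    # Open an interactive shell on one of the category pages (assumed URL):
    # scrapy shell 'http://books.toscrape.com/catalogue/category/books_1/index.html'

    # Then inspect a few of the relative hrefs that '//h3/a' picks up:
    response.xpath('//h3/a/@href').getall()[:5]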

The next line prepends the base URL to the returned href to complete it. The hrefs on this site are formatted a little oddly, as each one is preceded by a ../.. segment (every site is different, after all). We use the replace method to strip that segment out by replacing it with an empty string.
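
To make the string manipulation concrete, here is a quick standalone illustration. The href used here is hypothetical, chosen only to match the ../.. shape described above:

    # Hypothetical relative href of the shape found on the listing pages.
    href = '../../a-light-in-the-attic_1000/index.html'
    base_url = 'http://books.toscrape.com/catalogue'

    final_url = base_url + href.replace('../..', '')
    print(final_url)
    # http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html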

Here are the first three records returned by the code above:

{"link": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"},
{"link": "http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"},
{"link": "http://books.toscrape.com/catalogue/soumission_998/index.html"},

Following Links

If you go through our “Following Links with Scrapy” tutorial, you can combine the concept you just learned in this article with link following. This will enable you to create a proper crawler bot that moves from page to page across the web, following and extracting links as it goes.
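
As a taste of what that combination could look like, here is a minimal sketch (not taken from either tutorial) that follows every link on books.toscrape.com and records each page it reaches. The spider name and output field are our own illustrative choices:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class LinkCrawler(CrawlSpider):
        name = 'link_crawler'
        # Restricting the crawl to one domain keeps the example polite;
        # removing allowed_domains would let it wander further afield.
        allowed_domains = ['books.toscrape.com']
        start_urls = ['http://books.toscrape.com/']
        # An empty LinkExtractor matches every link; follow=True keeps crawling.
        rules = [Rule(LinkExtractor(), callback='parse_link', follow=True)]

        def parse_link(self, response):
            # Each followed page yields its own URL.
            yield {"link": response.url}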


This marks the end of the Scrapy Extract Links tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
