Scrapy XPath Tutorial

This is a tutorial on the use of XPath in Scrapy.

XPath is a language for selecting nodes in XML documents, which can also be used with HTML. It’s one of two options that you can use to scan through HTML content in web pages, the other being CSS selectors.

XPath offers more features than pure CSS selectors, at the cost of being a bit harder to learn. In fact, Scrapy converts CSS selectors to XPath internally. XPath can seem complicated when placed next to its CSS equivalent, but it’s just as easy once you understand how it works.

In the end it’s not a big deal; whichever you understand better is the better choice. But at least give XPath in Scrapy a try before settling on CSS selectors.


The goal of this tutorial is to teach you how to create XPath expressions. These expressions are in charge of scanning the entire HTML page and returning only the information we require.

It’s a bit difficult to explain through theory alone, so we’ll be using examples to demonstrate instead. You can also choose to watch the video version of this tutorial, which offers a more interactive experience.


Scraping URLs in Scrapy

There are two things one may be looking for when scraping a URL in Scrapy: the URL itself, stored in the href attribute, and the link text.

    def parse(self, response):
        for quote in response.xpath('//a/text()'):
            yield {
                "test" : quote.get()
            }

The above returns the link text of all the <a> HTML elements. You can think of text() as a sort of function used to return the text from an element.

    def parse(self, response):
        for quote in response.xpath('//a/@href'):
            yield {
                "test" : quote.get()
            }

The above code returns the URLs from the href attributes of the <a> HTML elements in the document. You can use the same technique (@ followed by the attribute name) to return the values of other attributes.


Scraping by classes in Scrapy

Often a web page contains several sets of the same element. For example, there may be two sets of URLs, one for books and one for images. Now say you want to scrape the URLs of only the books; what do you do?

Luckily, web developers usually assign different classes to such elements as a way to differentiate them (this is also an important step in styling the web page).

In this section we’ll explain how to scan for elements with a certain class.

    def parse(self, response):
        for quote in response.xpath('//div[@class = "content"]//a/@href'):
            yield {
                "test" : quote.get()
            }

In the above code, only the URLs located in the divs that have the class “content” will be returned. This way we can narrow down our results to the URLs we really want.

In short, the format for searching by classes is html_element[@class = "class_name"].

The HTML elements in XPath expressions are separated by the / character. However, a single / represents a direct parent-child step. If we had written //div[@class="content"]/a/@href, then only URLs from <a> elements that are direct children of the content div would have been returned.

However, writing //div[@class="content"]//a/@href returns URLs from <a> elements nested at any depth within the div. Basically, even if the URL is buried deep within child elements of the div, it will be returned.

// can be thought of as the operator for deep nesting: it selects descendants at any depth. This is also why the XPath expression begins with //, since the div with the class of content is itself a descendant of other elements. Not all XPath expressions start with // though; it depends on the scenario.

    def parse(self, response):
        for quote in response.xpath('//p[@class = "paragraph"]/text()'):
            yield {
                "test" : quote.get()
            }

A simple example where paragraphs with the class “paragraph” have their text returned.

    def parse(self, response):
        for quote in response.xpath('//a[@class = "external text"]/@href'):
            yield {
                "test" : quote.get()
            }

The above returns all the hyperlinks whose class attribute is exactly “external text”. On certain sites, links pointing to external sites have a class name containing the word “external”, making this a useful way to extract only those.


Scraping text in Scrapy

A lot of HTML elements store text in one way or another for various purposes. In this section we’ll explain how to retrieve that text.

    def parse(self, response):
        for quote in response.xpath('//h1/text()'):
            yield {
                "test" : quote.get()
            }

The above code returns the text from the H1 title tags in the web page. Usually there’s only one H1 though: the main title of the page.

    def parse(self, response):
        for quote in response.xpath('//p/text()'):
            yield {
                "test" : quote.get()
            }

A simple example where we return the text from all paragraphs in the web page.

    def parse(self, response):
        for quote in response.xpath('//div/text()'):
            yield {
                "test" : quote.get()
            }

The above code will return text contained directly within any divs on the page. If you also want the text within child elements of the div, like paragraphs and hyperlinks, change it to //div//text().

We already explained how to retrieve hyperlinked text earlier.


Scraping images in Scrapy

In this tutorial section we’ll be demonstrating how to scrape images using XPath in Scrapy.

    def parse(self, response):
        for quote in response.xpath('//img/@src'):
            yield {
                "test" : quote.get()
            }

The above code returns the src URLs of all the images in the web page. src is the attribute where the URL, or “path”, of an image is stored; it represents the location where the image file is kept.

Since src is an attribute, like href, we prefix it with @ when writing it into the XPath.

    def parse(self, response):
        for quote in response.xpath('//img/@alt'):
            yield {
                "test" : quote.get()
            }

alt is another image attribute; it holds fallback text to be displayed in case the image does not load correctly. The above code returns the alt attributes of all the images in the web page.


This marks the end of the Scrapy XPath Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
