This is a tutorial on the use of CSS selectors in Scrapy.
CSS is a language for applying styles to HTML elements in web pages. CSS "selectors" are the patterns it uses to match styles to specific HTML elements, and Scrapy reuses the same selector syntax to locate content. It’s one of two options you can use to scan through the HTML content of a web page, the other being XPath.
In Scrapy, XPath offers more features than pure CSS selectors, but it’s also a bit harder to learn. XPath is generally the recommended choice, but if CSS selectors are what you understand better, then go for it.
CSS Selectors – Tutorial Objective
The goal of this tutorial is to teach you how to create expressions using CSS selectors for a wide variety of scenarios. These expressions are in charge of scanning the entire HTML page and returning only the information that we require.
It’s a bit difficult to explain through theory alone, so we’ll be using examples to demonstrate how they work.
Scraping URLs in Scrapy using CSS Selectors
There are two things one may be looking for while scraping a URL in Scrapy: the URL itself, stored in the href attribute, and the link text of the URL.
def parse(self, response):
    for quote in response.css('a::text'):
        yield {
            "test": quote.get()
        }
The above returns the link text of all the <a> HTML elements in the page.
def parse(self, response):
    for quote in response.css('a::attr(href)'):
        yield {
            "test": quote.get()
        }
The above code returns the URLs from the href attributes of the <a> HTML elements in the document. You can use the same technique to return the values of other attributes.
Scraping by classes in Scrapy using CSS Selectors
Often a web page contains several groups of the same element. For example, there may be two sets of URLs, one for books and one for images, and you want to scrape only the URLs of the books. So what do you do?
Luckily, web developers usually assign different classes to such elements to keep them distinguishable (this is also an important step in styling the web page).
In this section we’ll explain how to scan for elements with a certain class.
def parse(self, response):
    for quote in response.css('div.content a::attr(href)'):
        yield {
            "test": quote.get()
        }
In the above code, only the URLs located in divs with the class “content” will be returned. This way we can narrow down our search results to the URLs we really want.
In short, the format is html_element.class_name. Also, separating two selectors with a space matches elements nested inside (descendants of) the first.
def parse(self, response):
    for quote in response.css('p.special::text'):
        yield {
            "test": quote.get()
        }
The above code will return the text from paragraphs which have the class “special”.
def parse(self, response):
    for quote in response.css('a.external::attr(href)'):
        yield {
            "test": quote.get()
        }
The above returns all the hyperlinks with the class “external”. On some sites, links pointing to external sites are given a class name containing the word “external”, so this is a useful way to extract only those.
Scraping text in Scrapy
A lot of HTML elements store text in one way or another for various purposes. In this section we’ll explain how to retrieve that text.
def parse(self, response):
    for quote in response.css('h1::text'):
        yield {
            "test": quote.get()
        }
The above code returns the text from the <h1> title tag in the web page. Usually there’s only one <h1>: the main title of the page.
def parse(self, response):
    for quote in response.css('p::text'):
        yield {
            "test": quote.get()
        }
A simple example where we return the text from all paragraphs in the web page.
def parse(self, response):
    for quote in response.css('div::text'):
        yield {
            "test": quote.get()
        }
The above code will return only the text contained directly within the divs on the page. If you also want the text within child elements of the div, like paragraphs and hyperlinks, change the selector to div ::text. The difference is the space before ::text, which extends the match to descendant elements as well.
We already explained how to retrieve hyperlinked text earlier.
Scraping images in Scrapy
In this tutorial section we’ll be demonstrating how to scrape images using CSS selectors in Scrapy.
def parse(self, response):
    for quote in response.css('img::attr(src)'):
        yield {
            "test": quote.get()
        }
The above code returns the src URLs of all the images in the web page. src is the attribute where the URL, or “path”, of an image is stored; it tells the browser where the image file is located. Since src is an attribute, like href, we place it inside attr(), preceded by ::.
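One caveat: src often holds a relative path. In a spider you would normally pass it through response.urljoin() to get an absolute URL; the standard-library urljoin below does the same thing (URLs invented for illustration):

```python
from urllib.parse import urljoin  # Scrapy's response.urljoin() wraps this

page_url = 'https://example.com/gallery/page1.html'
src = '../static/cat.png'
print(urljoin(page_url, src))  # https://example.com/static/cat.png
```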
def parse(self, response):
    for quote in response.css('img::attr(alt)'):
        yield {
            "test": quote.get()
        }
alt is another image attribute; it holds text to be displayed in case the image fails to load. The above code will return the alt attributes of all the images in the web page.
Real-world example using CSS Selectors
Here is a practical example, utilizing several of the concepts we just discussed. We will be scraping the site https://quotes.toscrape.com. This site is a collection of quotes by different authors. Our goal is to find all the quotes by “Albert Einstein”.
Here is the code. To better understand how it works, visit the site itself and observe the HTML structure.
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'Quote'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css("div.quote"):
            one, two = quote.css("span")
            author = two.css("small::text").get()
            if author == "Albert Einstein":
                yield {
                    "quote": one.css("::text").get()
                }
This marks the end of the Scrapy CSS Selectors Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.