This tutorial explains how to follow links using Python Scrapy.
Creating a Scrapy bot that follows links is one of the most common things people want from Scrapy. If you know anything about search engines like Google, you’ll know that they use crawlers to search through the entire web, following links until they have everything indexed in their database.
We’ll be recreating such a web crawler here using Python Scrapy, one that can follow links from one web page to another.
(The reason people aren’t making their own search engines left and right is that scanning through trillions of links every day requires a massive amount of computing power.)
Scrapy – Follow Links Example
The start_urls has been assigned the URL of the web scraping page on Wikipedia. You may start from wherever you wish (depending on your goal), such as the homepage of Wikipedia.
We’ve restricted allowed_domains to the English Wikipedia only, en.wikipedia.org. This prevents the Scrapy bot from following and scraping links on domains other than Wikipedia. You may remove this restriction if you wish, but be aware of the possible effects.
The DEPTH_LIMIT setting is also very important. Assigning it a value of 1 ensures that the spider only follows links to a depth of 1. In other words, it will follow each link on the start page to its page, but will not follow the links found on that new page.
import scrapy


class SuperSpider(scrapy.Spider):
    # A plain Spider is enough here: no crawl rules are defined,
    # and parse() handles the link following itself.
    name = 'follower'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Web_scraping']
    base_url = 'https://en.wikipedia.org'

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        # Follow every link found inside a paragraph of the page content.
        for next_page in response.xpath('.//div/p/a'):
            yield response.follow(next_page, self.parse)

        # Extract the page title from the H1 element.
        for quote in response.xpath('.//h1/text()'):
            yield {'quote': quote.get()}
There are two for loops in the parse function. (The parse function is called automatically by the Scrapy bot.) The first for loop is responsible for following the links found on the page, and the second is responsible for extracting the text from the H1 HTML element (the title).
Just so you understand, the links that we want to scrape in Wikipedia are contained in paragraphs, which are in turn contained within divs. The XPath we have defined (.//div/p/a) will only return the links from the content, not links from other locations, such as the login link. Examining the layout of the page is important before attempting to scrape it.
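To see how the .//div/p/a expression separates content links from other links, here is a small sketch on a made-up, heavily simplified page. It uses the standard library’s ElementTree as a stand-in, since its limited XPath subset happens to support this expression; Scrapy’s own selectors use a full XPath engine, and real Wikipedia markup is far messier. The HTML and the variable names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: one navigation link outside any paragraph,
# and two content links inside a paragraph.
PAGE = """
<html>
  <body>
    <div id="nav">
      <a href="/login">Log in</a>
    </div>
    <div id="content">
      <p>See <a href="/wiki/Data_scraping">data scraping</a> and
         <a href="/wiki/Web_crawler">web crawler</a>.</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Same expression as in the spider: <a> elements that are direct children
# of a <p>, which is itself a direct child of a <div>.
content_links = [a.get("href") for a in root.findall(".//div/p/a")]

# Every <a> on the page, for comparison.
all_links = [a.get("href") for a in root.iter("a")]

print(content_links)
print(all_links)
```

The login link appears only in the second list, because its a element sits directly inside a div rather than inside a paragraph.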
Here are the first 10 records. As you can see, almost all of them are closely related to web scraping. The higher the depth limit, the more “varied” the results become, as we get further and further from the start URL.
{"quote": "Web scraping"},
{"quote": "Data scraping"},
{"quote": "Comparison shopping website"},
{"quote": "Data mining"},
{"quote": "XHTML"},
{"quote": "HTML"},
{"quote": "Web page"},
{"quote": "Web mining"},
{"quote": "Web indexing"},
{"quote": "Web data integration"},
As you’ve probably noticed, we’ve used XPath twice in the above example. Here’s how to do the same thing with CSS selectors, if you’re interested.
def parse(self, response):
    # Follow links inside paragraphs of the main article content.
    for next_page in response.css('div.mw-parser-output > p > a'):
        yield response.follow(next_page, self.parse)

    # Extract the page title, mirroring the XPath version.
    for quote in response.css('h1::text'):
        yield {'quote': quote.get()}
Changing the DEPTH_LIMIT in Scrapy
Just as a test, I changed the DEPTH_LIMIT from 1 to 2 and then ran the code to see the result.
I let the code run for almost 10 minutes, then stopped it manually (Ctrl + C) since it just wasn’t stopping. We ended up with over 2200 titles in our follower.json file. We likely would have ended up at around 10,000 had the code continued until its completion. The point is to show that increasing the depth limit increases the required time exponentially.
A large website like Wikipedia, which has hundreds of links on each page, each leading to pages with hundreds of links of their own, will take a long time to crawl. This isn’t the case for smaller websites though.
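To put rough numbers on this growth, here is a back-of-the-envelope sketch. It assumes every page has the same number of links (200 is a made-up figure) and that no link is ever repeated, so it gives an upper bound rather than a prediction. The function name max_requests is hypothetical.

```python
def max_requests(links_per_page: int, depth_limit: int) -> int:
    """Upper bound on pages requested: 1 + b + b^2 + ... + b^depth_limit."""
    return sum(links_per_page ** level for level in range(depth_limit + 1))


LINKS_PER_PAGE = 200  # assumed average, not a measured value

for depth in (1, 2, 3):
    print(depth, max_requests(LINKS_PER_PAGE, depth))
```

In practice the real count is much lower, because Scrapy filters out duplicate requests and many Wikipedia pages link to the same articles, which is consistent with the estimate of around 10,000 titles at a depth of 2.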
This doesn’t mean you shouldn’t increase the DEPTH_LIMIT to values greater than 1. It means you should consider your situation and goal carefully and decide accordingly.
If your goal is to scrape the entire Wikipedia, then you could simply remove the DEPTH_LIMIT option, removing any restrictions.
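If you do raise or remove the limit, a few other Scrapy settings can help keep the crawl bounded and polite. The sketch below shows one possible combination; every value in it is an illustrative assumption, not a recommendation from this tutorial.

```python
# Hypothetical values; tune them for your own crawl.
CUSTOM_SETTINGS = {
    'DEPTH_LIMIT': 2,               # 0 (Scrapy's default) means no limit
    'CLOSESPIDER_PAGECOUNT': 5000,  # close the spider after this many responses
    'DOWNLOAD_DELAY': 0.5,          # seconds to wait between requests
    'ROBOTSTXT_OBEY': True,         # respect the site's robots.txt rules
}
```

A dictionary like this could be assigned to the spider’s custom_settings attribute in place of the single DEPTH_LIMIT entry used earlier.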
Read this tutorial on settings in Scrapy to learn about other great settings.
This marks the end of the Following Links in Scrapy tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
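This seems to work better when just importing scrapy and inheriting from scrapy.Spider rather than from CrawlSpider. A likely reason: CrawlSpider is built around rules, and the Scrapy documentation warns against using parse as a callback there, since CrawlSpider uses the parse method for its own logic.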