This tutorial covers the Scrapy library in Python.
Web scraping is the act of downloading and extracting data from web pages, using tools called web scrapers, such as Scrapy. It’s also commonly referred to as web crawling or web spidering. Strictly speaking, crawling means following links from page to page, but the terms all share the same theme.
Web scraping is often used to build automated scrapers that periodically extract certain data from web pages. For instance, one could download the weather data from a weather forecasting site every day and store it in a file on your computer.
You could scrape the prices of items from a site looking for the cheapest item. Or you could use it to extract all the links from a web page, or an entire site. The opportunities are unlimited.
Python Web Scrapers
Python has two main web scraping libraries, Scrapy and BeautifulSoup. Before we proceed any further, we’ll explain what makes Scrapy so great by comparing it with BeautifulSoup.
Both are free to download and install. The key difference between the two is that BeautifulSoup is only a data parser and extractor, and cannot retrieve web pages on its own. Hence it is almost always used together with a library such as requests, which does the downloading.
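As a rough sketch of that division of labour, the snippet below hands BeautifulSoup an HTML string and lets it do the parsing and extracting. The markup, class names and prices are made up for illustration. In a real script the string would be fetched separately, for example with requests.get(url).text.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page. BeautifulSoup does
# not fetch anything itself; in practice this string would come from a
# separate library, e.g. requests.get(url).text.
html = """
<ul>
  <li class="item"><span class="name">Keyboard</span><span class="price">24.99</span></li>
  <li class="item"><span class="name">Mouse</span><span class="price">9.50</span></li>
  <li class="item"><span class="name">Monitor</span><span class="price">129.00</span></li>
</ul>
"""

# "html.parser" is the parser bundled with Python, so nothing extra is needed.
soup = BeautifulSoup(html, "html.parser")

# Map each item name to its price.
prices = {}
for row in soup.find_all("li", class_="item"):
    name = row.find("span", class_="name").get_text(strip=True)
    price = float(row.find("span", class_="price").get_text(strip=True))
    prices[name] = price

# Pick the item with the lowest price.
cheapest = min(prices, key=prices.get)
print(cheapest, prices[cheapest])
```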
Scrapy, on the other hand, is an all-in-one library, able to download, process and save web data all on its own. Scrapy also doubles as a web crawler (or spider) due to its ability to automatically follow links on web pages.
If you’re looking for a simple content parser, BeautifulSoup is probably the better choice. It’s simple and easy to use, whereas Scrapy is a bit more complex and has a steeper learning curve. Scrapy is better suited to larger projects due to its superior speed and scalability.
Getting Started with Scrapy
Scrapy is a complicated beast, requiring many libraries (dependencies) and a lot of setup before it can be used. Furthermore, people with absolutely no knowledge of HTML or CSS might have a bit of trouble too. As such, you should read the following articles before continuing.
First, start with our Scrapy Installation Guide, which will walk you through various possible scenarios and installation methods.
Next is the Scrapy Project Creation guide. Scrapy projects require more preparation than the average program. This guide will explain the process.
You’re almost ready for the proper scraping tutorials below. Scrapy has many small concepts, and we can’t repeatedly explain them in every tutorial. Hence we’ve explained them all here in this Scrapy Basics tutorial.
(OPTIONAL) In order to extract data from sites, Scrapy uses “expressions”. These scan through all the available data and select only the information that we require. You can think of these expressions as a set of rules defining the data we need.
We have the choice of using either CSS selectors or XPath to create these expressions in Scrapy. You ought to try both before settling on one. Whether you learn this step before or after the tutorials below is up to you.
Scrapy Project Tutorials
Scrapy, as we mentioned before, can be used for a variety of different tasks. Because of this, we can’t possibly cover all of its uses within a single tutorial. Hence we’ve created multiple tutorials, each covering a single capability of Scrapy.
We’ve narrowed it down to three main uses of Scrapy, listed below. It’s strongly recommended that you read through all of them. The concepts in these tutorials overlap in many areas, and together they will help you better understand Scrapy as a Python web crawler. (Do them in order for the best experience.)
Scrapy Data Extractor: Covers all the basics of scraping data from websites. It also covers the LinkExtractor and Rule classes, which can add an extra layer of functionality to your Scrapy bot while it scrapes.
Scrapy Link Follower: Teaches you how to create a Scrapy bot that can keep following links. The same concept is used in web crawlers like Googlebot (Google’s web crawler). Associated problems and settings are also discussed here.
Extracting links using Scrapy: Similar to the Data Extraction tutorial, but with a much larger focus on extracting links. Demonstrates how to extract links from multiple sites, links of a certain class and more.
One final note about web scraping. Certain websites do not appreciate the presence of bots and have countermeasures in place, such as CAPTCHAs (though there are ways to get around these). This is just a reminder to be careful about where you run your Scrapy bots.
For instance, Wikipedia is a good place to practice. The reason is that the data on Wikipedia is freely available, so there won’t be any issues regarding data being scraped. There are also sites designed specifically for scrapers like us to practice on, such as QuotestoScrape and BookstoScrape.
Web scraping is typically done on sites with publicly available data, such as weather forecasts, sports results and so on. As a general rule, if your web scraper is harmless and you’re not potentially harming another person’s business, you should be safe.
This marks the end of the Python Scrapy Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article material can be asked in the comments section below.