This tutorial covers the Scrapy library in Python.
What is Web Scraping?
Web Scraping is the act of downloading or extracting data from web pages. This is done with the use of web scrapers such as Scrapy. It’s also commonly referred to as Web Crawling or Web Spidering, but they all share the same theme.
Web Scraping is often used to create automated web scrapers which periodically extract certain data from web pages. For instance, a scraper could download the weather data from a weather forecasting site every day and store it in a file on your computer.
You could scrape the prices of items from a site looking for the cheapest item. Or you could use it to extract all the links from a web page, or an entire site. The opportunities are unlimited.
Python Web Scrapers
Python has two main web scraping libraries, Scrapy and BeautifulSoup. Before we proceed any further, we’ll explain what makes Scrapy so great by comparing it with BeautifulSoup.
Both of them are free to download and install. The key difference between the two is that BeautifulSoup is only a data parser and extractor, and cannot retrieve information on its own. Hence it is always used together with a library such as requests.
Scrapy, on the other hand, is an all-in-one library able to download, process and save web data all on its own. Scrapy also doubles as a web crawler (or spider) due to its ability to automatically follow links on web pages.
If you’re looking for a simple content parser, BeautifulSoup is probably the better choice. It’s simple and easy to use, whereas Scrapy is a bit more complex and has a steeper learning curve. Scrapy is also more suited to larger projects due to its superior speed and scalability.
Getting Started with Scrapy
Scrapy is a complicated beast, requiring many libraries (dependencies) and a lot of setup before it can be used. Furthermore, people with absolutely no knowledge of HTML or CSS might have a bit of trouble too. As such, you’ll want to read the following articles before continuing.
- First start off with our Scrapy Installation Guide which will walk you through various possible scenarios and installation methods.
- Next is the Scrapy Project Creation guide. Scrapy projects require more preparation and setup than a regular program. This guide will explain the process from start to finish.
You’re almost ready for the proper scraping tutorials below. Scrapy has many small concepts, and we can’t repeatedly explain them in every tutorial. Hence we’ve explained them all here in this Scrapy Basics tutorial.
Using Expressions and Selectors in Scrapy
In order to extract data from sites, Scrapy uses “expressions”. These scan through all the available data and select only the information that we require. You can think of these expressions as a set of rules defining the data we need from the raw scraped data.
We have the choice of using either:
- XPath expressions
- CSS selectors
…to create these expressions in Scrapy. You ought to try both before actually picking one for certain. CSS selectors are generally a bit easier to learn, but XPath is a little more powerful and can be slightly faster. That said, there will be occasions where one will perform better than the other, which is another reason why you should know both.
Important Scrapy Concepts
Once you learn how to use selectors, you can begin web-scraping basic websites. However, to improve your web scraping techniques, write scalable code, and get past the protections on well-guarded sites, you need to explore the following concepts.
Here are some of these important Python Scrapy concepts, for which we have a tutorial available.
- Scrapy Items for Data Grouping
- Scrapy Pipelines for Data processing
- Scrapy Cookies for personalized sessions
- Rotating Proxies in Scrapy for swapping IPs
- Scrapy User-agents to fake browser and OS
- Scrapy Shell for rapid testing
- Scrapy logging for Debugging and Log Keeping
Scrapy Project Tutorials
Scrapy, as we mentioned before, can be used for a variety of different tasks. Because of this, we can’t possibly cover all the various uses within a single tutorial. Hence we’ve created multiple tutorials, each covering a single unique capability of Scrapy.
We’ve narrowed it down to the main uses of Scrapy, listed below. It’s strongly recommended you read through all of them. These tutorial concepts overlap in many areas and help you understand Scrapy as a Python web crawler better. (Do them in order for the best experience.)
Scrapy Data Extractor: This tutorial covers all the basics of scraping data from websites. This tutorial also covers the Link Extractor and Rule classes, which can add an extra layer of functionality to your Scrapy bot while it scrapes.
Scrapy Link Follower: Teaches you how to create a Scrapy bot that can keep following links. The same concept is used in web crawlers like Googlebot (Google’s web crawler). Associated problems and settings are also discussed here.
Extracting links using Scrapy: Similar to the Data Extraction tutorial, but with a much larger focus on extracting links. Demonstrates how to extract links from multiple sites, links of a certain class and more.
Automating Logins with Scrapy: By default, Scrapy approaches the website as a guest user. However, a lot of information on some sites is hidden behind login pages. To get past this, Scrapy offers us the FormRequest feature, with which we can automate a login on many sites.
Dangers of Web Scraping
One final note about Web scraping. Certain websites will not appreciate the presence of bots and have countermeasures set up, such as CAPTCHAs (there are ways to get around this though). This is just a reminder for you to be careful about where you run your Scrapy bots.
For instance, a place like Wikipedia is a good place to practice. The reason is that the data on Wikipedia is freely available, so there won’t be any issues regarding data being scraped (as long as you scrape within limits). There are also sites designed specifically for scrapers like us to practice on, such as QuotestoScrape and BookstoScrape.
Web Scraping is typically done on sites with publicly available data, such as weather forecasting, sports etc. As a general rule, if your web scraper is harmless and you’re not potentially harming another person’s business, you should be safe.
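One way to stay “within limits” is to slow your bot down through Scrapy’s settings. The fragment below is a sketch of a cautious `settings.py`; the setting names are real Scrapy settings, but the values and the user-agent string are only assumptions that you should adjust per site.

```python
# settings.py (fragment): conservative values, chosen for illustration.

# Respect the rules a site publishes in its robots.txt file.
ROBOTSTXT_OBEY = True

# Wait about two seconds between requests to the same site.
DOWNLOAD_DELAY = 2.0

# Never send more than one request at a time to a single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Let Scrapy adapt the delay to how quickly the server responds.
AUTOTHROTTLE_ENABLED = True

# Identify your bot honestly (hypothetical contact URL).
USER_AGENT = "my-scraper (+https://example.com/contact)"
```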
Interested in more Scrapy Tutorials? Check out this article with several Scrapy Projects.
This marks the end of the Python Scrapy Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article material can be asked in the comments section below.