Python Scrapy Project Examples

This article is a compilation of Scrapy project examples written in Python.

This isn’t a tutorial in its own right. It’s a collection of Scrapy programs from our various tutorials across our site, CodersLegacy. Each project example has a brief description of what it does, along with a link to its respective tutorial where you can learn how to build it yourself.

You can also think of this as a place to get ideas for your own Scrapy projects from the Python examples we show here.


Extracting Data

This is a Scrapy Spider with a rather simple purpose. It goes through the entire quotes.toscrape.com site, extracting every available quote along with the name of the author who said it.

Scraping an entire site can be a pretty complex task, which is why we also make use of the Rule class, which defines a set of rules for the Spider to follow while scraping the site.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SuperSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    base_url = 'http://quotes.toscrape.com'

    # Follow pagination links ("page/") but skip tag pages ("tag/")
    rules = [Rule(LinkExtractor(allow='page/', deny='tag/'),
                  callback='parse_filter_book', follow=True)]

    def parse_filter_book(self, response):
        for quote in response.css('div.quote'):
            yield {
                'Author': quote.xpath('.//small[@class="author"]/text()').get(),
                'Quote': quote.xpath('.//span[@class="text"]/text()').get(),
            }
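
If you want to try this spider out, it is normally run with the scrapy crawl command from inside a Scrapy project, but it can also be launched as a standalone script. The sketch below assumes the SuperSpider class above is defined in the same file; the output filename and the FEEDS setting (available from Scrapy 2.1 onwards) are just examples.

from scrapy.crawler import CrawlerProcess

if __name__ == '__main__':
    process = CrawlerProcess(settings={
        # Write everything the spider yields to a JSON file (filename is an example)
        'FEEDS': {'quotes.json': {'format': 'json'}},
    })
    process.crawl(SuperSpider)
    process.start()  # blocks here until the crawl is finished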

-> Link to Tutorial


Extracting Links

This project example features a Scrapy Spider that scans a Wikipedia page and extracts all the links from it, storing them in an output file.

This can easily be expanded to crawl through all of Wikipedia, although the total time required to scrape it would be very long.

from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'extractor'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Python_(programming_language)']
    base_url = 'https://en.wikipedia.org'

    def parse(self, response):
        # Anchor tags inside paragraph text (skips navigation, sidebars, etc.)
        for link in response.xpath('//div/p/a'):
            yield {
                "link": self.base_url + link.xpath('.//@href').get()
            }
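
One thing to note: prepending base_url only works for hrefs that are relative to the site root. A slightly more robust variant (a sketch, not the original tutorial code) lets Scrapy resolve the URLs with response.urljoin():

from scrapy.spiders import Spider

class LinkSpider(Spider):
    # Hypothetical variant of the spider above; the name is illustrative
    name = 'link_extractor'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Python_(programming_language)']

    def parse(self, response):
        for href in response.xpath('//div/p/a/@href').getall():
            # urljoin resolves relative, absolute and protocol-relative hrefs correctly
            yield {'link': response.urljoin(href)}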

-> Link to Tutorial


Link Follower

The project example below is that of a Spider that “follows” links. This means that it can read a link, open the page to which it leads, and begin extracting data from that page. You can even follow links continuously until your spider has crawled and followed every link on the entire site.

This way, you don’t have to include all the URLs in start_urls; just one is required.

The only reason we’ve set the depth limit to 1 is to keep the total time of the scraping reasonable (More on this in the tutorial).

from scrapy.spiders import CrawlSpider

class SuperSpider(CrawlSpider):
    name = 'follower'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ['https://en.wikipedia.org/wiki/Web_scraping']
    base_url = 'https://en.wikipedia.org'

    custom_settings = {
        'DEPTH_LIMIT': 1
    }

    def parse(self, response):
        # Follow every link found inside paragraph text on the current page
        for next_page in response.xpath('.//div/p/a'):
            yield response.follow(next_page, self.parse)

        # Extract the heading of each page we visit
        for quote in response.xpath('.//h1/text()'):
            yield {'quote': quote.extract()}

-> Link to Tutorial


Scrapy Automated Login

Another powerful feature of Scrapy is FormRequest, which allows for automated logins to sites. While most sites you want to scrape won’t require it, some sites only expose their data after a successful login.

Using FormRequest, we can make the Scrapy Spider imitate this login, as shown below.

import scrapy
from scrapy.spiders import CrawlSpider

class ScrapySpider(CrawlSpider):
    name = 'login'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/login']

    def parse(self, response):
        # Collect every input field in the login form (including the hidden CSRF token)
        inputs = response.css('form input')
        print(inputs)

        formdata = {}
        for field in inputs:
            name = field.css('::attr(name)').get()
            value = field.css('::attr(value)').get()
            formdata[name] = value

        formdata['username'] = 'YOUR_USERNAME'
        formdata['password'] = 'YOUR_PASSWORD'

        # Submit the form with our credentials filled in
        return scrapy.FormRequest.from_response(
            response,
            formdata=formdata,
            callback=self.parse_after_login
        )

    def parse_after_login(self, response):
        print(response.xpath('.//div[@class = "col-md-4"]/p/a/text()').get())
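
A common sanity check (not part of the tutorial code above) is to confirm that the login actually succeeded before scraping any protected data. On quotes.toscrape.com, a Logout link appears once you are logged in, so a drop-in variant of parse_after_login inside the spider class above could look something like this:

    def parse_after_login(self, response):
        # If the page offers a "Logout" link, the login went through
        if response.xpath('//a[contains(@href, "logout")]'):
            self.logger.info("Login successful")
        else:
            self.logger.warning("Login appears to have failed")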

-> Link to Tutorial


Additional Features

Scrapy has many more features you can use to further enhance and improve your Spider. Beyond the examples we discussed above, we have compiled all of the main features that might interest you; a few common settings are sketched below.
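
For instance, the snippet below lists a few settings that come up frequently. The values are purely illustrative, and they can go either in a project’s settings.py or in a spider’s custom_settings dictionary, as we did with DEPTH_LIMIT earlier.

# A few commonly used Scrapy settings (illustrative values)
custom_settings = {
    'DOWNLOAD_DELAY': 1,           # wait (in seconds) between requests to the same site
    'ROBOTSTXT_OBEY': True,        # respect the site's robots.txt rules
    'DEPTH_LIMIT': 2,              # limit how deep the crawl follows links
    'AUTOTHROTTLE_ENABLED': True,  # adapt the crawl rate to how fast the server responds
    'CONCURRENT_REQUESTS': 8,      # cap the number of requests in flight at once
}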


This marks the end of the Python Scrapy Project Examples article. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article can be asked in the comments section below.
