Scrapy User Agents



This tutorial explains how to use User Agents in Scrapy.

A User agent is a simple string, or line of text, that the web server uses to identify the web browser and operating system making the request.

When a browser connects to a website, the User agent is included as part of the HTTP headers sent to the website. The contents of the User agent will, of course, vary from browser to browser and from operating system to operating system.


Getting Started

The first thing you need to do is install the scrapy-user-agents library. It adds on directly to your existing Scrapy installation; you just have to run the following command in the command prompt.

pip install scrapy-user-agents

By default, Scrapy identifies itself as a Scrapy bot when accessing websites. Naturally, this can easily result in the bot being blocked by the website. You can change the User agent by dropping the following line in the settings.py file.

USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

Using the Google user agent, known as Googlebot, is a popular and nearly foolproof tactic. It would be foolish for a site to block Googlebot, as that would prevent it from being indexed on Google. Hence, with the Googlebot user agent you can access most sites without issues, though keep in mind that some sites check whether a request claiming to be Googlebot really comes from Google's servers.
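
To confirm which user agent your spider is actually sending, you can request a page that echoes the request headers back. Here is a minimal sketch; the spider name and the use of httpbin.org are purely illustrative and not part of Scrapy or the library.

import scrapy

class HeaderCheckSpider(scrapy.Spider):
    # Hypothetical spider, used only to inspect outgoing headers
    name = 'header_check'
    start_urls = ['https://httpbin.org/headers']

    def parse(self, response):
        # httpbin.org/headers echoes the request headers back as JSON,
        # so the User-Agent we sent appears in the response body
        self.logger.info(response.text)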

This was just the simple way of using User agents in Scrapy, which is probably enough for simple cases. And it doesn’t even require the use of the scrapy-user-agents library. If you really want to take full advantage of User agents in Scrapy, you need to read the next section.


Rotating User Agents in Scrapy

Another important technique to implement is rotating user agents. Using Google's user agent only solves one of many problems: while that user agent is safe from being blocked outright, it will still look suspicious if thousands of requests arrive from it in a short period of time.

Naturally, the solution is to rotate user agents across several different browsers. This makes it appear as if many different browsers are visiting the target site, instead of a single one.

Remember to remove any other User agents you may have set in the settings.py file or in the local settings. Then add the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

What this does is turn off the original built-in UserAgentMiddleware and replace it with the RandomUserAgentMiddleware from scrapy-user-agents. The value 400 slots it into the position that UserAgentMiddleware normally occupies.

The scrapy-user-agents library has over 2,200 user agents stored in a file, meaning there are 2,200 user agents through which it can rotate, impersonating many different browsers. This of course helps you mask the presence of your Spider while scraping in large quantities.
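
If you want to verify that the rotation is actually happening, you can log the User-Agent header each request went out with. A rough sketch, again using httpbin.org purely for illustration:

import scrapy

class RotationCheckSpider(scrapy.Spider):
    # Hypothetical spider, used only to observe the rotation
    name = 'rotation_check'

    def start_requests(self):
        # Request the same page a few times; dont_filter bypasses
        # Scrapy's duplicate filter so every request is actually sent
        for _ in range(5):
            yield scrapy.Request('https://httpbin.org/headers',
                                 dont_filter=True)

    def parse(self, response):
        # The header that was actually sent is stored on the request,
        # so a different value on each line confirms the rotation
        self.logger.info(response.request.headers.get('User-Agent'))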


Conclusion

And that’s all. You don’t need to make any changes to the actual code itself. You can just run whatever spider you’ve been using so far and the settings will be automatically applied.

If you want to know how to make these settings local (for a single spider), refer to this settings tutorial. If you want to learn more about actually creating Spiders, Crawlers and Followers in Scrapy, refer to our Scrapy Tutorial.
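
As a rough sketch of the local approach (assuming the same middleware configuration shown above), Scrapy lets a spider override project settings through its custom_settings attribute:

import scrapy

class LocalSettingsSpider(scrapy.Spider):
    # Hypothetical spider; these settings apply to this spider only,
    # leaving the project-wide settings.py untouched
    name = 'local_settings_demo'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
        }
    }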


User Agents and Proxies

As expected of Scrapy, there is a whole combination of different techniques we can use to mask the presence of our Scrapy bot. One other similar and very powerful technique is the use of proxies.

While user agents are used to mask the browser being used, proxies are used to mask the IP address used to access the site. Similar to user agents, proxies can also be changed rapidly to hide your true IP and location.
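
To give a rough idea of how this looks, Scrapy's built-in HttpProxyMiddleware picks up a proxy from a request's meta dictionary. A minimal sketch, where the proxy address is just a placeholder:

import scrapy

class ProxySpider(scrapy.Spider):
    # Hypothetical spider; the proxy address below is a placeholder,
    # not a real working proxy
    name = 'proxy_demo'

    def start_requests(self):
        # Scrapy's built-in HttpProxyMiddleware reads the 'proxy' key
        yield scrapy.Request('https://httpbin.org/ip',
                             meta={'proxy': 'http://proxy.example.com:8080'})

    def parse(self, response):
        # httpbin.org/ip reports the IP the request arrived from
        self.logger.info(response.text)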


This marks the end of the Scrapy User Agents tutorial. Suggestions or Contributions for Coderslegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
