Scrapy User Agents



This tutorial explains how to use User Agents in Scrapy.

A User agent is a simple string, or line of text, that the web server uses to identify the web browser and operating system making the request.

When a browser connects to a website, the User agent is included as part of the HTTP headers sent to the website. The contents of the User agent will, of course, vary from browser to browser and from operating system to operating system.


Getting Started

The first thing you need to do is install the scrapy-user-agents library. It adds on directly to your existing Scrapy installation; you just have to run the following command in the command prompt.

pip install scrapy-user-agents

By default, Scrapy identifies itself as a Scrapy bot when accessing websites. Naturally, this can easily result in the bot being blocked by the website. You can change the User agent by dropping the following line in the settings.py file.

USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

Using the Google user agent, known as Googlebot, is a popular and nearly foolproof tactic. It would be foolish for a site to block Googlebot, as that would prevent it from being indexed on Google. Hence, with the Googlebot user agent you can access most sites without issues, though keep in mind that some sites check whether a request claiming to be Googlebot really comes from Google's servers.
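
To confirm which user agent your spider is actually sending, you can request a page that echoes the request headers back. Here is a minimal sketch; the spider name and the use of httpbin.org are purely illustrative and not part of Scrapy or the library.

import scrapy

class HeaderCheckSpider(scrapy.Spider):
    # Hypothetical spider, used only to inspect outgoing headers
    name = 'header_check'
    start_urls = ['https://httpbin.org/headers']

    def parse(self, response):
        # httpbin.org/headers echoes the request headers back as JSON,
        # so the User-Agent we sent appears in the response body
        self.logger.info(response.text)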

This was just the simple way of using User agents in Scrapy, which is probably enough for simple cases. And it doesn’t even require the use of the scrapy-user-agents library. If you really want to take full advantage of User agents in Scrapy, you need to read the next section.


Rotating User Agents in Scrapy

Another important technique to implement is rotating user agents. Using Google's user agent only solves one of many problems: while that user agent is safe from being blocked outright, it will still look suspicious if thousands of requests arrive from it in a short period of time.

Naturally, the solution is to rotate user agents across several different browsers. This makes it appear as if many different browsers are visiting the target site, instead of a single one.

Remember to remove any other User agents you may have set in the settings.py file or in the local settings. Then add the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}

What this does is turn off the original built-in UserAgentMiddleware and replace it with the RandomUserAgentMiddleware from scrapy-user-agents. The value 400 slots it into the position that UserAgentMiddleware normally occupies.

The scrapy-user-agents library has over 2,200 user agents stored in a file, meaning there are 2,200 user agents through which it can rotate, impersonating many different browsers. This of course helps you mask the presence of your Spider while scraping in large quantities.
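
If you want to verify that the rotation is actually happening, you can log the User-Agent header each request went out with. A rough sketch, again using httpbin.org purely for illustration:

import scrapy

class RotationCheckSpider(scrapy.Spider):
    # Hypothetical spider, used only to observe the rotation
    name = 'rotation_check'

    def start_requests(self):
        # Request the same page a few times; dont_filter bypasses
        # Scrapy's duplicate filter so every request is actually sent
        for _ in range(5):
            yield scrapy.Request('https://httpbin.org/headers',
                                 dont_filter=True)

    def parse(self, response):
        # The header that was actually sent is stored on the request,
        # so a different value on each line confirms the rotation
        self.logger.info(response.request.headers.get('User-Agent'))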


Conclusion

And that’s all. You don’t need to make any changes to the actual code itself. You can just run whatever spider you’ve been using so far and the settings will be automatically applied.

If you want to know how to make these settings local (for a single spider), refer to this settings tutorial. If you want to learn more about actually creating Spiders, Crawlers and Followers in Scrapy, refer to our Scrapy Tutorial.
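
As a rough sketch of the local approach (assuming the same middleware configuration shown above), Scrapy lets a spider override project settings through its custom_settings attribute:

import scrapy

class LocalSettingsSpider(scrapy.Spider):
    # Hypothetical spider; these settings apply to this spider only,
    # leaving the project-wide settings.py untouched
    name = 'local_settings_demo'
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
        }
    }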


User Agents and Proxies

As expected of Scrapy, there is a whole combination of different techniques we can use to mask the presence of our Scrapy bot. One other similar and very powerful technique is the use of proxies.

While user agents are used to mask the browser being used, proxies are used to mask the IP address used to access the site. Similar to user agents, proxies can also be changed rapidly to hide your true IP and location.
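
To give a rough idea of how this looks, Scrapy's built-in HttpProxyMiddleware picks up a proxy from a request's meta dictionary. A minimal sketch, where the proxy address is just a placeholder:

import scrapy

class ProxySpider(scrapy.Spider):
    # Hypothetical spider; the proxy address below is a placeholder,
    # not a real working proxy
    name = 'proxy_demo'

    def start_requests(self):
        # Scrapy's built-in HttpProxyMiddleware reads the 'proxy' key
        yield scrapy.Request('https://httpbin.org/ip',
                             meta={'proxy': 'http://proxy.example.com:8080'})

    def parse(self, response):
        # httpbin.org/ip reports the IP the request arrived from
        self.logger.info(response.text)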


This marks the end of the Scrapy User Agents tutorial. Suggestions or Contributions for Coderslegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
