BeautifulSoup User Agents in Python

BeautifulSoup is a popular Python library that simplifies web scraping by allowing you to easily extract useful information from HTML (web) files. In this article, we will explain what User Agents are, why they are essential for web scraping, and how to use them with BeautifulSoup.


What is a User Agent?

The User-Agent is an HTTP header that identifies the client making a request. It typically contains information about the client’s software, operating system, and device type. Web servers use this information to decide how to handle the request, for example, serving the mobile version of a web page to a mobile device.
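
For illustration, here is roughly what the headers of a plain HTTP GET request might look like (a hypothetical request to example.com); the User-Agent line is the one we care about:

GET /index.html HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36
Accept: text/html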


Why are User Agents important for Web Scraping?

The User-Agent is critical for web scraping because some websites block requests from known web scraping bots or scripts. Websites can identify these bots by analyzing the headers in the HTTP request. Most automated scripts and scraping libraries, like Requests and Scrapy, send a “default” user-agent which identifies them as a bot. This makes it easy for a website to identify and block you.

By using a custom User-Agent header, web scrapers can mimic a web browser and avoid detection. We can use a generated User-Agent, such as the one below, to pass ourselves off as an actual browser.

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 OPR/83.0.4254.62

We will use the above user agent in this tutorial.


How to use User Agents in BeautifulSoup?

First, you need to understand that BeautifulSoup actually has nothing to do with the internet. It is not responsible for sending requests to a website or downloading the HTML content, nor is it responsible for sending any user agents.

All of this is the job of the Requests library in Python, which fetches the HTML content of a web page for BeautifulSoup to analyze.
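
To make this division of labor concrete, here is a minimal sketch of the typical workflow (using example.com as a placeholder URL): Requests downloads the page, and BeautifulSoup parses the result.

import requests
from bs4 import BeautifulSoup

# Requests handles the network: it downloads the raw HTML
url = 'https://example.com'
response = requests.get(url)

# BeautifulSoup handles the parsing: it never touches the network
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title.text)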

So our actual question here is: “How to use User Agents in the Requests Module”.

Let’s begin.

We will start by examining the default headers that the Requests library sends. The User-Agent information is stored within the headers of the request object.

import requests

url = 'https://google.com'
response = requests.get(url)

# Print the User-Agent header that Requests attached to our request
print(response.request.headers["User-Agent"])

Output:
python-requests/2.25.1

As we can see from the output above, our Python program identifies itself as the Requests library. This is an immediate giveaway to any website that cares whether you are a bot or not.


Let’s fix this issue. By passing a custom user agent into the headers parameter, we can override the default User-Agent.

import requests

url = 'https://google.com'

# The same User-Agent string we generated earlier in the tutorial
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 OPR/83.0.4254.62'}

# The headers parameter overrides the default User-Agent
response = requests.get(url, headers=headers)
print(response.request.headers["User-Agent"])

Output:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 OPR/83.0.4254.62

As you can see from the output, this time our user agent resembles an actual browser. You are now one step closer to remaining undetected while scraping data off the internet.
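
If you want to confirm what a server actually receives, you can point the same request at a header-echoing service such as httpbin.org (assuming the service is reachable from your machine):

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 OPR/83.0.4254.62'}

# httpbin.org/headers echoes back the headers it received, as JSON
response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json()['headers']['User-Agent'])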

It’s important to note that the User-Agent strings used in the examples above are just a few of many. You can find lists of User-Agent strings for popular web browsers and devices online. It’s essential to choose a User-Agent that closely matches the web browser and device you want to mimic to avoid detection.
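
Once you have such a list, a common extension is to rotate between several User-Agent strings so that your requests do not all share the same fingerprint. Below is a minimal sketch of this idea; the strings in the pool are illustrative placeholders that you should replace with real, up-to-date values from a published list.

import random
import requests

# A small pool of illustrative User-Agent strings (swap in real,
# up-to-date values for the browsers you want to mimic)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0',
]

def fetch(url):
    # Pick a different browser identity for each request
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

response = fetch('https://example.com')
print(response.request.headers['User-Agent'])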


This marks the end of the BeautifulSoup User Agents – Python Tutorial. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the article content can be asked in the comments section below.
