This article is about BeautifulSoup 4 (bs4), a Python library, and its use in web scraping.
What is BeautifulSoup?
BeautifulSoup is a web scraping library in Python. Or, in more formal lingo, it is used to extract meaningful data from HTML and XML files. What is web scraping though? Also known as web data extraction, it is the act of extracting data from websites. See more on web scraping at Wikipedia.
Before we get into the real stuff, let’s go over a few basic things first. For one, you might ask what the term ‘bs4’ means. It stands for BeautifulSoup 4, which is the current version of BeautifulSoup. BeautifulSoup 3’s development stopped ages ago, and its support ended on December 31st, 2020.
BeautifulSoup (bs4) is a Python library that only parses markup, so it depends on other libraries to do the rest. You can’t use BeautifulSoup alone to acquire data from a website.
For one, you need a library like requests to connect to the website and download the page first. And since BeautifulSoup doesn’t have the advanced features of its counterpart, Scrapy, you might end up needing one or two more. Most tasks will only require two (requests and bs4) however, so don’t stress.
Don’t be scared off though! It’s much easier than it seems, and it gets easier with practice. Moreover, you’ll find that many concepts will carry over should you decide to continue with Scrapy.
Understanding the process
Web scraping itself is a three-part process. BeautifulSoup just happens to represent the most difficult part of it. The first two steps will be over within a few lines.
- Connecting to the website with the requests library and downloading its HTML or XML data.
- Parsing the HTML or XML data with BeautifulSoup. This works by creating a soup object from the extracted data. The soup object contains the parsed data.
- Navigating through the parsed data and retrieving the information we need.
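The parsing and navigating steps above can be sketched without touching the network. This is only an illustration: the SAMPLE_HTML string stands in for what requests would download in the first step, and the helper names parse_html and extract_summary are made up for this example, not part of bs4.

```python
from bs4 import BeautifulSoup

# Stand-in for step 1: in a real scraper this string would come from requests.
SAMPLE_HTML = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <a href="http://example.com/one">First link</a>
  </body>
</html>
"""


def parse_html(html):
    """Step 2: build a soup object from raw HTML (hypothetical helper)."""
    return BeautifulSoup(html, "html.parser")


def extract_summary(soup):
    """Step 3: navigate the parsed tree and pull out what we need."""
    return {
        "title": str(soup.title.string),  # text inside <title>
        "first_link": soup.a["href"],     # href of the first <a>
    }


summary = extract_summary(parse_html(SAMPLE_HTML))
print(summary)
```

Keeping the steps in separate functions makes it easy to swap the sample string for a real download later on.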
It’s difficult to define one specific part of the Python BeautifulSoup syntax, so first we’ll show you how to create the object, followed by the many commands used in bs4.
For now, we are going to assume that we already have the HTML data extracted and stored in a variable called html_file. We’ll demonstrate the full process by the end of this.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html_file, 'html.parser')
The BeautifulSoup constructor in the above code parses the HTML using Python’s built-in html.parser and creates a soup object, stored in soup. Once you have this object, you can carry out commands to retrieve information about the page. See below.
    soup.title
    # <title>Page Title</title>

    soup.title.name
    # 'title'

    soup.title.string
    # 'Page Title'

    soup.title.parent.name
    # 'head'

    soup.p
    # <p class="title">Text Text Text</p>

    soup.p['class']
    # ['title']  (class can hold several values, so bs4 returns a list)

    soup.a
    # <a href="http://example.com/object1" id="link1">Random Link 1</a>

    soup.find_all('a')
    # [<a href="http://example.com/object1" id="link1">Random Link 1</a>,
    #  <a href="http://example.com/object2" id="link2">Random Link 2</a>,
    #  <a href="http://example.com/object3" id="link3">Random Link 3</a>]

    soup.find(id="link3")
    # <a href="http://example.com/object3" id="link3">Random Link 3</a>
This code shows several kinds of lookups that can be carried out on a soup object, along with their expected outputs.
Usually, simply accessing an element with soup.a or soup.p will return the whole element: both its contents and its HTML tags. To separate the contents from the HTML, you have to use another method. Learn how to do so below.
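As a quick preview, here is a small self-contained sketch of the difference between an element and its contents. The one-line HTML snippet is invented for the example; str(), get_text() and attrs are standard parts of bs4’s Tag objects.

```python
from bs4 import BeautifulSoup

# Made-up snippet, just enough markup to show the difference.
html = '<p class="title">Text <b>Text</b> Text</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
markup = str(tag)        # the element including its HTML tags
text = tag.get_text()    # only the text inside the element
attributes = tag.attrs   # the element's attributes as a dict

print(markup)
print(text)
print(attributes)
```

The element itself keeps the nested tags, while get_text() drops them and returns plain text.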
We’ll be using the Wikipedia Web scraping page in this practice example. We’ll be going through this step by step.
    from bs4 import BeautifulSoup
    import requests

    req = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
    soup = BeautifulSoup(req.content, 'html.parser')
This part of the code imports the necessary libraries and connects to our target site. A Response object is stored in req. The .content attribute of the Response object holds the raw HTML data. We can then parse this data, creating a soup object.
The requests module is an important companion to BeautifulSoup! Make sure to read up on it in this requests tutorial.
Dealing with Hyperlinks
    print(soup.a)

    X = soup.find_all('a')
    count = 0
    for i in X:
        count += 1
    print(count)
Accessing soup.a returns only a single hyperlink: the first <a> element on the page, which is what gets printed here. Accessing it on a paragraph would return the first hyperlink in that paragraph. The find_all() function, however, returns all hyperlinks. Here, we use it to count all the hyperlinks in the page. See the output below.
    <a id="top"></a>
    350
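The same ideas can be tried offline on a small hand-written snippet. The markup and link paths below are invented for illustration; the sketch also shows that len() on the find_all() result gives the count directly, as an alternative to a manual counter.

```python
from bs4 import BeautifulSoup

# Invented markup that loosely mimics the structure of the Wikipedia page.
html = """
<div><a id="top"></a></div>
<p>Intro with <a href="/wiki/Data_scraping">data scraping</a> and
<a href="/wiki/Web_crawler">web crawler</a>.</p>
<p>Another <a href="/wiki/Bot">bot</a> link.</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a                    # first <a> anywhere in the document
first_in_paragraph = soup.p.a          # first <a> inside the first <p>
link_count = len(soup.find_all("a"))   # find_all() returns a list-like result

print(first_link)
print(first_in_paragraph["href"])
print(link_count)
```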
Keep in mind that these hyperlinks have their HTML content attached. But what if we want only the text, or maybe just the URL? For this we can use the get_text() function and the href attribute respectively. We’ll be using the second hyperlink on this page, as the first doesn’t appear to have any content.
    # The second hyperlink on the page:
    # <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
    b = soup.find_all('a')[1]
    print(b['href'])
    print(b.get_text())
We use the object['attribute'] format when looking up an attribute of an HTML element. Examples of attributes include id, href and class. For hyperlinks though, you can also use the link.get('href') function. It returns the exact same value as link['href'] when the attribute exists; if it is missing, get() returns None while the bracket form raises a KeyError.
object.get_text() will return the text in a given object. If you use it on a paragraph, it will return all the text in that paragraph without any HTML tags. Furthermore, if you use this function on the soup object itself, all the text in the page will be returned.
This is the output of the above code.
    #mw-head
    Jump to navigation
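Since the first hyperlink on the page has no href at all, it can help to see how the two lookup styles behave when an attribute is missing. The sketch below uses a two-link snippet modelled on the output above; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Two links modelled on the page: the first has no href, the second does.
html = (
    '<a id="top"></a>'
    '<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Collect (href, text) pairs, skipping links without an href.
hrefs = []
for link in soup.find_all("a"):
    href = link.get("href")   # None when the attribute is absent
    if href is not None:
        hrefs.append((href, link.get_text()))

# The bracket form raises KeyError for a missing attribute.
try:
    soup.a["href"]
    missing_raises = False
except KeyError:
    missing_raises = True

print(hrefs)
print(missing_raises)
```

When looping over every link on a real page, get() is usually the safer choice because not every <a> element carries an href.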
Dealing with Paragraphs
Along with hyperlinks, paragraphs are one of the most common tags found on pages. There are rarely any tag-specific functions. The find_all() function and the get_text() function will work on almost all tags. (Not all elements have text, so get_text() would simply return an empty string for those.) We’ll demonstrate the use of some functions below.
    print(soup.p.get_text())

    X = soup.find_all('p')
    count = 0
    for i in X:
        count += 1
    print(count)
Here we used find_all('p') to count all the paragraphs, and soup.p.get_text() to print out the text of the first paragraph. We could also loop through the result of find_all('p') and call get_text() on each paragraph, but then we would have the text of the whole article printed out on screen.
    Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
    31
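If you do want to loop over paragraphs without flooding the screen, one option is to cap the number of results. The sketch below assumes a small invented snippet and uses the limit argument of find_all(), which stops the search after the given number of matches.

```python
from bs4 import BeautifulSoup

# Invented three-paragraph snippet standing in for a real page.
html = """
<p>First paragraph about <a href="#">web scraping</a>.</p>
<p>   Second paragraph.   </p>
<p>Third paragraph.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# limit=2 stops the search after two matching <p> elements.
previews = [p.get_text().strip() for p in soup.find_all("p", limit=2)]

for preview in previews:
    print(preview)
```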
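The find() function is used to locate the first element that meets a certain condition. You can use any identifying trait to locate an element. For instance, find(class_="title") will return the first element with the class title. The keyword is written class_ with a trailing underscore because class is a reserved word in Python. To get every matching element instead of just the first, use find_all() with the same arguments.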
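Here is a short sketch of find() with a few different identifying traits. The markup reuses the example.com links from earlier in this article plus two invented paragraphs; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<p class="title">Heading text</p>
<p class="body">Body text</p>
<a href="http://example.com/object1" id="link1">Random Link 1</a>
<a href="http://example.com/object3" id="link3">Random Link 3</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find(class_="title")   # class_ because class is reserved
by_id = soup.find(id="link3")          # match on the id attribute
by_tag_and_attr = soup.find("a", href="http://example.com/object1")
no_match = soup.find("table")          # find() returns None on no match

print(by_class.get_text())
print(by_id["href"])
print(by_tag_and_attr["id"])
print(no_match)
```

Because find() returns None when nothing matches, it is worth checking the result before calling get_text() or reading attributes from it.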
While web scraping you’ll come across many different hurdles. It’s important to remain flexible and adapt accordingly.
For one, many websites don’t appreciate the presence of bots and have countermeasures set up, such as CAPTCHAs (there are ways to get around this though). This is also a reminder for you to be careful about where you web-scrape. For instance, a site like Wikipedia is a good place to practice: the data on Wikipedia is freely available, so scraping a page or two for practice shouldn’t cause any issues. Web scraping is typically done on sites with publicly available data, such as weather forecasts, sports results etc.
This marks the end of our Python BeautifulSoup (bs4) Guide. Hope you liked it. Let us know if you have any questions or suggestions to make. Any response helps grow CodersLegacy.