10 Most Important Functions in BeautifulSoup

Beautiful Soup is a Python library for parsing HTML and XML documents and extracting data from them, which makes it a common choice for web scraping. In this article, we will look at 10 of the most important BeautifulSoup functions and how to use them to parse data.

1. BeautifulSoup()

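The BeautifulSoup() constructor creates a Beautiful Soup object, which represents the parsed HTML/XML document. It takes two main arguments: the document (as a string or an open file handle) and the name of the parser to use. The parser argument is optional. If you leave it out, Beautiful Soup picks the best parser installed on your system (lxml, then html5lib, then Python's built-in html.parser) and issues a warning, so the same code can behave differently on different machines. Naming the parser explicitly avoids this.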

from bs4 import BeautifulSoup

html_doc = """
<html>
  <head>
    <title>The Title</title>
  </head>
  <body>
    <p class='text'>Some text.</p>
  </body>
</html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

In this example, we are creating a Beautiful Soup object from an HTML string using Python's built-in html.parser. Printing the soup object shows the HTML it currently holds.
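The constructor also accepts an open file handle in place of a string. The short sketch below illustrates this with an in-memory file (io.StringIO stands in for a real file here, and the variable names are only for illustration) and checks that both inputs produce the same parsed document.

```python
import io
from bs4 import BeautifulSoup

html_doc = "<html><head><title>The Title</title></head><body><p class='text'>Some text.</p></body></html>"

# Parse the same markup from a string and from a file-like object.
soup_from_string = BeautifulSoup(html_doc, "html.parser")
soup_from_file = BeautifulSoup(io.StringIO(html_doc), "html.parser")

print(soup_from_string.title.string)  # The Title
print(str(soup_from_string) == str(soup_from_file))  # True
```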

2. find()

The find() function returns the first tag in the HTML/XML document that matches the given filters. Its two most commonly used arguments are the name of the tag and a dictionary of attributes the tag must have; both are optional. If nothing matches, find() returns None.

from bs4 import BeautifulSoup

html_doc = """
<html>
  <head>
    <title>The Title</title>
  </head>
  <body>
    <p class='text'>Some text.</p>
  </body>
</html>"""

soup = BeautifulSoup(html_doc, 'html.parser')
p_tag = soup.find('p', {'class': 'text'})
print(p_tag)

<p class="text">Some text.</p>

In this example, we are finding the first occurrence of the p tag with the class attribute set to 'text'.
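Because find() returns None when nothing matches, it is usually worth checking the result before using it. The sketch below, which uses a shortened document and illustrative variable names, also shows the class_ keyword as an alternative to the attribute dictionary.

```python
from bs4 import BeautifulSoup

html_doc = "<body><p class='text'>Some text.</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

# Two equivalent ways to filter on the class attribute.
by_dict = soup.find("p", {"class": "text"})
by_keyword = soup.find("p", class_="text")
print(by_dict == by_keyword)  # True

# There is no <table> in the document, so find() returns None.
table_tag = soup.find("table")
if table_tag is None:
    print("No table found")
else:
    print(table_tag.get_text())
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.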

3. find_all()

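The find_all() function returns a list of all the tags in the HTML/XML document that match the given filters. It takes the same arguments as find(), plus an optional limit on the number of results. If nothing matches, it returns an empty list.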

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
        <title>The Title</title>
    </head>
    <body>
        <p class='text'>Some text.</p>
        <p class='text'>More text.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, 'html.parser')
p_tags = soup.find_all('p', {'class': 'text'})
print(p_tags)

[<p class="text">Some text.</p>, <p class="text">More text.</p>]

In this example, we are finding all occurrences of the p tag with the class attribute set to 'text'.
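A few common variations on find_all() are sketched below: looping over the results, capping the number of matches with limit, and passing a list of tag names. The document and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<body>
  <p class='text'>Some text.</p>
  <p class='text'>More text.</p>
  <span class='text'>Some more text.</span>
</body>"""
soup = BeautifulSoup(html_doc, "html.parser")

# Collect the text of every matching <p> tag.
texts = [tag.get_text() for tag in soup.find_all("p", class_="text")]
print(texts)  # ['Some text.', 'More text.']

# limit stops the search after the given number of matches.
first_only = soup.find_all("p", limit=1)
print(len(first_only))  # 1

# A list of names matches any of the listed tags.
p_and_span = soup.find_all(["p", "span"])
print(len(p_and_span))  # 3

# No match gives an empty list rather than None.
no_tables = soup.find_all("table")
print(len(no_tables))  # 0
```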

4. get_text()

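The get_text() function returns the text content of a tag, including the text of all the tags nested inside it, as a single string. It can be called with no arguments, but it also accepts an optional separator string and a strip flag.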

from bs4 import BeautifulSoup

html_doc = """
<html>
  <head>
    <title>The Title</title>
  </head>
  <body>
    <p class='text'>Some text.</p>
  </body>
</html>"""

soup = BeautifulSoup(html_doc, 'html.parser')
p_tag = soup.find('p', {'class': 'text'})
text = p_tag.get_text()

print(text)

Some text.

In this example, we are getting the text content of the p tag we found earlier.
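When a tag contains several nested tags, get_text() joins their text together with nothing in between. The optional separator and strip arguments control how the pieces are joined, as in this small sketch (the document here is a shortened, made-up example).

```python
from bs4 import BeautifulSoup

html_doc = "<div><p>Some text.</p><p>More text.</p></div>"
soup = BeautifulSoup(html_doc, "html.parser")
div_tag = soup.find("div")

# Default: the text of both paragraphs is concatenated directly.
joined = div_tag.get_text()
print(joined)  # Some text.More text.

# separator goes between the pieces; strip=True trims whitespace from each piece.
spaced = div_tag.get_text(separator=" ", strip=True)
print(spaced)  # Some text. More text.
```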

5. get()

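The get() function returns the value of an attribute of a tag. It takes the name of the attribute and, optionally, a default value to return when the attribute is missing. If no default is given and the attribute is missing, it returns None.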

from bs4 import BeautifulSoup

html_doc = """
<html>
  <head>
    <title>The Title</title>
  </head>
  <body>
    <p class='text'>Some text.</p>
  </body>
</html>"""

soup = BeautifulSoup(html_doc, 'html.parser') 
p_tag = soup.find('p', {'class': 'text'}) 
class_attribute = p_tag.get('class')

print(class_attribute)

['text']

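In this example, we are getting the value of the class attribute of the p tag we found earlier. The result is a list because class is a multi-valued attribute in HTML. get() works for other attributes such as "href" and "id" as well, which are returned as plain strings.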
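The sketch below shows get() on a link with several attributes, including one that is missing. The markup and the example.com URL are placeholders chosen for illustration.

```python
from bs4 import BeautifulSoup

html_doc = "<a class='link external' href='https://example.com' id='home'>Home</a>"
soup = BeautifulSoup(html_doc, "html.parser")
a_tag = soup.find("a")

# Single-valued attributes come back as strings.
print(a_tag.get("href"))  # https://example.com

# class can hold several values, so it comes back as a list.
print(a_tag.get("class"))  # ['link', 'external']

# A missing attribute gives None, or the default if one is supplied.
print(a_tag.get("title"))  # None
print(a_tag.get("title", "no title"))  # no title
```

Dictionary-style access such as a_tag["title"] raises a KeyError for a missing attribute, which is why get() is the safer choice when an attribute may be absent.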

6. find_parent()

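The find_parent() function finds the closest ancestor of a given tag that matches the given filters. It takes the same arguments as find(). Called with no arguments, it returns the tag's direct parent.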

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
        <title>The Title</title>
    </head>
    <body>
        <div>
            <p class='text'>Some text.</p>
        </div>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, 'html.parser') 
p_tag = soup.find('p', {'class': 'text'}) 
div_tag = p_tag.find_parent('div')

print(div_tag)

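<div>
            <p class="text">Some text.</p>
        </div>

In this example, we are finding the parent div tag of the p tag we found earlier. The printed div keeps the indentation of the original document, because the whitespace inside the div is part of its content.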
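find_parent() is not limited to the direct parent: it walks up the tree until it finds a match. The sketch below uses a made-up document with an extra section tag between the div and the p to show the difference from the .parent attribute.

```python
from bs4 import BeautifulSoup

html_doc = "<body><div id='outer'><section><p class='text'>Some text.</p></section></div></body>"
soup = BeautifulSoup(html_doc, "html.parser")
p_tag = soup.find("p", class_="text")

# .parent is always the direct parent.
print(p_tag.parent.name)  # section

# find_parent() skips the <section> and returns the closest matching ancestor.
div_tag = p_tag.find_parent("div")
print(div_tag.get("id"))  # outer

# No matching ancestor gives None.
table_tag = p_tag.find_parent("table")
print(table_tag)  # None
```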

7. find_next_sibling()

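The find_next_sibling() function finds the next sibling of a given tag that matches the given filters. A sibling is an element at the same level of the tree, under the same parent. It takes the same arguments as find().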

from bs4 import BeautifulSoup

html_doc = """
<html>
    <head>
        <title>The Title</title>
    </head>
    <body>
        <p class='text'>Some text.</p>
        <p class='text'>More text.</p>
    </body>
</html>"""

soup = BeautifulSoup(html_doc, 'html.parser') 
p_tag = soup.find('p', {'class': 'text'}) 
next_p_tag = p_tag.find_next_sibling('p')

print(next_p_tag)

<p class="text">More text.</p>

In this example, we are finding the next p tag that comes after the p tag we found earlier.
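One reason to prefer find_next_sibling() over the plain .next_sibling attribute is that the whitespace between tags counts as a sibling too. The following sketch, with a shortened document and illustrative names, shows the difference.

```python
from bs4 import BeautifulSoup

html_doc = "<body>\n<p class='text'>Some text.</p>\n<p class='text'>More text.</p>\n</body>"
soup = BeautifulSoup(html_doc, "html.parser")
first_p = soup.find("p")

# The direct next sibling is the newline between the two tags, not a tag.
print(repr(first_p.next_sibling))  # '\n'

# find_next_sibling() skips over the text and returns the next matching tag.
second_p = first_p.find_next_sibling("p")
print(second_p.get_text())  # More text.

# The last <p> has no following <p> sibling, so the result is None.
after_last = second_p.find_next_sibling("p")
print(after_last)  # None
```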

8. find_all_next()

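The find_all_next() function finds all the tags that come after a given tag in the HTML/XML document, in document order. It takes the same filter arguments as find_all(), all of which are optional. The search is not limited to siblings: it covers everything later in the document, including tags nested inside the starting tag.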

from bs4 import BeautifulSoup

html_doc = """
<html>
  <head>
    <title>The Title</title>
  </head>
  <body>
    <div>
      <p class='text'>Some text.</p>
      <p class='text'>More text.</p>
      <span class='text'>Some more text.</span>
    </div>
  </body>
</html>"""

soup = BeautifulSoup(html_doc, 'html.parser') 
p_tag = soup.find('p', {'class': 'text'}) 
next_tags = p_tag.find_all_next()

print(next_tags)

[<p class="text">More text.</p>, <span class="text">Some more text.</span>]

In this example, we are finding all the tags that come after the p tag we found earlier.
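To illustrate how find_all_next() differs from a sibling search, the sketch below moves the span outside the div (a made-up variation of the document above). The span is no longer a sibling of the first p tag, yet find_all_next() still reaches it.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<div><p class='text'>Some text.</p><p class='text'>More text.</p></div>"
    "<span class='text'>Some more text.</span>"
)
soup = BeautifulSoup(html_doc, "html.parser")
first_p = soup.find("p", class_="text")

# Everything later in the document, regardless of nesting level.
following_names = [tag.name for tag in first_p.find_all_next()]
print(following_names)  # ['p', 'span']

# A filter narrows the result, just as with find_all().
following_spans = first_p.find_all_next("span")
print(following_spans[0].get_text())  # Some more text.

# Siblings only: the span is outside the div, so it is not included.
sibling_names = [tag.name for tag in first_p.find_next_siblings()]
print(sibling_names)  # ['p']
```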

9. select()

The select() function selects tags using CSS selectors, which makes it one of the most flexible ways to search a document. It takes the CSS selector as a string and returns a list of all matching tags.

from bs4 import BeautifulSoup

html_doc = """

<html>
    <head>
        <title>The Title</title>
    </head>
    <body>
        <div>
            <p class='text'>Some text.</p>
        </div>
        <div>
            <p class='text'>More text.</p>
        </div>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser') 
p_tags = soup.select('div > p.text')
print(p_tags)

[<p class="text">Some text.</p>, <p class="text">More text.</p>]

In this example, we are selecting all the p tags with the class attribute set to 'text' that are direct children of a div tag. The > combinator matches children only, not tags nested more deeply.
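The difference between the child combinator (>) and the descendant combinator (a space) can be seen in the sketch below, which uses a made-up document with one p tag nested one level deeper. It also shows select_one(), which returns only the first match.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<div>"
    "<p class='text'>Direct child.</p>"
    "<section><p class='text'>Nested deeper.</p></section>"
    "</div>"
)
soup = BeautifulSoup(html_doc, "html.parser")

# "div > p.text": only <p> tags that are direct children of a <div>.
children = soup.select("div > p.text")
print(len(children))  # 1

# "div p.text": <p> tags anywhere inside a <div>.
descendants = soup.select("div p.text")
print(len(descendants))  # 2

# select_one() returns the first match, or None if there is none.
first_match = soup.select_one("p.text")
print(first_match.get_text())  # Direct child.
no_match = soup.select_one("table")
print(no_match)  # None
```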

10. prettify()

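The prettify() function returns the HTML/XML document as a string with each tag and each piece of text on its own line, indented to show nesting, which makes it more human-readable. It can be called with no arguments. Because it adds whitespace, it is best used for inspecting a document rather than for producing final output.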

from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Title</title></head><body><p class='text'>Some text.</p></body></html> """

soup = BeautifulSoup(html_doc, 'html.parser') 
prettified_html = soup.prettify()
print(prettified_html)

<html>
 <head>
  <title>
   The Title
  </title>
 </head>
 <body>
  <p class="text">
   Some text.
  </p>
 </body>
</html>

In this example, we are making the HTML document more human-readable using the prettify() function.
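prettify() can also be called on a single tag instead of the whole document, which is handy when inspecting one part of a large page. A minimal sketch, using a shortened document and illustrative names:

```python
from bs4 import BeautifulSoup

html_doc = "<html><body><p class='text'>Some text.</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")
p_tag = soup.find("p")

# str() gives the tag on one line, as it appears in the parsed document.
compact = str(p_tag)
print(compact)  # <p class="text">Some text.</p>

# prettify() spreads the same tag over several indented lines.
pretty = p_tag.prettify()
print(pretty)
```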

Conclusion

Beautiful Soup is a powerful tool for web scraping in Python. In this article, we have covered 10 of its most important functions and how to use them to parse data from HTML and XML files. These are just a few of the many functions Beautiful Soup provides, but mastering them gives you a solid foundation for web scraping with Python.

This marks the end of the "10 Most Important Functions in BeautifulSoup" article. Any suggestions or contributions for CodersLegacy are more than welcome. Questions regarding the tutorial content can be asked in the comments section below.
