Beautiful Soup

From Grundy


Beautiful Soup is a Python library used for web scraping, which refers to programmatically extracting data from web pages. The bs4.BeautifulSoup constructor is called with a string (or an open file) that contains all the HTML code you need to parse, and it returns a BeautifulSoup object. The advantage of storing the HTML code in this form is that the select() method makes it very easy to extract data that resides in any element of the page, such as every element of a particular class, or an unordered list lying inside a div in a section in the body.
I personally found this tutorial to be very helpful; it can be followed easily even by someone with no previous knowledge of parsing.
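To make the select() description above concrete, here is a minimal sketch. The HTML snippet, its class names and the variable names are all made up for this illustration, and it uses Python's built-in "html.parser" rather than lxml so that nothing beyond bs4 needs to be installed.

```python
import bs4

# Hypothetical HTML snippet, made up for this illustration.
html = """
<html><body>
  <section>
    <div>
      <ul>
        <li class="fruit">apple</li>
        <li class="fruit">pear</li>
        <li class="veg">carrot</li>
      </ul>
    </div>
  </section>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser has to be installed.
soup = bs4.BeautifulSoup(html, "html.parser")

# Every <li> inside a <ul> inside a <div> inside a <section> inside <body>.
items = soup.select("body section div ul li")
item_texts = [li.get_text() for li in items]

# Only the elements carrying the class "fruit".
fruit_texts = [li.get_text() for li in soup.select(".fruit")]

print(item_texts)
print(fruit_texts)
```

select() takes a CSS selector, so a space-separated chain of tag names means "nested somewhere inside", and a leading dot selects by class.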

Prerequisites

All one needs to use this really cool library are the bare basics of Python (or any programming language, for that matter) and the ability to recognize patterns in HTML code. Proficiency in CSS, HTML and web development is not a must.

An Example on the Use of Beautiful Soup

The best way to understand this library is to follow an example, so I will walk through the program I used it for. I needed to go through all the songs of a particular artist and find out how often a particular word was used. To do so, I first obtained the HTML code of the artist's home page (http://www.metrolyrics.com/adele-lyrics.html) and stored it in a file (code.html) in the working directory.
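The page can be saved by hand from a browser, but the download step can also be scripted. The following is only a sketch using the standard library: save_page is a hypothetical helper name, and it assumes the site answers a plain request, which some sites refuse for automated clients.

```python
import urllib.request


def save_page(url, path):
    """Download the page at `url` and write its raw bytes to `path`.

    Returns the number of bytes written.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()
    with open(path, "wb") as out:
        out.write(data)
    return len(data)


# Example call (needs network access, so it is left commented out):
# save_page("http://www.metrolyrics.com/adele-lyrics.html", "code.html")
```

Writing the bytes unchanged leaves the question of character encoding to the parser that reads the file later.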
First, you must have bs4 installed, along with lxml, the parser used in this example:
sudo pip install bs4 lxml

Next, you must import the bs4 module, which provides the BeautifulSoup class:
import bs4

Now that you have the code stored in the file code.html, open it as a file object:
file = open("code.html")

We can now create the BeautifulSoup object:
soup = bs4.BeautifulSoup(file, "lxml")

Now we must take a look at the HTML code. We need a list of the URLs of this artist's songs, and on looking through the code we find that each one is placed in an 'a' element within a 'td' element within a 'tr' element within a 'tbody' element. We can create a list of all such elements:
songslist = soup.select('tbody tr td a')

The href attribute of each element of this list is the URL of one of the artist's songs. We can go through the list and, for each index ii, extract the URL:
url = songslist[ii].get('href')
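Going through the whole list can be sketched as a loop. The table below is a hypothetical stand-in for code.html (the real page's markup may differ), the example.com URLs are placeholders, and "html.parser" is used only to keep the sketch free of extra dependencies.

```python
import bs4

# Hypothetical stand-in for code.html; the real page's markup may differ.
html = """
<table><tbody>
  <tr><td><a href="http://example.com/song-one.html">Song One</a></td></tr>
  <tr><td><a href="http://example.com/song-two.html">Song Two</a></td></tr>
  <tr><td><a>No link here</a></td></tr>
</tbody></table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
songslist = soup.select("tbody tr td a")

urls = []
for link in songslist:
    url = link.get("href")  # get() returns None when the attribute is absent
    if url is not None:
        urls.append(url)

print(urls)
```

Using get('href') rather than indexing with ['href'] means an 'a' element without a link is skipped instead of raising a KeyError.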

Having obtained the link for any particular song, we have to review the HTML code of a general song page and figure out how to extract the lyrics in order to count the frequency of any word.
This was a very basic example but should suffice to solve a variety of cool tasks quickly. Hope you enjoyed it!
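For the word-counting step mentioned above, once the lyrics of a song are available as plain text, the frequency of a word can be computed with the standard library alone. This is a sketch under assumptions: count_word is a hypothetical helper, the sample text is made up, and a "word" is taken to be a run of ASCII letters and apostrophes.

```python
import re
from collections import Counter


def count_word(text, word):
    """Count case-insensitive whole-word occurrences of `word` in `text`."""
    # Treat a word as a run of lowercase letters and apostrophes (an assumption).
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)[word.lower()]


# Made-up sample text standing in for the lyrics of one song.
lyrics = "Rain, rain, go away. The rain won't stay."
print(count_word(lyrics, "rain"))
```

Summing the result over every song page gives the total for the artist.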

Resources

  • As I mentioned before, this tutorial is very helpful and should serve your purpose.
  • This is the official documentation for the library and contains all information on all the member functions and how to use them best.
  • Here is another helpful link.

See also