Web Scraping

From Grundy

Imagine you are looking for a job online. There are hundreds of websites, each listing many job profiles, but you only want to shortlist the companies and profiles that interest you. One approach is to visit each website manually, search for relevant job profiles, read through all of them and note down the companies you are interested in. But what if a program could do all this for you and return just the list of offers that match your preferences? Web scraping is the technique of automating this process: instead of copying data from websites by hand, web scraping software performs the same task in a fraction of the time.

There are mainly two ways to extract data from a website:

  • Extract data by accessing and parsing the HTML content of the website. This technique is called web scraping, web harvesting or web data extraction.
  • Use an API. Many websites provide APIs that allow you to access their data in a predefined manner (see the next section).

Web scraping can be used to compare product reviews and prices across e-commerce sites, or to monitor social media for the latest trends and hashtags. You can also automate your browser to do tasks such as buying your favourite band's concert tickets as soon as they go on sale, notifying you when your exam results are available, and much more.

This section covers how to implement web scraping using Python.

Prerequisites

A basic knowledge of HTML and Python is recommended, although Java and C# are also used by many web scrapers. Check out our Absolute Newbie and First Programming Language guide for more.

Steps involved in web scraping

Scraping a website involves a number of steps. The advantage of using Python is that it has many libraries and modules that can be used for the different steps of web scraping, as well as for further manipulation of the extracted data. The easiest way to install external libraries in Python is to use pip, a package management system used to install and manage software packages written in Python.

  • Access the HTML content of the website of your choice. To do this, send an HTTP request to the URL of the website; the response contains the HTML content. The Python Requests module is used for this purpose, as it makes sending HTTP requests such as GET and POST very easy. To install the Requests module using pip, use pip install requests
  • Parse the HTML content, i.e. analyse it and identify the parts that contain the relevant data. For this purpose, another Python library comes into use: Beautiful Soup. Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML. To install Beautiful Soup using pip, use pip install bs4 or pip install beautifulsoup4
  • Inspect the HTML structure of the website. With the HTML in hand and Beautiful Soup's functions available to parse it, the important step is to figure out what data you want to extract and which HTML tags surround that data. Then use this knowledge to write your code and extract the required data.
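The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a complete scraper: the function names fetch_html and extract_job_titles, the job-title class name, the sample markup and the example.com URL are all assumptions made up for this example, and a real website will need selectors that match its own HTML structure.

```python
# Third-party packages: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Step 1: send an HTTP GET request and return the page's HTML as text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def extract_job_titles(html):
    """Steps 2 and 3: parse the HTML and pull out the tags we care about.

    Assumes (hypothetically) that each job title sits in <h2 class="job-title">.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True)
            for tag in soup.find_all("h2", class_="job-title")]


# Hypothetical markup standing in for a real job listing page.
SAMPLE_HTML = """
<html><body>
  <div class="job"><h2 class="job-title">Data Engineer</h2></div>
  <div class="job"><h2 class="job-title"> Backend Developer </h2></div>
  <h2>Not a job</h2>
</body></html>
"""

if __name__ == "__main__":
    # For a live page you would instead call, for example:
    #   html = fetch_html("https://example.com/jobs")  # placeholder URL
    print(extract_job_titles(SAMPLE_HTML))
```

Keeping the fetching and the parsing in separate functions makes it possible to try out the parsing logic on a saved or hand-written HTML snippet before sending requests to a real website.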

Example - Follow this link for a complete tutorial on web scraping - starting from inspecting the website to finally writing the code to extract the data. You may choose to skip the introduction section given on that website and directly start from here

Resources

  • This is an amazing tutorial to help you get started. Brownie points for doing the Practice Projects mentioned at the end.
  • Beautiful Soup is a tool that can be used to easily parse HTML code.
  • When websites are dynamic and require some sort of interaction (clicking, hovering, entering text) to reveal data, browser automation comes in handy. Selenium is one of the most widely used browser automation tools. Check out the excellent unofficial documentation on Selenium.
  • This link contains an exhaustive list of tools and libraries used in browser automation and web scraping using Python. You can also check out the original repository to get information about the tools and libraries used for web scraping in other languages.

Disclaimer

  • Some websites prohibit the use of robots (i.e. web scrapers) to gather information from them, so it is best to read the website's user agreement before scraping it.
  • Here is a helpful article concerning the legality of web scraping.
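Besides the user agreement, many websites publish a robots.txt file stating which paths automated clients may visit. As a rough sketch, Python's standard urllib.robotparser module can check a URL against such rules. The robots.txt content, the my-scraper user agent name and the example.com URLs below are made up for illustration, and passing this check does not replace reading the website's terms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text, no network needed
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/jobs", "https://example.com/private/data"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "my-scraper", url))
```

For a live website, RobotFileParser can also download the file itself through its set_url and read methods instead of parsing a string.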

See also