Web scraping is a technique for extracting large amounts of data from websites and storing it in a local file or a database in a structured format such as a spreadsheet. The web is an ocean of information, and at some point most of us need to gather statistics from it. Many tools exist to scrape data from websites in a short span of time, and Python is one of them. Python is an open source programming language, and two of its modules are commonly used for scraping data:
- Urllib2 – A module for fetching URLs. It defines functions and classes that help with URL actions such as redirections, cookies, and basic and digest authentication. In Python 3, this functionality lives in urllib.request.
- Beautiful Soup – A library for pulling information out of web pages. It can extract lists, paragraphs, tables, and text, and it supports various filters that make the extraction easy and convenient.
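As a rough illustration of how the two modules fit together, the sketch below fetches a page and hands the HTML to Beautiful Soup. It assumes Python 3, where urllib2 has become urllib.request, and that the bs4 package is installed. The helper names fetch_html and page_title, the User-Agent string, and the timeout value are all hypothetical choices made for this example.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a URL and return its body as text (hypothetical helper)."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def page_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else None
```

Calling page_title(fetch_html(url)) with the address of a real page would return that page's title. Before scraping a site, it is worth checking its terms of use and robots.txt.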
How to scrape data from a website using Beautiful Soup:
- Import the necessary libraries
- Use the function “prettify” to look at the nested structure of the HTML page
- Work with HTML tags
  - <tag> returns the first matching tag, including its opening and closing markup
  - <tag>.string returns the text inside that tag
- Find all the links within the page’s <a> tags
- Find the right table
- Extract the information to a DataFrame
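The steps above can be sketched roughly as follows. To keep the example self-contained it parses a small inline HTML string instead of a live page. The sample markup, the "prices" class name, and the item data are made up for illustration, and pandas is assumed to be installed for the DataFrame step.

```python
# Step 1: import the necessary libraries.
from bs4 import BeautifulSoup
import pandas as pd

# Made-up sample page standing in for a downloaded document.
SAMPLE_HTML = """
<html>
  <head><title>Price list</title></head>
  <body>
    <a href="/about">About</a>
    <a href="/contact">Contact</a>
    <table class="prices">
      <tr><th>Item</th><th>Price</th></tr>
      <tr><td>Apple</td><td>10</td></tr>
      <tr><td>Pear</td><td>12</td></tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Step 2: prettify() returns the document as an indented string,
# which makes the nesting of the tags easy to read.
pretty = soup.prettify()

# Step 3: soup.<tag> gives the first matching tag, .string gives its text.
title_tag = soup.title
title_text = soup.title.string

# Step 4: collect the href attribute of every <a> tag.
links = [a.get("href") for a in soup.find_all("a")]

# Step 5: pick the right table, here by its (hypothetical) class name.
table = soup.find("table", class_="prices")

# Step 6: turn the header cells and data rows into a DataFrame.
header = [th.get_text(strip=True) for th in table.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")
    if tr.find_all("td")  # skip the header row, which has no <td> cells
]
df = pd.DataFrame(rows, columns=header)
```

Note that the cell values come out as strings. A column such as Price would need an explicit conversion, for example with pd.to_numeric, before doing arithmetic on it.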
Code written with Beautiful Soup is usually more robust than code written with regular expressions, which has to be altered whenever the web pages change. On the other hand, code written with regular expressions is usually faster, by a factor of 100, while giving the same outcome. Beautiful Soup also supports various other types of scraping and reduces the manual effort needed to scrape data from a website. Starting from a tag, one can also use attributes like .contents, .parents, .descendants, .previous_sibling, and .next_sibling to navigate to related elements. This is helpful for scraping data from several web pages effectively.
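A minimal sketch of these navigation attributes, again on a made-up inline snippet, might look like this. The markup deliberately has no whitespace between tags, because otherwise the sibling attributes would return the whitespace text nodes that sit between the elements.

```python
from bs4 import BeautifulSoup

# Made-up markup, written without whitespace between the tags.
html = "<div><p>first</p><p>second</p><p>third</p></div>"
soup = BeautifulSoup(html, "html.parser")

second = soup.find_all("p")[1]

# .contents: the direct children of a tag, as a list.
child_names = [child.name for child in soup.div.contents]

# .parents: every ancestor of a tag, nearest first.
ancestor_names = [parent.name for parent in second.parents]

# .descendants: all nested nodes, both tags and the strings inside them.
descendant_texts = [
    node for node in soup.div.descendants if isinstance(node, str)
]

# .previous_sibling / .next_sibling: neighbours at the same nesting level.
previous_text = second.previous_sibling.string
next_text = second.next_sibling.string
```

Here child_names holds three "p" entries, ancestor_names starts with "div", and the two sibling lookups give the text of the first and third paragraphs.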