Web scraping is a technique for extracting large amounts of data from websites. The term "scraping" refers to obtaining information from another source (web pages) and saving it to a local file. For example, suppose you are working on a project called "Phone comparison website," where you need the prices, ratings, and model names of mobile phones in order to compare them.
If you collect these details by checking various sites manually, it will take a lot of time. In that case, web scraping plays an important role: by writing a few lines of code, you can get the desired results.
Web scraping extracts data from websites in an unstructured format and helps collect that data and convert it into a structured form.
Startups prefer web scraping because it is a cheap and effective way to get a large amount of data without partnering with a data-selling company.
Is Web Scraping legal?
Here, the question arises whether web scraping is legal or not. The answer is that some sites allow it and others do not. Web scraping is just a tool; you can use it in the right way or the wrong way.

Web scraping becomes illegal when someone tries to scrape nonpublic data. Nonpublic data is not accessible to everyone; if you try to extract such data, you violate the law and, usually, the website's terms of service.
There are several tools available to scrape data from websites, such as:
- Scraping-bot
- Scraper API
- Octoparse
- Import.io
- Webhose.io
- Dexi.io
- Outwit
- Diffbot
- Content Grabber
- Mozenda
- Web Scraper Chrome Extension
Why Web Scraping?
As we have discussed above, web scraping is used to extract data from websites. But we should know how to use that raw data. That raw data can be used in various fields. Let’s have a look at the usage of web scraping:
Dynamic Price Monitoring
It is widely used to collect data from several online shopping sites, compare product prices, and make profitable pricing decisions. Price monitoring with web-scraped data gives companies visibility into market conditions and facilitates dynamic pricing, helping them stay ahead of their competitors.
Market Trend Analysis
Web scraping is well suited to market trend analysis, that is, gaining insights into a particular market. Large organizations require a great deal of data, and web scraping supplies that data with a high level of reliability and accuracy.
Email Gathering
Many companies gather email addresses through web scraping and use them for email marketing, which lets them target a specific audience.
News and Content Monitoring
A single news cycle can be a great opportunity or a genuine threat to your business. If your company depends on news analysis, or frequently appears in the news, web scraping provides a practical way to monitor and parse the most critical stories. News articles and social media posts can directly influence the stock market.
Social Media Scraping
Web scraping plays an essential role in extracting data from social media websites such as Twitter, Facebook, and Instagram to find trending topics.
Research and Development
Large data sets, such as general information, statistics, and temperature data, are scraped from websites, then analyzed and used to carry out surveys or research and development.
Why use Python for Web Scraping?
There are other popular programming languages, so why do we choose Python for web scraping? Below, we describe the features of Python that make it one of the most useful programming languages for this task.
Dynamically Typed
In Python, we don't need to declare data types for variables; we can assign and use a variable wherever it is required. This saves time and speeds up development, because Python works out the type of a variable at runtime.
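For instance, the same variable can hold values of different types, and Python works out the type at runtime:
count = 10
print(type(count))    # <class 'int'>
count = "ten"
print(type(count))    # <class 'str'>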
A vast collection of libraries
Python comes with an extensive range of libraries, such as NumPy, Matplotlib, Pandas, and SciPy, that serve a wide variety of purposes. It is suited to almost every emerging field, including web scraping, where libraries handle both data extraction and manipulation.
Less Code
The purpose of web scraping is to save time. But what if you spend more time writing the code? That’s why we use Python, as it can perform a task in a few lines of code.
Open-Source Community
Python is open-source, which means it is freely available for everyone. It has one of the biggest communities across the world where you can seek help if you get stuck anywhere in Python code.
The basics of web scraping
Web scraping consists of two parts: a web crawler and a web scraper. In simple words, the web crawler is a horse, and the scraper is the chariot: the crawler leads the way, and the scraper extracts the requested data. Let's understand these two components of web scraping:
The crawler

A web crawler is generally called a "spider." It is a program that browses the internet to index and search for content by following links, looking for the information the programmer has asked for.
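As a rough illustration (not a production crawler), the sketch below starts from a seed URL and follows links breadth-first until a fixed page limit is reached; the seed URL and page limit are placeholders:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def crawl(seed_url, max_pages=10):
    visited = set()
    queue = [seed_url]
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                      # skip pages that fail to load
        page = BeautifulSoup(html, "html.parser")
        for link in page.find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))   # follow every link found
    return visited

# example usage: crawl("https://example.com", max_pages=5)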
The Scraper

A web scraper is a dedicated tool designed to extract data from several websites quickly and effectively. Web scrapers vary widely in design and complexity, depending on the project.
How does Web Scraping work?
The following steps describe how web scraping works.
Step 1: Find the URL that you want to scrape
First, understand your project's data requirements. A webpage or website contains a large amount of information, so scrape only the information that is relevant. In simple words, the developer should be familiar with the data requirements.
Step 2: Inspecting the Page
Inspect the page to see where the data you need lives. The data is delivered as raw HTML, which must be carefully parsed to reduce noise. In some cases, the data can be as simple as a name and address or as complex as high-dimensional weather and stock market data.
Step 3: Write the code
Write the code to extract the relevant information, and run it.
Step 4: Store the data in the file
Store that information in the required CSV, XML, or JSON file format.
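Putting the four steps together, a minimal sketch might look like the following; the URL, the tag being extracted, and the output filename are placeholders chosen for illustration:
import csv
import requests
from bs4 import BeautifulSoup

url = "https://example.com"                    # Step 1: the URL to scrape
html = requests.get(url).text                  # Step 2: fetch the raw HTML
soup = BeautifulSoup(html, "html.parser")      # Step 3: parse and extract
headings = [h.text.strip() for h in soup.find_all("h1")]

with open("output.csv", "w", newline="") as f: # Step 4: store the data
    writer = csv.writer(f)
    writer.writerow(["heading"])
    for heading in headings:
        writer.writerow([heading])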
Getting Started with Web Scraping
Python has a vast collection of libraries and provides several that are very useful for web scraping. Let's look at the libraries we will need.
Libraries for web scraping
There are several libraries for web scraping in Python. Some of them are as follows:
Selenium
Selenium is an open-source automated testing library. It is used to automate browser activities. To install it, type the following command in your terminal:
pip install selenium
Note: It is good to use the PyCharm IDE.
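As a quick sanity check that Selenium works, the following minimal sketch opens a page in Chrome and reads its title (it assumes Chrome is installed; Selenium 4.6+ downloads the matching driver automatically):
from selenium import webdriver

driver = webdriver.Chrome()          # launch Chrome
driver.get("https://example.com")    # load the page in a real browser
print(driver.title)                  # data rendered by the browser is now accessible
driver.quit()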

Pandas
The pandas library is used for data manipulation and analysis. In scraping projects, it helps organize the extracted data and store it in the desired format.
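For example, records collected while scraping can be loaded into a DataFrame and written to a file in a single call; the field names and filename below are illustrative:
import pandas as pd

# each scraped product becomes one row
records = [
    {"model": "Phone A", "price": 29999, "rating": 4.3},
    {"model": "Phone B", "price": 19999, "rating": 4.1},
]
df = pd.DataFrame(records)
df.to_csv("phones.csv", index=False)   # store the scraped data as a CSV file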
BeautifulSoup
BeautifulSoup is a Python library that is used to pull data out of HTML and XML files. It is mainly designed for web scraping. It works with a parser to provide natural ways of navigating, searching, and modifying the parse tree. At the time of writing, the latest version of BeautifulSoup is 4.8.1.
Let's look at the BeautifulSoup library in detail, starting with installation.
You can install BeautifulSoup by typing the following command:
pip install beautifulsoup4
Installing a parser
BeautifulSoup supports the HTML parser included in Python's standard library as well as several third-party Python parsers. You can install any of them depending on your needs. BeautifulSoup's parsers are listed below:
| Parser | Typical usage |
|---|---|
| Python's html.parser | BeautifulSoup(markup, "html.parser") |
| lxml's HTML parser | BeautifulSoup(markup, "lxml") |
| lxml's XML parser | BeautifulSoup(markup, "lxml-xml") |
| html5lib | BeautifulSoup(markup, "html5lib") |
We recommend installing the html5lib parser because it parses pages the same way a modern web browser does, or you can install the lxml parser, which is very fast.
Type the following command in your terminal:
pip install html5lib

BeautifulSoup Objects: Tag, Attributes, and NavigableString
BeautifulSoup transforms a complex HTML document into a complex tree of Python objects, but only a few essential types of object are used most of the time:
1. Tag
A Tag object corresponds to an XML or HTML tag in the original document.
import bs4
soup = bs4.BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
Output:
<class 'bs4.element.Tag'>
A tag contains a lot of attributes and methods, but the most important features of a tag are the name and the attribute.
2. Name
Every tag has a name, accessible as .name:
tag.name
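# 'b'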
3. Attributes
A tag may have any number of attributes. The tag <b class="boldest"> defined above has an attribute "class" whose value is "boldest". We can access a tag's attributes by treating the tag as a dictionary:
tag['class']
We can add, remove, and modify a tag's attributes, again by treating the tag as a dictionary:
# add attributes
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# delete the 'id' attribute
del tag['id']
4. Multi-valued Attributes
In HTML, some attributes can have multiple values. The class attribute (which can hold more than one CSS class) is the most common multi-valued attribute. Others include rel, rev, accept-charset, headers, and accesskey.
from bs4 import BeautifulSoup
class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']
# ['body', 'strikeout']
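For comparison, an HTML parser treats class as multi-valued by default, so no extra configuration is needed; a minimal sketch:
html_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
html_soup.p['class']
# ['body', 'strikeout']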
5. NavigableString
A string in BeautifulSoup refers to text within a tag. BeautifulSoup uses the NavigableString class to contain these bits of text.
tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>
A NavigableString is immutable, which means it can't be edited in place, but it can be replaced with another string using replace_with().
tag.string.replace_with("No longer bold")
tag
If you want to use a NavigableString outside BeautifulSoup, call str() on it (unicode() in Python 2) to turn it into a normal Python string.
6. BeautifulSoup object
The BeautifulSoup object represents the complete parsed document as a whole. In many cases, we can use it like a Tag object, which means it supports most of the methods described for navigating and searching the tree.
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(text="INSERT FOOTER HERE").replace_with(footer)
print(doc)
Output:
<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>
Web Scraping Example
Let's take an example to understand scraping in practice by inspecting a webpage and extracting data from it.
First, open your favorite page on Wikipedia and inspect it. Before extracting data from the webpage, make sure it contains the data you need. Consider the following code:
# importing the BeautifulSoup library
import bs4
import requests

# Creating the request
res = requests.get("https://en.wikipedia.org/wiki/Machine_learning")
print("The object type:", type(res))

# Convert the Response object into a BeautifulSoup object
soup = bs4.BeautifulSoup(res.text, 'html5lib')
print("Convert the object into:", type(soup))
Output:
The object type: <class 'requests.models.Response'>
Convert the object into: <class 'bs4.BeautifulSoup'>
In the following lines of code, we are extracting all headings of a webpage by class name. Here, front-end knowledge plays an essential role in inspecting the webpage.
soup.select('.mw-headline')
for i in soup.select('.mw-headline'):
    print(i.text, end=',')
Output:
Overview,Machine learning tasks,History and relationships to other fields,Relation to data mining,Relation to optimization,Relation to statistics, Theory,Approaches,Types of learning algorithms,Supervised learning,Unsupervised learning,Reinforcement learning,Self-learning,Feature learning,Sparse dictionary learning,Anomaly detection,Association rules,Models,Artificial neural networks,Decision trees,Support vector machines,Regression analysis,Bayesian networks,Genetic algorithms,Training models,Federated learning,Applications,Limitations,Bias,Model assessments,Ethics,Software,Free and open-source software,Proprietary software with free and open-source editions,Proprietary software,Journals,Conferences,See also,References,Further reading,External links,
Explanation:
In the above code, we imported the bs4 and requests libraries. We then created a res object to send a request to the webpage. As you can observe, we have extracted all the headings from the webpage.

(Screenshot: the Wikipedia "Machine learning" webpage)
Let's look at another example: we will make a GET request to a URL and create a parse tree object (soup) using BeautifulSoup and the "html5lib" parser.
Here, we will scrape the webpage at the given link. Consider the following code:
# importing the libraries
from bs4 import BeautifulSoup
import requests
url=""
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "html5lib")
print(soup.prettify()) # print the parsed data of html
The above code will display all the HTML code of the TpointTech homepage.
Using the BeautifulSoup object, i.e., soup, we can collect the required data. Let's print some interesting information using the soup object:
Let’s print the title of the web page.
print(soup.title)
Output:
<title>Tutorials List - Python App</title>
In the above output, the HTML tag is included with the title. If you want text without a tag, you can use the following code:
print(soup.title.text)
Output:
Tutorials List - Python App
We can get all the links on the page along with their attributes, such as href, title, and inner text. Consider the following code:
for link in soup.find_all("a"):
    print("Inner Text is: {}".format(link.text))
    print("Title is: {}".format(link.get("title")))
    print("href is: {}".format(link.get("href")))
Output:
href is: https://www.facebook.com/Python App
Inner Text is:
Title is: None
href is: https://twitter.com/Python App
Inner Text is:
Title is: None
href is: https://www.youtube.com/channel/UCUnYvQVCrJoFWZhKK3O2xLg
Inner Text is:
Title is: None
href is: https://Python App.blogspot.com
Inner Text is: Learn Java
Title is: None
href is: java-tutorial
Inner Text is: Learn Data Structures
Title is: None
href is: data-structure-tutorial
Inner Text is: Learn C Programming
Title is: None
href is: c-programming-language-tutorial
Inner Text is: Learn C++ Tutorial
Demo: Scraping Data from Flipkart Website
In this example, we will scrape the mobile phone prices, ratings, and model names from Flipkart, which is one of the popular e-commerce websites. The following are the prerequisites to accomplish this task:
Prerequisites:
- Python 2.x or Python 3.x with the Selenium, BeautifulSoup, and Pandas libraries installed (an install command is shown after this list).
- Google Chrome browser
- A parser, such as html.parser, lxml, or html5lib
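If these libraries are not installed yet, a single pip command installs all three (assuming pip points at the Python interpreter you plan to use):
pip install selenium beautifulsoup4 pandas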
Step 1: Find the desired URL to scrape
The initial step is to find the URL that you want to scrape. Here we are extracting mobile phone details from Flipkart. The URL of this page is https://www.flipkart.com/search?q=iphones&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off.
Step 2: Inspecting the page
It is necessary to inspect the page carefully because the data is contained within HTML tags, and we need to identify the tags that hold it. To inspect the page, right-click on the element and choose "Inspect".
Step 3: Find the data for extracting
Extract the Price, Name, and Rating, each of which is contained within its own "div" tag.
Step 4: Write the Code
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

# Request the webpage
myurl = "https://www.flipkart.com/search?q=iphones&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off"
uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()
page_soup = soup(page_html, features="html.parser")

# containers holds the div of every product card on the page
containers = page_soup.find_all("div", {"class": "_3O0U0u"})
# container = containers[0]
# print(soup.prettify(container))
#
# price = container.find_all("div", {"class": "col col-5-12 _2o7WAb"})
# print(price[0].text)
#
# ratings = container.find_all("div", {"class": "niH0FQ"})
# print(ratings[0].text)
#
# print(len(containers))
# print(container.div.img["alt"])

# Creating a CSV file that will store all the data
filename = "product1.csv"
f = open(filename, "w")
headers = "Product_Name,Pricing,Ratings\n"
f.write(headers)

for container in containers:
    product_name = container.div.img["alt"]

    price_container = container.find_all("div", {"class": "col col-5-12 _2o7WAb"})
    price = price_container[0].text.strip()

    rating_container = container.find_all("div", {"class": "niH0FQ"})
    ratings = rating_container[0].text

    # print("product_name:" + product_name)
    # print("price:" + price)
    # print("ratings:" + str(ratings))

    # clean up the price string: drop commas, the rupee symbol, and the trailing EMI text
    edit_price = ''.join(price.split(','))
    sym_rupee = edit_price.split("₹")
    add_rs_price = "Rs" + sym_rupee[1]
    split_price = add_rs_price.split("E")
    final_price = split_price[0]

    # keep only the numeric part of the rating
    split_rating = str(ratings).split(" ")
    final_rating = split_rating[0]

    print(product_name.replace(",", "|") + "," + final_price + "," + final_rating + "\n")
    f.write(product_name.replace(",", "|") + "," + final_price + "," + final_rating + "\n")

f.close()
Output:

We scraped the details of the iPhones and saved them in the CSV file, as you can see in the output. In the above code, we commented out a few lines used for testing; you can uncomment them and observe the output.
Conclusion
In this tutorial, we learned about web scraping, from the basic concepts to practical examples, including a sample scrape of the e-commerce site Flipkart. We discussed the legality of web scraping so that you can stay within the law, and we covered its uses, such as dynamic price monitoring, market trend analysis, social media scraping, email gathering, news and content monitoring, and research and development. We also looked at why Python suits web scraping: dynamic typing, a vast collection of libraries, less code, and an open-source community.