19th Nov 2020

Let’s scrape website data ethically

AI

Written By, Ayaz Saiyed

Overview:


As we know, everything is possible in 2020!

So, let’s see how to scrape data from a website for educational purposes, legally and ethically, without using any API or third-party service.

Note – Don’t use this for unethical purposes. Scraping may be against the law if the website’s terms and conditions restrict access to its data.

What actually is Web Scraping?

Web scraping is a technique for automatically accessing and extracting large amounts of information from a public website, without using an API or human intervention.

Web scraping uses automated techniques to retrieve hundreds, millions, or even billions of data points from the internet.

Web scraping in Real World

Price Comparison: Price comparison services use web scraping, often through tools such as ParseHub, to collect data from online shopping websites and compare the prices of products.

Email address gathering: Many companies use web scraping to collect email addresses for lead generation and then send bulk emails.

Social Media Scraping: Web scraping can also be used to collect data from websites such as Twitter to find out what’s trending in the world.

Research and Development: Web scraping can be used to collect large amounts of data (statistics, general information, medical stats, etc.) from websites, which can then be analyzed by algorithms or used in research work.

Job listings: Job openings, government newsletters, etc. are scraped from different websites and listed in one place so that users can access them easily.

Two backbones of the scraping process

Crawler

A web crawler, also known as a “spider”, is an automated program that browses the internet, following links to index pages and find the required content far faster than any human could.

Scraper

A web scraper is a specialized tool designed to extract the required data from a web page accurately and quickly. Web scrapers vary widely in functionality, design and complexity, depending on the size of the project.
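To make the split between the two roles concrete, here is a minimal sketch. It uses only Python’s standard library and a made-up inline page, so the class names (LinkCollector, PriceCollector), the example.com URLs and the "price" class are all hypothetical and only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Crawler side: collect the links a page points to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, href))


class PriceCollector(HTMLParser):
    """Scraper side: pull one specific field out of a page."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and dict(attrs).get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            self.prices.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False


# Hypothetical page content, invented for this example.
SAMPLE_PAGE = """
<html><body>
  <a href="/item/1">Item 1</a>
  <a href="https://example.com/item/2">Item 2</a>
  <span class="price">19.99</span>
</body></html>
"""

crawler = LinkCollector("https://example.com")
crawler.feed(SAMPLE_PAGE)

scraper = PriceCollector()
scraper.feed(SAMPLE_PAGE)

print(crawler.links)   # pages a crawler would visit next
print(scraper.prices)  # data a scraper would keep
```

In a real project the crawler would feed each discovered link back into a queue of pages to visit, while the scraper would run on every page fetched.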

Let’s do some scraping

As COVID-19 is still around us, let’s try our scraping process on it.

Here we will scrape real-time COVID-19 data using Python from a website named Worldometer. It publishes regularly updated counts of total cases, deaths and recoveries.

Let’s make a Python script using two main libraries, bs4 (BeautifulSoup) and requests, to scrape the COVID-19 counts from Worldometer.

Import the necessary libraries

[Image: import-libraries]

Here, the requests module is used to send HTTP requests to the server and receive the HTTP responses.

We will use requests to send an HTTP request to the website’s URL and get a response back, and then use the Beautiful Soup (bs4) module to extract the useful content from that response.
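As a rough sketch of this step, the imports might look like the following. It assumes both packages are installed (for example with pip install requests beautifulsoup4), and the inline HTML string is a made-up stand-in for a real response body.

```python
import requests                # sends HTTP requests and returns the responses
from bs4 import BeautifulSoup  # parses HTML so that it can be searched

# A tiny inline document stands in for a real response body here.
html = "<html><body><h1>Hello, scraper</h1></body></html>"

# "html.parser" is Python's built-in parser, so no extra package is needed.
soup = BeautifulSoup(html, "html.parser")
heading = soup.h1.get_text()
print(heading)
```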

Let’s send a request to the website

[Image: send-request]

Here, we initialized a header for the request and sent it to the URL of the Worldometer website.

And yes, don’t forget to set a timeout when sending the request.

The timeout is not a time limit on the entire response download. Instead, an exception is raised if the server has not issued a response within the specified number of seconds.

A value such as timeout=5 is usually enough if you have a stable internet connection.
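The request step with a header and a timeout could be sketched roughly like this. The User-Agent string and the fetch_page() helper are assumptions made for illustration, since the original script is only shown as an image, and the URL is assumed to be Worldometer’s coronavirus page.

```python
import requests

# Assumed values: the exact URL and header of the original script are not
# given in the text.
URL = "https://www.worldometers.info/coronavirus/"
HEADERS = {"User-Agent": "Mozilla/5.0 (educational scraping example)"}


def fetch_page(url=URL, timeout=5):
    """Return the page HTML, or None if the request fails or times out."""
    try:
        response = requests.get(url, headers=HEADERS, timeout=timeout)
        # Treat HTTP error statuses (4xx, 5xx) as failures too.
        response.raise_for_status()
    except requests.exceptions.Timeout:
        print(f"No response from {url} within {timeout} seconds")
        return None
    except requests.exceptions.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text
```

Calling fetch_page() with no arguments would request the assumed URL. Catching the Timeout exception separately keeps a slow server from crashing the whole script.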

Time to write a function, which can scrape website data

[Image: write-function]

If you use Chrome’s Developer Tools to inspect the element that shows the number of cases or deaths, you will find that it is a <div> tag with the class maincounter-number. That is the tag we use to fetch the data.

Now, let’s create a function named extract_global, which finds all the <div> tags in the source code of the Worldometer page and keeps only those with the class maincounter-number.
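A minimal sketch of such a function is shown here. It assumes extract_global takes the page HTML as a string, which the text does not confirm. The sample markup only imitates the structure described above, and its numbers are made-up placeholders, not real figures.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the described structure; the numbers are
# placeholders, not real statistics.
SAMPLE_HTML = """
<div id="maincounter-wrap">
  <h1>Coronavirus Cases:</h1>
  <div class="maincounter-number"><span>1,000</span></div>
</div>
<div id="maincounter-wrap">
  <h1>Deaths:</h1>
  <div class="maincounter-number"><span>20</span></div>
</div>
<div id="maincounter-wrap">
  <h1>Recovered:</h1>
  <div class="maincounter-number"><span>900</span></div>
</div>
"""


def extract_global(html):
    """Return the text of every <div class="maincounter-number"> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ (with the trailing underscore) filters by CSS class.
    counters = soup.find_all("div", class_="maincounter-number")
    return [div.get_text(strip=True) for div in counters]


print(extract_global(SAMPLE_HTML))
```

On the live page the same function would receive the text of the HTTP response instead of SAMPLE_HTML.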

[Image: maincounter-number]

So, here we go.

[Image: scrapped-img]

We have scraped the required data from the Worldometer website. Let’s check it by printing it on the command prompt.

[Image: stats_global]

Just add a few lines to the existing function, stats_global(), to assign the scraped values to variables and pass them to the function that prints them.

Now, add the final function ScrapedResult() to print results.
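One possible shape for these two functions is sketched below. The signatures, the format_result() helper and the order of the counters (cases, deaths, recovered) are assumptions, because the original code is only available as an image. The values passed in are made-up placeholders.

```python
def format_result(cases, deaths, recovered):
    """Build the text report for the three global counters (hypothetical helper)."""
    return (
        "COVID-19 global stats\n"
        f"Total cases : {cases}\n"
        f"Deaths      : {deaths}\n"
        f"Recovered   : {recovered}"
    )


def ScrapedResult(cases, deaths, recovered):
    """Print the scraped counters on the command prompt."""
    print(format_result(cases, deaths, recovered))


def stats_global(counters):
    """Assign the scraped values to variables and pass them to ScrapedResult()."""
    # Assumed order of the counters on the page: cases, deaths, recovered.
    cases, deaths, recovered = counters
    ScrapedResult(cases, deaths, recovered)


# Made-up placeholder values, not real figures.
stats_global(["1,000", "20", "900"])
```

Keeping the formatting in a separate helper makes the output easy to check without capturing what is printed.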

[Image: ScrapedResult]

And here are the scraped real-time details from the live Worldometer website.

[Image: realtime-details]

Now that you have the data, you can use it as you need.

In the same way, I fetched all the data and presented it interactively on a website.

[Image: live-status]

Check it out live here – https://live-covid-tracker.herokuapp.com

Conclusion

That concludes our scraping walkthrough. I suggest you do not scrape any website without first reading its terms and policies about data access and privacy.

Techno Trivia – ‘robots.txt’ is a file used by websites to let ‘bots’ know if or how the site should be crawled and scraped.

To check whether a website allows scraping, look at its ‘robots.txt’ file.
Usage – [websitelink]/robots.txt ( For ex – https://yudiz.com/robots.txt )
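The same check can be automated with Python’s standard urllib.robotparser module. This is a small sketch: the rules and the "my-scraper" user agent are made up for illustration, and they are parsed from a string so that the example does not need network access.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration. For a real site you would instead call
# parser.set_url("https://<site>/robots.txt") followed by parser.read().
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS.splitlines())

# "my-scraper" is a hypothetical user agent name.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/public/page")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/page")

print(allowed_public)
print(allowed_private)
```

A scraper can call can_fetch() before every request and skip any URL that the site disallows.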

Hope you found this blog helpful. Keep scraping!

Written By, Ayaz Saiyed

Python Developer at Yudiz Solutions Pvt. Ltd