When it comes to gathering data, web scraping plays an instrumental role in collecting a large amount and variety of data from the internet. For data science enthusiasts, the data available on the web is like water in the ocean. For instance, suppose you are deciding whether to buy a product on an online platform – how useful would it be to scrape e-commerce sites and pull up buyers' reviews of that very product? The use cases are endless.
By the time you finish this article, you will know how to scrape using the Scrapy framework and will have scraped a live website yourself – let's get into it!
Overview of Scrapy
Scrapy is a Python-based web scraping framework used to extract the data you need from websites in a fast, simple, yet extensible way.
Getting started with Scrapy
Before getting started, make sure you have Python installed on your system.
You can check whether it is installed by typing "python" in your command prompt.
On a successful installation, the interpreter starts and prints your current Python version. Here, I have version 3.10.2 installed on my local system.
Now, using the following command, you can install the Scrapy framework on your local system.
pip install scrapy
Creating first scrapy project
Let’s create a new scrapy project by executing the following command.
scrapy startproject NameOfProject
In my case, I am using the project name "FirstScraper", so the command looks like this:
scrapy startproject FirstScraper
This will create a folder “FirstScraper” with the following structure:
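Assuming Scrapy's default project template, the generated structure typically looks like this:

```
FirstScraper/
    scrapy.cfg            # deploy configuration file
    FirstScraper/         # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # project pipelines
        settings.py       # project settings
        spiders/          # folder where your spiders live
            __init__.py
```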
For now, two entries in this structure are important:
- settings.py : Settings of your project.
- spiders/ : The folder where all your spiders are stored. When you ask Scrapy to run a spider, it looks for it in this folder.
Creating first spider in Scrapy
In Scrapy, a spider is essentially a Python file where all the scraping logic is written.
First of all, let's change into our FirstScraper directory and create our first spider.
Using the "genspider" command, Scrapy can generate a spider for us automatically.
It takes two arguments: the first is the name of your spider and the latter is the domain of the website to scrape. In my case:
scrapy genspider QuotesScraper quotes.toscrape.com
This will create a new spider file “QuotesScraper.py” in your spiders/ folder with the following basic template:
A few things to note here:
- name : The name of the spider; in our case it is "QuotesScraper".
- allowed_domains : A list of domains the spider is allowed to crawl; requests to URLs outside these domains are filtered out.
- parse(self, response) : This function is called with the response of each successful request, and through the response object we can extract all the data we want from the page.
Scraping a website using Scrapy
There is a website, Quotes to Scrape, typically used for learning scraping, and that is what we will scrape. From this site I want to scrape all the quotes and their authors' names. Accordingly, I have written XPath expressions in my parse function.
XPath: XPath is a query language for addressing specific elements in an HTML (or XML) document.
Now, the site can be crawled with the following command:
scrapy crawl QuotesScraper
Here, I have used the name of my spider, "QuotesScraper". In your case, it may be different.
Now, in our terminal, we can see all the scraped quotes along with their authors' names.
Now it is the time to see how Scrapy is different from other web scraping tools.
In Scrapy, with a single command we can export all the scraped data as a JSON, XML, or CSV file.
For example, if we want the same scraped data in CSV format, all we need to do is run the following command:
scrapy crawl SpiderName -O FileName.csv
The extension can be .csv, .json, or .xml; Scrapy infers the output format from it.
In my case, I want all the data in CSV format, so it looks like this:
scrapy crawl QuotesScraper -O ScrapedQuotes.csv
As a result, I got a file named “ScrapedQuotes.csv” in my current directory of Scrapy project which looks something like this.
Apart from this, Scrapy offers many other capabilities, such as following links, logging into forms (handling CSRF – cross-site request forgery – tokens), working around restrictions with user agents and proxies, ignoring robots.txt, and so on, all of which can be learned from its documentation.
To sum up, although there are myriad web crawling tools available today, Scrapy offers tremendous benefits compared to the others and can be a real boon for crawling websites in a fast, simple, yet extensible way.