How To Develop Your First Web Crawler Using Python Scrapy
Crawlers for searches across multiple pages
In this post, I am going to write a web crawler that will scrape data from OLX’s Electronics & Appliances items. But before I get into the code, here’s a brief intro to Scrapy itself.
What is Scrapy?
From Wikipedia:
Scrapy (pronounced skray-pee)[1] is a free and open source web crawling framework, written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general purpose web crawler.[2] It is currently maintained by Scrapinghub Ltd., a web scraping development and services company.
Creating a Project
Scrapy organizes work into projects, each of which can contain multiple crawlers, or spiders. This is especially helpful if you are writing crawlers for different sections or subdomains of a site. So, first create the project:
Adnans-MBP:ScrapyCrawlers AdnanAhmad$ scrapy startproject olx
New Scrapy project 'olx', using template directory '//anaconda/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Development/PetProjects/ScrapyCrawlers/olx
You can start your first spider with:
cd olx
scrapy genspider example example.com
Creating Your Crawler
I ran the command scrapy startproject olx, which created a project named olx and printed helpful information about the next steps. Now go into the newly created folder and execute the command to generate your first spider, passing the name of the spider and the domain of the site to be crawled:
Adnans-MBP:ScrapyCrawlers AdnanAhmad$ cd olx
Adnans-MBP:olx AdnanAhmad$ scrapy genspider electronics www.olx.com.pk
Created spider 'electronics' using template 'basic' in module:
olx.spiders.electronics
I generated the code of my first spider with the name electronics since I am accessing the electronics section of OLX. You can name your spider anything you want.
The final project structure will be something like the example below: