How To Develop Your First Web Crawler Using Python Scrapy
Crawlers for searches across multiple pages
In this post, I am going to write a web crawler that will scrape data from OLX’s Electronics & Appliances items. But before I get into the code, here’s a brief intro to Scrapy itself.
What is Scrapy?
From Wikipedia:
Scrapy (pronounced skray-pee)[1] is a free and open source web crawling framework, written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general purpose web crawler.[2] It is currently maintained by Scrapinghub Ltd., a web scraping development and services company.
Creating a Project
Scrapy introduces the idea of a project with multiple crawlers or spiders in a single project. This concept is especially helpful if you are writing multiple crawlers for different sections or subdomains of a site. So, first create the project:
Adnans-MBP:ScrapyCrawlers AdnanAhmad$ scrapy startproject olx
New Scrapy project 'olx', using template directory '//anaconda/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Development/PetProjects/ScrapyCrawlers/olx

You can start your first spider with:
    cd olx
    scrapy genspider example example.com
Creating Your Crawler
I ran the command scrapy startproject olx, which created a project with the name olx and printed helpful information about the next steps. Go to the newly created folder and then execute the command for generating the first spider with the name and the domain of the site to be crawled:
Adnans-MBP:ScrapyCrawlers AdnanAhmad$ cd olx/
Adnans-MBP:olx AdnanAhmad$ scrapy genspider electronics www.olx.com.pk
Created spider 'electronics' using template 'basic' in module:
olx.spiders.electronics
I generated the code of my first spider with the name electronics since I am accessing the electronics section of OLX. You can name your spider anything you want.
The final project structure will be something like the example below:
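It follows the standard Scrapy layout; the exact set of generated files can vary a little between Scrapy versions, but it will look roughly like this:

olx/
    scrapy.cfg           # deploy/configuration file
    olx/                 # the project's Python module
        __init__.py
        items.py         # model definitions for scraped items
        pipelines.py     # item pipelines
        settings.py      # project settings
        spiders/         # all spiders live here
            __init__.py
            electronics.py   # the spider generated above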
As you can see, there is a separate folder only for spiders. You can add multiple spiders within a single project. Let's open the electronics.py spider file. When you open it, you will see something like this:
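The skeleton below is roughly what the basic template produces; the exact boilerplate differs slightly between Scrapy versions, but the important parts are the same.

import scrapy


class ElectronicsSpider(scrapy.Spider):
    name = "electronics"
    allowed_domains = ["www.olx.com.pk"]
    start_urls = ['https://www.olx.com.pk/']

    def parse(self, response):
        pass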
As you can see, ElectronicsSpider is a subclass of scrapy.Spider. The name property is the name of the spider, which was given in the spider generation command; this name is used when running the crawler itself. The allowed_domains property tells the crawler which domains it is allowed to access, and start_urls is where you list the initial URLs to be visited first. Together with the file structure, these properties are a good way to set the boundaries of your crawler.
The parse method, as the name suggests, will parse the content of the page being accessed. Since I want to write a crawler that goes to multiple pages, I am going to make a few changes.
In order to make the crawler navigate to several pages, I subclassed my crawler from CrawlSpider instead of scrapy.Spider. This class makes crawling many pages of a site easier. You can do something similar with the generated code, but you would need to take care of the recursion to navigate to the next pages yourself.
The next step is to set the rules variable, where you define the rules for navigating the site. LinkExtractor takes parameters that draw the navigation boundaries. Here I am using the restrict_css parameter to point at the CSS class of the NEXT-page link. If you go to this page and inspect the element, you will find something like this:
pageNextPrev is the class that will be used to fetch the links to the next pages. The callback parameter tells Scrapy which method to use to access the page elements. We will work on this method soon.
Do remember that you need to change the name of the method from parse() to parse_item(), or whatever you choose, to avoid overriding the base class method; otherwise your rule will not work, even if you set follow=True.
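Putting these pieces together, the spider now looks roughly like the sketch below. The start URLs are the OLX listing sections I want to crawl, and the .pageNextPrev selector is based on the markup discussed above, so treat both as assumptions you may need to adjust.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ElectronicsSpider(CrawlSpider):
    name = "electronics"
    allowed_domains = ["www.olx.com.pk"]
    start_urls = [
        'https://www.olx.com.pk/computers-accessories/',
        'https://www.olx.com.pk/tv-video-audio/',
        'https://www.olx.com.pk/games-entertainment/',
    ]

    rules = (
        # Follow pagination links carrying the pageNextPrev class and
        # hand every listing page to parse_item.
        Rule(LinkExtractor(restrict_css=('.pageNextPrev',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('Processing..' + response.url)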
So far, so good; let's test the crawler I have made so far. Go to your terminal in your project directory and type:
scrapy crawl electronics
The third parameter is actually the name of the spider, which was set earlier in the name property of the ElectronicsSpider class. In your terminal, you will find lots of useful information that is helpful for debugging your crawler. You can disable this output if you don't want to see the debugging information; the command is the same with the --nolog switch added.
scrapy crawl --nolog electronics
If you run it now, it will print something like:
Adnans-MBP:olx AdnanAhmad$ scrapy crawl --nolog electronics
Processing..https://www.olx.com.pk/computers-accessories/?page=2
Processing..https://www.olx.com.pk/tv-video-audio/?page=2
Processing..https://www.olx.com.pk/games-entertainment/?page=2
Processing..https://www.olx.com.pk/computers-accessories/
Processing..https://www.olx.com.pk/tv-video-audio/
Processing..https://www.olx.com.pk/games-entertainment/
Processing..https://www.olx.com.pk/computers-accessories/?page=3
Processing..https://www.olx.com.pk/tv-video-audio/?page=3
Processing..https://www.olx.com.pk/games-entertainment/?page=3
Processing..https://www.olx.com.pk/computers-accessories/?page=4
Processing..https://www.olx.com.pk/tv-video-audio/?page=4
Processing..https://www.olx.com.pk/games-entertainment/?page=4
Processing..https://www.olx.com.pk/computers-accessories/?page=5
Processing..https://www.olx.com.pk/tv-video-audio/?page=5
Processing..https://www.olx.com.pk/games-entertainment/?page=5
Processing..https://www.olx.com.pk/computers-accessories/?page=6
Processing..https://www.olx.com.pk/tv-video-audio/?page=6
Processing..https://www.olx.com.pk/games-entertainment/?page=6
Processing..https://www.olx.com.pk/computers-accessories/?page=7
Processing..https://www.olx.com.pk/tv-video-audio/?page=7
Processing..https://www.olx.com.pk/games-entertainment/?page=7
Since I set follow=True, the crawler will check the rule for the NEXT page and keep navigating until it hits a page where the rule does not match, usually the last page of the listing.
Now, imagine writing similar logic on your own, without Scrapy. First, I would have to write code to spawn multiple processes. I would also have to write code to navigate not only to the next page but also to keep my script inside the boundaries by not accessing unwanted URLs. Scrapy takes all these burdens off my shoulders and lets me focus on the main logic; that is, writing the crawler to extract information.
Now I am going to write code that will fetch individual item links from the listing pages. I am going to modify the code in my parse_item method.
Here I am fetching links by using the .css method of the response. As I said, you can use xpath as well; it's up to you. In this case, it's pretty simple:
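A minimal sketch of the updated method, written as it would sit inside the spider class; the .large > .detailsLink selector is my reading of OLX's listing markup at the time and may need adjusting.

# Inside the ElectronicsSpider class (requires import scrapy at the top of the file).
def parse_item(self, response):
    # Each entry repeats its link in the img and h3 tags, so restrict the
    # selector to the .large parent to get one detailsLink per entry.
    item_links = response.css('.large > .detailsLink::attr(href)').extract()
    for link in item_links:
        # urljoin handles both relative and absolute hrefs.
        yield scrapy.Request(response.urljoin(link),
                             callback=self.parse_detail_page)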
The anchor link has the class detailsLink. If I only use response.css('.detailsLink'), then it's going to pick duplicate links for a single entry due to the repetition of the link in the img and h3 tags. I also referred to the parent class large to get unique links, and used ::attr(href) to extract the href part, which is the link itself. I then used the extract() method.
The reason to use this method is that .css and .xpath return a SelectorList object, and extract() returns the actual extracted data, a list of strings, for further processing. Finally, I am yielding the links in scrapy.Request with a callback. I have not checked the inner code of Scrapy, but most likely they use yield instead of return because you can yield multiple items; since the crawler needs to take care of multiple links together, yield is the best choice here.
The parse_detail_page method, as the name explains, will parse individual information from the detail page. So what actually is happening is:
- You get a list of entries in parse_item.
- You pass them on in a callback method for further processing.
Since it was only a two-level traversal, I was able to reach the lowest level with the help of two methods. If I were going to start crawling from the main page of OLX, I would have to write three methods here: the first two to fetch subcategories and their entries, and the last one to parse the actual information. Got it?
Finally, I am going to parse the actual information, which is available on one of the entries like this one.
Parsing information from this page is not different, but there is something that has to be done to store the parsed information: we need to define a model for our data. That means we need to tell Scrapy what information we want to store for later use. Let's edit the items.py file, which was generated earlier by Scrapy.
OlxItem is the class in which I will set the required fields to hold the information. I am going to define three fields for my model class: the title of the post, the price, and the URL itself.
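A minimal sketch of items.py with those three fields, matching the names used later in the crawler:

import scrapy


class OlxItem(scrapy.Item):
    # The three pieces of information to keep for every listing.
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()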
Let's get back to the crawler class and modify parse_detail_page.
Now, one approach is to start writing code, test it by running the entire crawler, and see whether you're on the right track or not, but there's another awesome tool provided by Scrapy.
Scrapy Shell
Scrapy Shell is a command-line tool that gives you the opportunity to test your parsing code without running the entire crawler. Unlike the crawler, which goes to all the links, Scrapy Shell loads the DOM of an individual page for data extraction. In my case, I did the following:
Adnans-MBP:olx AdnanAhmad$ scrapy shell https://www.olx.com.pk/item/asus-eee-pc-atom-dual-core-4cpus-beautiful-laptops-fresh-stock-IDUVo6B.html#4001329891
Now I can easily test the code without hitting the same URL again and again. I fetched the title by doing this:
In [8]: response.css('h1::text').extract()[0].strip()
Out[8]: u"Asus Eee PC Atom Dual-Core 4CPU's Beautiful Laptops fresh Stock"
You can find the familiar response.css here. Since the entire DOM is available, you can play with it.
And I fetched the price by doing this:
In [11]: response.css('.pricelabel > strong::text').extract()[0]
Out[11]: u'Rs 10,500'
There is no need to do anything to fetch the URL, since response.url returns the currently accessed URL.
Now that all the code is checked, it's time to incorporate it into parse_detail_page:
# OlxItem comes from olx.items; import it at the top of the spider file.
def parse_detail_page(self, response):
    title = response.css('h1::text').extract()[0].strip()
    price = response.css('.pricelabel > strong::text').extract()[0]

    item = OlxItem()
    item['title'] = title
    item['price'] = price
    item['url'] = response.url
    yield item
After parsing the required information, an OlxItem instance is created and its properties are set. Now it's time to run the crawler and store the information; there's a slight modification to the command:
scrapy crawl electronics -o data.csv -t csv
I am passing the file name and the file format for saving the data. Once run, it will generate a CSV file for you. Easy, isn't it? If you were writing the crawler on your own, you would have to write your own routine for saving the data.
But wait! It does not end here; you can even get the data in JSON format. All you have to do is pass json with the -t switch.
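For example, the earlier command becomes:

scrapy crawl electronics -o data.json -t json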
Scrapy provides you with another feature. Passing a fixed file name does not make much sense in real-world scenarios, so how could I have some way to generate unique file names? Well, for that you need to modify the settings.py file and add these two entries:
FEED_URI = 'data/%(name)s/%(time)s.json'
FEED_FORMAT = 'json'
Here I am giving the pattern of my file name: %(name)s is the name of the crawler itself, and %(time)s is a timestamp. You may learn further about it here. Now when I run scrapy crawl --nolog electronics or scrapy crawl electronics, it will generate a JSON file in the data folder, like this:
[
{"url": "https://www.olx.com.pk/item/acer-ultra-slim-gaming-laptop-with-amd-fx-processor-3gb-dedicated-IDUQ1k9.html", "price": "Rs 42,000", "title": "Acer Ultra Slim Gaming Laptop with AMD FX Processor 3GB Dedicated"},
{"url": "https://www.olx.com.pk/item/saw-machine-IDUYww5.html", "price": "Rs 80,000", "title": "Saw Machine"},
{"url": "https://www.olx.com.pk/item/laptop-hp-probook-6570b-core-i-5-3rd-gen-IDUYejF.html", "price": "Rs 22,000", "title": "Laptop HP Probook 6570b Core i 5 3rd Gen"},
{"url": "https://www.olx.com.pk/item/zong-4g-could-mifi-anlock-all-sim-supported-IDUYedh.html", "price": "Rs 4,000", "title": "Zong 4g could mifi anlock all Sim supported"},
...
]
Writing scrapers is an interesting journey, but you can hit a wall if the site blocks your IP. As an individual, you can't afford expensive proxies either. Scraper API provides an affordable and easy-to-use API that lets you scrape websites without any hassle. You do not need to worry about getting blocked, because Scraper API uses proxies by default to access websites. On top of that, you do not need to worry about Selenium either, since Scraper API also provides a headless browser. I have also written a post about how to use it.
Click here to sign up with my referral link or enter the promo code adnan10 to get a 10% discount. In case you do not get the discount, just let me know via email on my site and I'd be sure to help you out.