In this post of ScrapingTheFamous, I am going to write a scraper that will scrape data from eBay. eBay is an online auction site where people put up listings to sell stuff via auctions.

Like before, we will be writing two scripts: one to fetch listing URLs and store them in a text file, and the other to parse those links. The data will be stored in JSON format for further processing.

I will be using the Scraper API service for parsing purposes, which frees me from all worries about blocking and rendering dynamic sites since it takes care of everything. …
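
To give an idea of the fetch step, here is a minimal sketch, assuming you have your own Scraper API key; the search URL and the a.s-item__link selector are illustrative and should be verified against eBay's live markup.

import requests
from bs4 import BeautifulSoup

API_KEY = 'YOUR_SCRAPERAPI_KEY'  # assumption: your own Scraper API key

def fetch_listing_urls(search_url):
    # Route the request through Scraper API so it handles proxies and blocking
    payload = {'api_key': API_KEY, 'url': search_url}
    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)
    soup = BeautifulSoup(r.text, 'lxml')
    # 'a.s-item__link' is an illustrative selector, not guaranteed to be current
    return [a['href'] for a in soup.select('a.s-item__link') if a.get('href')]

if __name__ == '__main__':
    links = fetch_listing_urls('https://www.ebay.com/sch/i.html?_nkw=rubik+cube')
    with open('links.txt', 'w') as f:
        f.write('\n'.join(links))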


In this post of ScrapingTheFamous, I am going to write a scraper that will scrape data from Amazon. I do not need to tell you what Amazon is. You are here because you already know about it 🙂

So, we are going to write two different scripts: one will be fetch.py, which fetches the URLs of individual listings and saves them in a text file. The other, parse.py, will have a function that takes an individual listing URL, scrapes the data, and saves it in JSON format.

I will be using the Scraper API service for parsing purposes, which frees me from all worries about blocking and rendering dynamic sites since it takes care of everything. …
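
A minimal sketch of what the parse.py side could look like, again assuming your own Scraper API key; the #productTitle selector is Amazon's usual title node but should be treated as illustrative.

import json
import requests
from bs4 import BeautifulSoup

API_KEY = 'YOUR_SCRAPERAPI_KEY'  # assumption: your own Scraper API key

def parse(url):
    # Fetch the listing through Scraper API, then extract fields with BeautifulSoup
    payload = {'api_key': API_KEY, 'url': url}
    r = requests.get('http://api.scraperapi.com', params=payload, timeout=60)
    soup = BeautifulSoup(r.text, 'lxml')
    title = soup.select_one('#productTitle')  # illustrative selector
    return {'url': url, 'title': title.get_text(strip=True) if title else None}

if __name__ == '__main__':
    with open('links.txt') as f:
        records = [parse(u.strip()) for u in f if u.strip()]
    with open('amazon.json', 'w') as out:
        json.dump(records, out, indent=2)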


Learn how to create and consume Apache Avro based data for better and more efficient transfer in your Python applications

(Image source: https://unsplash.com/photos/LqKhnDzSF-8)

In this post, I am going to talk about Apache Avro, an open-source data serialization system that is being used by tools like Spark, Kafka, and others for big data processing.

What is Apache Avro?

According to Wikipedia:

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data, and a wire format for communication between Hadoop nodes, and from client programs to the Hadoop services. Avro uses a schema to structure the data that is being encoded. …
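
To make this concrete, here is a minimal sketch using the fastavro library (one of several Avro libraries for Python; the schema and records are made up for illustration).

# pip install fastavro
from fastavro import parse_schema, writer, reader

# A schema is plain JSON-style data: field names and types
schema = parse_schema({
    'name': 'User',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': 'string'},
        {'name': 'age', 'type': 'int'},
    ],
})

records = [{'name': 'Adnan', 'age': 40}, {'name': 'Ali', 'age': 30}]

# Serialize to a compact binary file; the schema is embedded alongside the data
with open('users.avro', 'wb') as out:
    writer(out, schema, records)

# Read it back; each row comes out as a plain dict
with open('users.avro', 'rb') as fo:
    for user in reader(fo):
        print(user)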


In this post, I am going to talk about Django Rest Framework, or DRF. DRF is used to create RESTful APIs in Django, which can later be consumed by various apps: mobile, web, desktop, etc. We will discuss how to install DRF on your machine and then write our APIs for a system.

Before we discuss DRF, let’s talk a bit about REST itself.

What is REST?

From Wikipedia:

Representational state transfer (REST) is a software architectural style that defines a set of constraints to be used for creating Web services. Web services that conform to the REST architectural style, called RESTful Web services, provide interoperability between computer systems on the Internet. RESTful Web services allow the requesting systems to access and manipulate textual representations of Web resources by using a uniform and predefined set of stateless operations. …
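
As a taste of what DRF code looks like, here is a minimal, illustrative sketch; the Book model and its fields are assumptions for the example, not the system we will build later.

# pip install djangorestframework
from django.db import models
from rest_framework import serializers, viewsets, routers

class Book(models.Model):
    title = models.CharField(max_length=200)
    author = models.CharField(max_length=100)

class BookSerializer(serializers.ModelSerializer):
    # Translates Book instances to/from JSON
    class Meta:
        model = Book
        fields = ['id', 'title', 'author']

class BookViewSet(viewsets.ModelViewSet):
    # Provides list/retrieve/create/update/delete endpoints in one class
    queryset = Book.objects.all()
    serializer_class = BookSerializer

# In urls.py, a router generates the URL patterns for the viewset
router = routers.DefaultRouter()
router.register(r'books', BookViewSet)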


I have been covering web scraping on this blog for a long time, but the posts have mostly been in Python; be it requests, Selenium, or the Scrapy framework, all were based on the Python language. But scraping is not limited to a specific language: any language that provides APIs or libraries for an HTTP client and an HTML parser can give you web scraping facilities. Go also gives you the ability to write web scrapers. Go is a compiled, statically typed language, which can be very beneficial for writing efficient, fast, and scalable web scrapers. …


Today I present another library I made in the Go language, called Fehrist.

From the Github README:

Fehrist is a pure Go library for indexing different types of documents. Currently, it supports only CSV and JSON, but its flexible architecture gives you the liberty to add more document types. Fehrist (فہرست) is an Urdu word for Index. Similar terminology is used in Arabic (فھرس) and Farsi (فہرست) as well.

Fehrist is based on an Inverted Index data structure for indexing purposes.

Why did I make it?

It seems I have fallen in love with Golang after Python. Go is an opinionated language that does not let you get distracted by various small decisions. The reason for making this particular library is nothing but learning about indexing: how it works and what algorithms are available. I picked the inverted index due to its flexibility, and because it is relatively easier to implement than alternatives like B+ trees. …
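
To show the idea behind an inverted index, here is a conceptual sketch in Python (not Fehrist's actual Go code): it maps each token to the set of documents that contain it, so lookups go from term to documents instead of scanning every document.

from collections import defaultdict

def build_index(docs):
    # token -> set of document IDs containing that token
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {1: 'go is fun', 2: 'python is fun too', 3: 'go fast'}
index = build_index(docs)
print(index['go'])   # {1, 3}
print(index['fun'])  # {1, 2}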


In this post, I am going to discuss another cloud-based scraping tool that takes care of many of the issues you usually face while scraping websites: ScrapingBee.

What is ScrapingBee?

If you visit their website, you will find something like below:

ScrapingBee API handles headless browsers and rotates proxies for you.

As the tagline suggests, it offers everything you need to deal with the issues you usually come across while writing your scrapers, especially the availability of proxies and headless scraping. No installation of web drivers for Selenium, yay!

Development

ScrapingBee is based on a REST API, hence it can be consumed in any programming language. Since this post is related to Python, I will mainly be focusing on the requests library to use this tool. …
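
A minimal sketch with requests, assuming you have signed up and have your own API key; parameters beyond api_key and url (such as render_js below) should be checked against ScrapingBee's docs for your plan.

import requests

API_KEY = 'YOUR_SCRAPINGBEE_KEY'  # assumption: your own API key

# A single GET call; ScrapingBee runs the headless browser and rotates proxies
response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': API_KEY,
        'url': 'https://example.com',
        'render_js': 'true',  # ask ScrapingBee to render JavaScript
    },
    timeout=60,
)
print(response.status_code)
print(response.text[:500])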


I got to know about Golang a year back and wrote some toy programs in it while learning, but then I gave up as I was not really enjoying it despite liking the Go language. It is very much like Python but with better performance because it's compiled.

Recently I again wished to do something in Go. This time I did not want to go back and practice topic by topic. I thought instead to do a project and learn whatever I needed to get the thing done.

I have used Memcached in the past in PHP and really liked it, so I thought to come up with a cache mechanism in Go. While learning, I found out that Memcached uses the LRU cache technique. …
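
The post builds the cache in Go, but the LRU mechanism itself is easy to sketch; here is the idea in Python, using an OrderedDict to track recency (illustrative only, not the Go implementation).

from collections import OrderedDict

class LRUCache:
    # Evicts the least-recently-used key once capacity is exceeded
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the oldest entry

cache = LRUCache(2)
cache.put('a', 1)
cache.put('b', 2)
cache.get('a')     # touching 'a' makes 'b' the oldest
cache.put('c', 3)  # evicts 'b'
print(list(cache.items))  # ['a', 'c']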


In this post, I am going to introduce another ETL tool for your Python applications, called Apache Beam.

What is Apache Beam?

According to Wikipedia:

Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing.

Unlike Airflow and Luigi, Apache Beam is not a server. It is rather a programming model that contains a set of APIs. Currently, they are available for Java, Python and Go programming languages. A typical Apache Beam based pipeline looks like below:

(Image Source: https://beam.apache.org/images/design-your-pipeline-linear.svg)

From the left, the data is acquired (extract) from a database, then it goes through multiple steps of transformation, and finally it is stored (load) into a database. …
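
Here is a minimal sketch of such a linear pipeline using Beam's Python SDK; the input data and transform are stand-ins for illustration.

# pip install apache-beam
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Extract' >> beam.Create(['alice,30', 'bob,25'])  # stand-in for a real source
        | 'Transform' >> beam.Map(lambda row: row.upper())  # one transformation step
        | 'Load' >> beam.io.WriteToText('out', file_name_suffix='.txt')  # stand-in sink
    )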


Like many parts of the world, Pakistan has also suffered from the coronavirus pandemic. As of March 23, 800+ cases had been recorded, of which 6 recovered and 6 died. Many cities around the world have been locked down to avoid the spread of the COVID-19 disease. Many companies are now asking employees to work from home. People are confined to their homes and can't roam around. It is not easy to spend time at home, especially when you are not used to working from home. Men usually are not used to staying home. …

About

Adnan Siddiqi

Pakistani | Husband | Father | Software Consultant | Developer | blogger. I occasionally try to make stuff with code. http://adnansiddiqi.me
