Getting started with RocksDB and Python
In this post, I am going to discuss RocksDB.
RocksDB is an embeddable persistent key-value store system developed by Facebook. It was originally forked from LevelDB, which was created by Google.
According to Wikipedia:
RocksDB is a high performance embedded database for key-value data. It is a fork of Google’s LevelDB optimized to exploit many CPU cores, and make efficient use of fast storage, such as solid-state drives (SSD), for input/output (I/O) bound workloads. It is based on a log-structured merge-tree (LSM tree) data structure. It is written in C++ and provides official language bindings for C++, C, and Java; alongside many third-party language bindings.
RocksDB is particularly optimized for flash drives and other fast storage, giving low-latency data access. Like Redis, RocksDB keeps some data in memory, but unlike Redis it is not a server; it is an embeddable library, similar to SQLite. RocksDB is used extensively at Facebook for storing persistent data on SSDs, and by various services that serve online queries from hard drives.
The possibilities for using RocksDB are endless. For example, you may use it as a storage engine that holds the data needed to generate a personalized home page for each user. Instead of burdening the database with multiple SELECT queries per user, you may store that data in key/value format, where the UserID serves as the key and the value holds all the data in a format like JSON.
In this post, I cover basic RocksDB usage, independent of any particular use case, and how you can use it in your Python applications.
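To make the home page idea concrete, here is a minimal sketch of storing one JSON document per user. It is only an illustration: InMemoryDB is a hypothetical stand-in that mimics the put/get calls of rocksdb.DB so the snippet runs without RocksDB installed, and the user:<id> key layout, the function names, and the sample data are my own assumptions.

```python
import json


class InMemoryDB:
    """Hypothetical stand-in that mimics only put/get of rocksdb.DB (bytes in, bytes out)."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes):
        # Like rocksdb.DB.get, return None when the key is missing.
        return self._data.get(key)


def save_home_page(db, user_id: int, page: dict) -> None:
    # Assumed key layout: "user:<id>", encoded to bytes.
    key = f"user:{user_id}".encode("utf-8")
    db.put(key, json.dumps(page).encode("utf-8"))


def load_home_page(db, user_id: int):
    raw = db.get(f"user:{user_id}".encode("utf-8"))
    return None if raw is None else json.loads(raw.decode("utf-8"))


db = InMemoryDB()
save_home_page(db, 42, {"name": "Ada", "feed": ["post-1", "post-2"]})
home_page = load_home_page(db, 42)  # one lookup instead of several SELECTs
```

With python-rocksdb installed, passing a real rocksdb.DB object instead of InMemoryDB should work the same way, since only put and get are used.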
Installation and Setup
In order to use RocksDB in Python, you must have RocksDB installed on your system, and then with the help of RocksDB’s Python binding, you may access RocksDB in your programs. Since I did not want to mess with my Mac environment, I downloaded a Debian-based Python image and installed RocksDB in it. Below are the steps. First, install required dependencies and RocksDB itself:
apt install rocksdb-tools librocksdb5.17 librocksdb-dev libsnappy-dev liblz4-dev
As of now, librocksdb5.17 was the latest version available for me in Docker.
Then install the Python binding:
pip install python-rocksdb
I have connected VS Code to the remote container so that I can code directly within the container. Use this VS Code extension for this purpose and attach your container. This is how my VS Code looks after attaching it to a remote container and selecting a remote Docker-based Python interpreter.
Sweet, isn’t it?
Development
Let’s import the library and see whether it really works:
import rocksdb

if __name__ == "__main__":
    print(rocksdb)
If everything is installed correctly, the output looks like the following:
root@9f7d3fc73b74:/code# /usr/local/bin/python /code/main.py
<module 'rocksdb' from '/usr/local/lib/python3.9/site-packages/rocksdb/__init__.py'>
Let’s move forward:
if __name__ == "__main__":
    db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))
    db.put(b"a", b"ROFL")
    print(db.get(b"a").decode("utf-8"))
The first line opens the database with certain options. Here, I have set create_if_missing to True to avoid errors when the database does not exist yet. Then, I set the key a to the text ROFL. Notice that I am using the bytes type (the b prefix) for both keys and values: RocksDB works with byte strings rather than str or other data types. I later converted the value into a str by calling decode('utf-8').
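Since keys and values must be bytes, it can help to keep the conversions in one place. The helpers below are a small sketch with names of my own choosing (they are not part of python-rocksdb); the integer encoding is just one possible choice and assumes non-negative values below 2**64.

```python
def encode_text(text: str) -> bytes:
    # str -> bytes, ready to be passed to db.put / db.get
    return text.encode("utf-8")


def decode_text(raw: bytes) -> str:
    # bytes coming back from db.get -> str
    return raw.decode("utf-8")


def encode_int(number: int) -> bytes:
    # Fixed-width, big-endian encoding; assumes 0 <= number < 2**64.
    return number.to_bytes(8, byteorder="big")


def decode_int(raw: bytes) -> int:
    return int.from_bytes(raw, byteorder="big")


key = encode_text("a")        # same bytes as b"a"
value = encode_text("ROFL")   # same bytes as b"ROFL"
counter = encode_int(2022)    # an int turned into 8 bytes
```

The resulting bytes objects can be handed to db.put and db.get exactly like the b"..." literals used above.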
Let’s see what happens in the folder where the DB was created. The first thing I noticed, which surprised me, was that test.db is not actually a file but a folder.
root@9f7d3fc73b74:/code# ls -la
total 16
drwxr-xr-x 3 root root 4096 Oct 3 15:24 .
drwxr-xr-x 1 root root 4096 Oct 3 13:37 ..
-rw-r--r-- 1 root root 180 Oct 3 15:20 main.py
drwxr-xr-x 2 root root 4096 Oct 3 15:24 test.db
root@9f7d3fc73b74:/code#
When you execute cd test.db and list the files, it shows the following:
root@9f7d3fc73b74:/code/test.db# ls -l
total 152
-rw-r--r-- 1 root root 27 Oct 3 15:24 000003.log
-rw-r--r-- 1 root root 16 Oct 3 15:24 CURRENT
-rw-r--r-- 1 root root 37 Oct 3 15:24 IDENTITY
-rw-r--r-- 1 root root 0 Oct 3 15:24 LOCK
-rw-r--r-- 1 root root 15695 Oct 3 15:24 LOG
-rw-r--r-- 1 root root 13 Oct 3 15:24 MANIFEST-000001
-rw-r--r-- 1 root root 4721 Oct 3 15:24 OPTIONS-000005
It contains a log file, an options file, and a few more. Let’s view the content of the 000003.log file.
root@9f7d3fc73b74:/code/test.db# cat 000003.log
���aROFLroot@9f7d3fc73b74:/code/test.db#
As you may have figured, it is not storing data in plain-text format. You can still see a (the key) and ROFL (the value) embedded in a binary format. CURRENT points to the latest manifest file.
root@9f7d3fc73b74:/code/test.db# cat CURRENT
MANIFEST-000001
IDENTITY holds a unique ID for the database. In my case it shows:
root@9f7d3fc73b74:/code/test.db# cat IDENTITY
c55f9d31-f622-4335-8cbf-f3ca9ce324ef
Then there is a 0-byte LOCK file. In RocksDB, only a single process can open the database for writing at a time, hence only a single process can write data. The LOG file, as the name suggests, logs everything. MANIFEST-000001 did not have anything readable in it. The next file is OPTIONS-000005, which lists all the available options with their current values. Running grep on it shows the following:
root@9f7d3fc73b74:/code/test.db# cat OPTIONS-000005 | grep missing
create_missing_column_families=false
create_if_missing=true
As you can see, create_if_missing is set to true, which is expected since I passed it when opening the database.
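If you would rather inspect the options file from Python than from the shell, a plain line filter does the same job as grep. This is only a sketch: grep_options is a name I made up, and the demo writes a tiny stand-in file (with one invented line, unrelated_option=1) so that it runs anywhere; point it at your real OPTIONS-000005 instead.

```python
import tempfile
from pathlib import Path


def grep_options(options_path, needle: str) -> list:
    """Return the stripped lines of the options file that contain `needle`."""
    lines = Path(options_path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if needle in line]


# Demo on a small stand-in file; unrelated_option is an invented placeholder line.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "OPTIONS-000005"
    sample.write_text(
        "  create_missing_column_families=false\n"
        "  create_if_missing=true\n"
        "  unrelated_option=1\n",
        encoding="utf-8",
    )
    found = grep_options(sample, "missing")

for line in found:
    print(line)
```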
If you are further interested in the internals, you may visit this Wiki page.
Similarly, you may delete a key.
db.delete(b"a")
print(db.get(b"a").decode("utf-8"))
Upon running, it outputs the following error:
root@9f7d3fc73b74:/code/test.db# /usr/local/bin/python /code/main.py
ROFL
Traceback (most recent call last):
  File "/code/main.py", line 8, in <module>
    print(db.get(b"a").decode("utf-8"))
AttributeError: 'NoneType' object has no attribute 'decode'
root@9f7d3fc73b74:/code/test.db#
Since the key a was already removed, db.get returned None, and calling decode on None raised the exception. The C++ and Java APIs also provide TtlDB, which lets you set an expiry for keys, a feature that you can use to turn RocksDB into a web cache. Unfortunately, it is still not available in the Python bindings.
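Both points can be worked around in application code. The sketch below shows a get helper that tolerates missing keys, and a very rough TTL emulation that stores an expiry timestamp in front of the value. None of this is TtlDB or part of python-rocksdb: the function names and the 8-byte timestamp prefix are my own assumptions, and InMemoryDB is a hypothetical stand-in for rocksdb.DB so that the snippet runs without RocksDB installed.

```python
import time


class InMemoryDB:
    """Hypothetical stand-in that mimics put/get/delete of rocksdb.DB."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)  # None when the key is missing

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)


def get_text(db, key: bytes, default=None):
    # Check for None before decoding to avoid the AttributeError.
    raw = db.get(key)
    return default if raw is None else raw.decode("utf-8")


def put_with_ttl(db, key: bytes, value: bytes, ttl_seconds: float, now=None) -> None:
    # Assumed layout: 8-byte big-endian expiry timestamp, then the value.
    now = time.time() if now is None else now
    expires_at = int(now + ttl_seconds)
    db.put(key, expires_at.to_bytes(8, "big") + value)


def get_with_ttl(db, key: bytes, now=None):
    raw = db.get(key)
    if raw is None:
        return None
    now = time.time() if now is None else now
    if now >= int.from_bytes(raw[:8], "big"):
        db.delete(key)  # expired entries are removed lazily, on read
        return None
    return raw[8:]


db = InMemoryDB()
db.put(b"a", b"ROFL")
db.delete(b"a")
missing = get_text(db, b"a", default="<missing>")

put_with_ttl(db, b"page", b"<html>", ttl_seconds=60, now=1000)
fresh = get_with_ttl(db, b"page", now=1030)   # still within 60 seconds
stale = get_with_ttl(db, b"page", now=1060)   # expired, entry is deleted
```

Unlike a real TTL feature, expired entries here only disappear when someone reads them, so this is closer to a lazy cache than to TtlDB.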
Conclusion
In this post, I introduced RocksDB as a key/value store. RocksDB also serves as the storage engine for DB systems like ArangoDB, MyRocks (a MySQL storage engine based on RocksDB), CockroachDB, and others. You may use it as a replacement for Memcached if you are using flash storage devices. Here is a comprehensive list of RocksDB usage.
Originally published at http://blog.adnansiddiqi.me on October 3, 2022.