Hands-On with PostgreSQL Internals
As part of the university outreach efforts at Hewlett Packard Enterprise, a team of engineers from the SQL/MX database group within the HPE NonStop Servers division conducted a one credit course on database internals at PSG college, Coimbatore in August 2016. The course was attended by about 45 students from the computer science department.
HPE NonStop servers ship with a proprietary operating system and associated software development stacks. These servers (formerly Tandem computers) were built for transaction processing and hence they have always shipped with a database management system. The current avatar of the RDBMS that ships with NonStop servers is called NonStop SQL/MX. Being members of a group that owns a database management system, our natural choice for a one credit course was on introducing the students to the idea that work on relational database engines is seeing a revival thanks to the possibilities exposed by the new problems that are yet to be solved for the era of big data.
Most undergraduate programs in the computer science stream offer only an introductory course on relational databases that cover data modeling, query languages and concepts of transaction processing. The challenges of building a robust query compiler, the algorithms used for relational algebra operator evaluation and the strategies for query execution are generally part of a master’s level course on database internals. There are very few groups in the academia in India that focus on database internals. The best and most recognized groups in India are the database systems groups at IISc, Bangalore and IIT, Bombay.
Given that most of the course attendees would not have the necessary background on database internals, our focus was on the following:
- Introduce one aspect of database internals that can be grasped in a few short hours. We selected query cost estimation using histograms.
- Get the participants familiar with the source code of one open source database management system. The idea was that once they are comfortable working with the source code, they would have the necessary skills to get into the details of the database system implementation if and when they wanted to extend the database capabilities for a future project. In doing this, we had the following sub goals:
- Walk the participants through the steps of compiling and installing open source software.
- Introduce the basic steps in understanding execution flow via the debugger.
- Use built in regression tests that come packaged with open source software.
- Extending the database engine’s capabilities.
A colleague of mine handled query cost estimation and I volunteered for the second part. I started with little knowledge of PostgreSQL internals and was delightfully surprised with the instructive content available online for anyone wishing to understand PostgreSQL internals. I also had a serendipitous meeting with a friend of mine who used to work at Microsoft (https://www.linkedin.com/in/cvishnuprasad/). He convinced me to use a new tool from Microsoft called Microsoft Code. Code is a cross platform IDE that works beautifully on Linux. I found it much better than Eclipse for the task I was working on. I prepared a virtual machine with everything necessary for the participants to get started with a collection of PostgreSQL exercises I had prepared. At the end of the six exercises, the participants were expected to be comfortable with the following:
- Building, installing and starting PostgreSQL database server on a Linux system.
- Running built-in tests that validate fundamental features of PostgreSQL.
- Walking though the execution of simple SQL statements via the debugger.
- Inspecting the output of the stages of the SQL compilation via the debugger.
- Modifying an existing built in function of PostgreSQL and rebuilding PostgreSQL after the change.
- Adding your own custom function as a built-in function of PostgreSQL
One the virtual machine was made ready, the good folks at the CSE department at PSG college downloaded the virtual machine image from my cloud storage and had it deployed on all the computers to be used for the hands-on session. A previous visit from my team mate helped iron out the system requirements for the virtual machine.
The hands-on session with PostgreSQL was well received and subsequently, the department requested for project ideas on the same topic that the participants could work on as part of the same one credit course. I assembled a set of projects that were doable with the knowledge gained in the hands-on session. The projects were on integrating ideas from other domains like encryption, text analytics etc into the database engine rather than in extending the query compiler or other core database engine areas. The idea was to give the participants the confidence that they can implement an idea/algorithm and get it integrated with a popular database engine if they needed it. Extending the core of the database engine at the student level is more suited to a master’s level course on database internals.
If you would like to try the exercises and the projects yourself, you can download the Ubuntu Linux based virtual machine (VMWare based) and the course material (two pdf documents – one with the exercises and the other with the projects) from the following Microsoft OneDrive location. The folder also contains some relevant papers that we had distributed to the course participants. The compressed virtual machine image available for download is 4GB in size. The virtual machine is ready to use for the exercises and the projects. The instructions to login and get started are in the document containing the hands-on labs.
Please drop in your suggestions as comments to this post.