Scientific Databases for Relational Addicts
The Scientific Database for Relational Addicts
There is a surge of interest in scientific databases. The main theme of most of the recent publications is the special requirements of scientific databases that are not met by current mainstream databases. The argument in favour of building a special type of DBMS are typically:
1) Scientists have lots and lots of data (Peta bytes)
2) They want to use it!
3) They want to use ALL of it at the same time
4) They cannot store their data always in tables
5) They may need to know how their data is stored
6) They are interested in how their data came into being and how it evolves.
7) Scientists cooperate with each other!
Let us look at each of the arguments as all of them are very true and valid requirements. These are breathing in a new wave of research and development in the database field and promises exciting times ahead.
Scientists have lots and lots of data:
Yes! The scale of data we are talking about dwarfs even the largest transactional systems today. Why is this so? Current transactional systems, depend on human activity for record generation. Even a stock exchange which generates terabytes of data on a daily basis is guided by the volume of trades that are either fired manually or through automated trading systems that have a coarse granularity in terms of trades per hour. On the other hand, data generated by scientists is typically from equipment that generate data at much faster rate and in high volumes. Compare the data generated at stock exchange with that generated by an infrared telescope pointed permanently at the sky!
They want to use it!
The lifespan of relational transactional data is usually short though we store the data for a long time due to regulatory requirements. Take for example the data generated for a single purchase order made online. A record may be created for the order in an orders table with corresponding records in the order details, shipping, customer and inventory tables. These records may be queried or updated for the duration when the order is in pending status. Once the item is delivered, the order would be marked closed and is then relegated to a data warehouse for further analysis if the store maintains a data warehouse for analytics. It is very rare for these order records to be queried at a later point in time. The situation is different for scientist who is looking at a database of, say, weather pattern records. Almost all the data that has been captured in the past is relevant for the scientist and might, in fact, be used in validating a single model that the scientist is trying to prove. In this sense, they use the data like a traditional data warehouse, but without the aggregates and the cubes.
They want to use it ALL at the same time
Scientific invention or discovery is a process of experimentation followed by analysis of the results of the experiments. Scientists first capture data (this is their experiment) and then they might want to use all of the data captured in some analysis. Given the time taken for some of the experiments and the associated costs involved, it should be possible to analyze the data as it is being generated. If an anomaly is detected, they would like to change the parameters of their experiment and restart it before waiting for the experiment to complete. This is akin to the software engineering concept of detecting errors early in the development life cycle. In contrast, a banker would rarely look at all the transactions on a given day though such reports are generated and the numbers are tallied. Stream databases can provide a solution to the above problem but they would not be able to solve the rest of the requirements.
They cannot store their data in tables
The relational model was developed at a time when the target audience were business users. A tabular format is the ideal way to look at business data. The PC market really took off with the advent of the spreadsheet application. (Watch Triumph of the Nerds by Robert X. Cringley to know more about the fascinating story of the PC). One may argue that all experiments that we did at school and college involved data that was also tabulated in neat tables with the eventual discrepancy between expected and observed values! In a relational database ALL data can be stored in logical tables. A scientific database, on the other had, may need more than one logical structure to store data. To understand why scientists need structures other than tables, consider two scenarios:
Scenario 1: Spectral Data: The output from, say an NMR scanner(or any other detector for that matter) is best viewed as a graph or a plot. In spectroscopy this is called as a spectral graph. Though the graph is a plot of discreet data values which can be stored in tables , one rarely views the information as numbers. It is almost always seen as a graph. One of the founding principles of relational databases is that the data and presentation of data must be decoupled. The rationale for this is that the database management system must be left with the responsibility of storing the data and application programs could interpret the data in a manner that makes it meaningful to the user. What if I now want to store the application to display the spectral data as a graph along with the data so that the only way to view the data is as a graph? The relational model does not support this on principle. Object oriented databases would support the same but that is a different story that is outside the scope of this article.
Scenario 2: What we view as a table can also be viewed as a vector/matrix by a mathematician. In fact, the foundation for the relational model is based on tuple relational calculus that defines a table as a set of tuples where each tuple is a vector of related data items. Mathematicians are in connivance with scientists from all disciplines and they like, what they call, n-dimensional vectors. The coordinate systems we learn in school are 2 and 3 dimensional. Many naturally occurring phenomena can apparently be studied using n-dimensional algebra. A table is, strictly speaking, a set of vectors. The dimensions of the vector are defined by the columns of the table. It is difficult for computer scientists who have been working with the relational model for a long time to visualize a table that is multidimensional. The multidimensional data models and cubes used for OLAP are a close but incomplete approximation of what scientists want. It is different because in a scientific database they are not storing aggregate data in the different dimensions as in a OLAP cube. The above scenarios are not complete and there are many different ways in which scientists need to store data. The BLOB data type could be used for unstructured data but again, the BLOB data type has its own limitations including the fact that most database systems cannot index a BLOB based column. This presents a serious limitation with respect to performance specially since the amount of data tends to be quite large in scientific databases.
They may need to know how their data is stored!
Scientific data has a structure that is sometimes inherent to the way we look at data. A three dimensional mesh is one such example. Once the relational model captured the business data processing market, it also implicitly controlled and decided how the world stored its data. From the point of view of a database user, data is stored in tables. Internally, the storage takes many exotic forms. Some vendors store table data in disk blocks(Oracle) while there are also others that store each table in its own disk file. This is called Physical data independence in database parlance. Given the complex structure of scientific data, sometimes it becomes inevitable that the user of the database must be aware of the way the data is actually stored in the database. This requirement could break one of the fundamental advantages of the relational model and effort must be put in to try and prevent the same without sacrificing the data processing needs of the scientific community.
They would like to know how their data came into being and how it evolves
In relational databases, data quality is conveniently left to the user. The database management system can make no promises about the quality of data and this has lead to a lucrative market for data quality products. This is serious business in the scientific world. A classic case in point is the recent story about the UN report on climate change being based on the speculation of a scientist in the opinion column of a magazine. The report was based on the premise that global warming was high enough that the Himalayan Glaciers would all melt away by the year 2035. The UN report used this piece of information as fact while it was mere speculation of a scientist made during a telephonic interview! In the case of a typical relational database, we often use extra columns in tables to denote when the row was created, updated and the user that performed the said changes. In cases where more information is needed, we explicitly create audit tables to audit all changes to a particular table. The reason that this is not the default behaviour is because it imposes an overhead for normal transaction processing.
In contrast, in scientific databases, each piece of information must necessarily be annotated with various bits of information. There is also a type of database called ‘curated databases’ which are typically manually annotated by teams of scientists who have domain information about the data. This is a human intensive task and these databases are typically versioned at the database level. For scientific databases what we need is an automatic provenance management system. Provenance includes two aspects. Workflow Provenance describes how derived data has been generated from raw observational data. Dataflow provenance describes how data has moved through a network of databases. Relational pundits may ask why derived data is stored in a database in the first place. In scientific discovery, the inference from raw observational data, though derived, still needs to be stored! Another interesting aspect of scientific databases is that a piece of information used in a computation may itself be pulled from another database. This information needs to be recorded. For example, the table of isotopes lists the spectral peaks for all radioactive isotopes detected so far. This information is used in disciplines that involve spectral studies of radioactive elements. Provenance management is, perhaps, something that could be added to mainstream relational databases too as many other systems could benefit from it as well. One example is version control systems.
Scientists co-operate with each other
You do not expect bank A to share its customer data with bank B. We definitely hope not! The business software has evolved in silos even within the same organization. Technologies like web services, XML, SOAP etc are trying to bridge the gap between systems and make software systems talk to one another. However, it is not uncommon for bio science researcher in the United States to access data that is maintained by a chemistry department in Switzerland and a protein database maintained in India. Sharing of information is fundamental to scientific progress and scientists the world over know this and practice it. In contrast, a bank, pretty much generates all the data they need via their regular operations. Today, it requires considerable human effort to configure and maintain a truly global heterogeneous federated database where data flows seamlessly from one system to another. Federated databases provide fresh challenges in the areas of distributed transaction processing and metadata management.
Lessons from History
There are lessons that we can learn from the history of relational databases. Two important events have happened that are of interest in this context.
The first is the genesis of the relational model. This happened essentially as an outcome of path breaking work by a brilliant computer scientist, E.F, Codd. He essentially conceptualized the relational model and also suggested a set of 12 rules that can be used to determine if a database implementation can be considered relational. The relational model was a complete definition of a method to store and access data. It suited the needs of most organizations and has stood the test of time. The 12 rules of relational database, provided flexibility and also imposed constraints on what can be called a relational database. Considering the time at which these were proposed, they provided a blueprint for users to model their data requirements for the relational model that is valid even today.
The second important event is the development of Enterprise resource planning (ERP) systems. This did not have as much ‘computer science’ research going into it as research into the best practices of each industry and all common business functions of an enterprise such as finance, payroll, staffing etc.. This provided a packaged solution of best industry practices and many organizations were happy to even alter their existing workflows to adopt an ERP system. At the backend of an ERP system is a giant database that hold all the data that is captured in the regular functioning of an organization. On top of this data, there are workflows that model the actual physical processes of the organization. These systems, though packaged, provide a certain degree of customization. For example, every automobile manufacturer would have some processes that are similar while maintaining some unique processes (often patented!) of their own. ERP systems provided organizations worldwide with a business reason to adopt a information system and in effect a database system to store their organizational data. This lead to considerable investment in the database field that fueled innovation in product development and research.
What can be done
What is the relevance of these events to scientific databases? There is a growing number of articles and discussions about the need to evolve the relational model to support these needs. The opinion on whether the relational model should support these needs is divided. What, we however need are
- A data model or set of models for scientific data
- A common universally accepted interface for scientific equipment
- Definition of the notion of a transaction
- A workflow definition for scientific experiments
- A way to talk to the database without an interpreter! (the database language)
Scientists have many things they want to store in the database. These include raw observational data, graphs, three dimensional meshes, n-dimensional data, formulae etc.. On top of this data they need data provenance. Can we come up with a unifying all encompassing data model that would satisfy the above requirements? An important aspect here is a new data model might need a minor change in the way scientists look and use data.
The second challenge is in making all scientific equipment talk the same language when it comes to exchanging data. The adoption of standards such as HL7 and DICOM has made allowed the possibility of interoperability of medical imaging equipment from different vendors. It is time for all kinds of scientific equipment to adopt a similar approach. Imagine a function generator that is connected to an oscilloscope that is in turn connected to a particle detector where the final observations are directly sent to a network port to be received by a software service that adds the relevant provenance data and pushes the data to a common database!
Traditionally workflows were not part of the database and with good reason. The role of the database is to store and retrieve data. Workflows are built on top the database layer. What we need however, is an ERP like system for scientific databases. The processes will not be common across the field just as the processes in finance are different from payroll. We can, however, attempt modeling exercise for a particular field. I would for example, like to see if we can define experiments in fiber optics as workflows that can be guided by a software system. What we need is for the database system to support the features and functionality of Matlab’s Simulink.
A transaction in a relational database is well defined based on the ACID properties(Atomicity, Consistency, Isolation and Durability).. Is this definition valid for scientific databases? The workloads on scientific databases are different and current literature seems to suggest that most scientific databases tend to have large number of inserts while being low on update and delete operations. If we bring provenance into the picture an update would mean the creation of a new version of the data item in many cases. We might need to redefine the notion of consistency due to the addition of provenance information inside the database.
Finally, would SQL suffice as the default interface to the database. SQL is declarative and it may not be suitable to the more exotic formats needed to scientific databases. The fundamental challenge here is that scientists are sophisticated users who may not be willing to spend considerable effort in learning a new programming language. What we need is a language that is intuitive to the way a scientist looks and ‘feels’ data!
These are exciting times! There is a critical mass of users who have a need that needs to be addressed by the database community! In a rush to solve this new need let us not alter something that has stood the test of time. Let us use the lessons we have learned well and maybe its not a bad idea to reinvent the wheel in some cases! Scientific databases pose novel challenges such as provenance, complex data types and distributed processing. The database community has focused on all these issues in the past. The work on these topics were carried out independently and the need for these features was never as urgent as it is now. Perhaps, it is time to bring together research from various branches of the database community and address the needs of the scientific community as a whole.
- Anastasia Ailamaki, Verena Kantere, Debabrata Dash, “Managing Scientific Data”, Communications of the ACM, Vol. 53 No. 6, Pages 68-78, http://doi.acm.org/10.1145/1743546.1743568
- Buneman, P., Chapman, A., and Cheney, J. 2006. Provenance management in curated databases. In Proceedings of the 2006 ACM SIGMOD international Conference on Management of Data (Chicago, IL, USA, June 27 – 29, 2006). SIGMOD ’06. ACM, New York, NY, 539-550. http://doi.acm.org/10.1145/1142473.1142534
- Codd, E.F. (1970). “A Relational Model of Data for Large Shared Data Banks”. Communications of the ACM 13 (6): 377–387. http://doi.acm.org/10.1145/362384.362685
- Ordonez, C., Chen, Z., and García-García, J. 2007. Metadata management for federated databases. In Proceedings of the ACM First Workshop on Cyberinfrastructure: information Management in Escience (Lisbon, Portugal, November 09 – 09, 2007). CIMS ’07. ACM, New York, NY, 31-38. http://doi.acm.org/10.1145/1317353.1317361
- Ioannidis, Y. E., Livny, M., Ailamaki, A., Narayanan, A., and Therber, A. 1997. Zoo: a desktop experiment management environment. In Proceedings of the 1997 ACM SIGMOD international Conference on Management of Data (Tucson, Arizona, United States, May 11 – 15, 1997). J. M. Peckman, S. Ram, and M. Franklin, Eds. SIGMOD ’97. ACM, New York, NY, 580-583. http://doi.acm.org/10.1145/253260.253415
- ‘Codd’s 12 rules of relational databases’, http://www.cse.ohio-state.edu/~sgomori/570/coddsrules.html
- ‘World mislead over Himalayan Glacier Meltdown’, http://www.timesonline.co.uk/tol/news/environment/article6991177.ece
- ‘Enterprise Resource Planning Systems’, http://en.wikipedia.org/wiki/Enterprise_resource_planning
- ‘Interoperability of Health Information Technology’, http://www.hl7.org/