I found this cable in this state in a meeting room today…
Does this qualify as deliberate art or a passionate expression of incredible boredom during a previous meeting???
I found this cable in this state in a meeting room today…
Does this qualify as deliberate art or a passionate expression of incredible boredom during a previous meeting???
Posted in Uncategorized | 2 Comments »
My father knew the difference between the knowing the name of something and knowing something
-Richard Feynman in “The Pleasure of finding things out”
Posted in Uncategorized | Tagged Knowledge, Richard Feynman | Leave a Comment »
| Amateurs built the Ark… Professionals built the Titanic…
~author unknown
|
|
| |
|
Image sources:
http://www.garvandwane.com/images/noahs_ark.jpg
Posted in Uncategorized | Tagged quote | Leave a Comment »
Murphy’s law of Tennis –
This is the only explanation for a ball that meets the net that would otherwise sail over it. All the laws of motion indicate that the ball should reach the other court but the law of attraction between the net and the ball has the final word!
Posted in Uncategorized | Tagged Murphy's law, Tennis | Leave a Comment »
We have been living in a world where we believe that “Information is Power”.
Kofi Annan, while addressing the Global Knowledge Conference at Toronto in 1997 said, “Knowledge is power. Information is liberating. Education is the premise of progress, in every society, in every family.”
I would like to live in world where information is NOT power. Information and knowledge should be shared and real power should lie in the use and interpretation of information.
Image source: http://www.misconceptionjunction.com/wp-content/uploads/2010/10/brain-763982-1.jpg
Posted in Uncategorized | 1 Comment »
Sankey tank is a beautiful artificial lake in Bangalore, India. This is a HDR processed image taken using a Canon Powershot A630 and CHDK.
Posted in Uncategorized | 1 Comment »
A traditional database stores information about objects of interest in the application world. A data model provides a framework to describe entities of interest. Each entity would have attributes that uniquely define an instance of the entity. The role of a data model is to provide storage and access mechanisms for efficient storage and retrieval mechanisms for the end user. The relational model proposed by E.F. Codd [2] has gained widespread acceptance. The initial adoption of this model was to represent static real world objects and processes. Examples of these are Enterprise Resource Planning (ERP) systems and Customer Relationship Management (CRM) Systems. The adoption of the relational model was accentuated by the SQL standard and its ease of use.
A standard relational database has four broad functional components. They are the Storage and Access manager, the query processing engine, the transaction manager and the user access manager. One of the important roles of the query processing engine is to find an evaluation plan for a given query that would incur the minimum cost. The dominating factor affecting query cost is the number of disk reads that would be needed to fetch all the records satisfying a given query. The other factors include the size of the database buffer in memory, the load (number of concurrent queries) on the database at the given point in time, the distribution of data on disk etc. The traditional approach to improve query performance is to build database indexes that provide predictable access times in the event that an index can be used to execute a query. The B+ tree which is a variant of the B tree proposed by Rudolph Bayer [1] has been the index structure of choice in relational databases. The other index structure that is used is the hash index that uses a hash table to index and retrieve records. The B+ tree is a height balanced tree which is designed in a manner such that each node is at least half full and a node size corresponds to the size of data returned by one disk read (database blocks). The B+ tree is suitable for answering range as well as point queries on one dimensional data.
Many traditional applications of database systems need to model only one dimensional data. There are, however, an increasing number of applications that need to model and query multi dimensional data. Consider the following three successively interesting queries: (assuming we have information about the names and address of all book stores in Bangalore)
The first query can be answered by a traditional relational database. The second query, however, requires information about the user’s current location. Given the user’s current location, the query translates into a search for points in a circular geographical area marked by the 2 km radius. Databases that support such queries are called spatial databases. These databases support operations on geometrical shapes. The geometrical shapes are n-dimensional hyper rectangles. The third query not only requires information about the user’s current location but also a prediction of the user’s location at a future point in time (10 minutes from the current time). To answer the third query, we not only need spatial information about the user and the book stores, but also information that depends on time. In this case, the location of the user changes with time and is hence a function of time. Databases that support queries of the third kind are called spatiotemporal databases.
The next post will explore querying on Spatial and Spatiotemporal databases.
Posted in Uncategorized | Tagged Relational Database, Spatial Database, Spatiotemporal Database | Leave a Comment »
HDR stands for High Dynamic Range photography. The amazing thing for me is the fact that our human eye sees in HDR! No camera today can take a single picture that can capture the brightest and darkest areas of a scene with equal lighting. HDR is a classic case where the sum is much much better than the individual parts. HDR is a technique where we take multiple exposures(read photos) of the same scene with different exposures. Each shot is taken at a different exposure. In this manner each photograph would capture different parts of a scene with the correct amount of lighting. We then merge all the photographs using a HDR merging tool. There is a long way to go before I fully understand this art. I am halfway through the tutorial at http://www.stuckincustoms.com/hdr-tutorial/
The following photographs were taken on a weekend drive to Puttaparthi. This small lake is about 60kms from Bangalore. These are taken with my Canon Powershot A630. I use CHDK (http://chdk.wikia.com/wiki/CHDK) for exposure bracketing as the camera does not have built in support for the same. The following are the settings on the CHDK menu and on the camera for these shots:
CHDK:
CHDK Menu->Extra Photo Operations->Bracketing in Continuous Mode
Tv Bracketing Value = 1 Ev, Bracketing Type = +/-
Camera:
Shot Mode – Program (P )
Func/Set->Continuous Shooting Mode->Custom->MENU->Delay=1 Sec,Shots=5
Posted in Uncategorized | Tagged HDR, High Dynamic Range, photography | 3 Comments »
Database normalization is a topic that is discussed much more than it is really understood or practiced. Database normalization is the responsibility of a data modeler. This role is often loosely defined in software projects and more often than not the engineer assigned this task rarely finds the time or the reason to justify the effort on database normalization.
Being interested in database design, I have tried to read and understand every book, paper and article that I could get hold of. Reading multiple sources has actually left me confused! The reason for this is that there are many schools of thought with respect to database normalization. Some use the notion of ‘Functional Dependencies’ while others do not. Many discussions use the notion of an attribute and try to define all normal forms in one line – ‘An attribute depends on the key, nothing but the key and the whole key’. The faction that uses functional dependencies is not consistent in itself. Some say what functional dependencies are allowed while others say what are not! Some use the notion of closure of a set of functional dependencies while others don’t. I have difficulty in visualizing a complex web of interconnections which is what happens if you try to understand everyone’s definition of normal forms. If you are a reader and have a good understanding of normal forms, maybe you should not read mine too (if you are like me). With time, I have come to understand that one of the ways to understand a topic is to try and get into the mind of the engineer/scientist who proposed it originally. I read E.F.Codd and also read and re read C.J. Date. I now think I have understood what they meant and I hope my understanding is correct.
Notional Conventions
In the following passages I use the term table and relation interchangeably. Similarly I use the terms column and attribute interchangeably. They mean the same thing in the context of this article. In the same manner a tuple and a row are identical.
Why normal forms?
Normal forms are a way to develop relational models that avoid unnecessary redundancy of data. The problem with redundant data is NOT the extra storage required. The problem actually is:
I find the second reason rather fascinating! One may argue that the first reason can be offset by extra code but how do we mitigate the second risk?
Consider a relation used to store products and supplier information. The information regarding products and suppliers is stored in a single table as shown below. The table shows the purchase of products from a supplier on different dates.
Products Table
| prod_id | Prod_name | supplier_id | supplier_phone | purchase_date |
| p001 | monitor | samsung01 | 9188888 | Jan – 1 – 2009 |
| p001 | monitor | samsung01 | 9188888 | Feb – 1 – 2009 |
| p002 | printer | samsung01 | 9177777 | Feb – 15 – 2009 |
| p003 | keyboard | logitech01 | 9166666 | Nov – 23 – 2009 |
| p003 | keyboard | microsoft01 | 9155555 | Mar – 15 – 2010 |
Assume the following are true about the real world.
For the above table, the primary key is (prod_id,supplier_id,purchase_date). This means that the combination of the three columns should be unique for every row in the table. Formal definitions of keys follow later in the article.
Given that this is the complete schema, all of the following scenarios would be a ‘problem’
| prod_id | Prod_name | supplier_id | supplier_name | supplier_phone | purchase_date |
| p001 | monitor | samsung01 | ABC trading | 9188888 | Jan – 1 – 2009 |
| p001 | monitor | samsung01 | ABC trading | 9188888 | Feb – 1 – 2009 |
| p002 | printer | samsung01 | ABC trading | 9177777 | Feb – 15 – 2009 |
| p003 | keyboard | logitech01 | First Comp | 9166666 | Nov – 23 – 2009 |
| p003 | keyboard | microsoft01 | Digital Life | 9155555 | Mar – 15 – 2010 |
| p004 | LED Monitor | acer01 | Acer Corp | 9144444 | null |
| p005 | netbook | acer01 | Acer Corp | 9144444 | null |
In trying to add a new supplier, we have added the supplier id and phone number twice – once for each product from the supplier. We have also made the purchase date null. This would not even be allowed in a real scenario as we would want the purchase date to be a non null column. This ‘inability’ to represent the information that acer01 is a supplier of netbooks and LED monitors is the second reason why we need database normal forms alluded to above.
so, redundant data is bad! very bad!
It is sometimes easy to spot data redundancies in design when we are looking at live data as in the example above. What we, however need is a generic method to deduce data redundancies without even looking at sample data. Database normalization is one way to identify potential redundancies in design. Codd originally proposed three normal forms. They are called the First, Second and Third Normal forms. Due to a case not anticipated, the third normal form was revised and proposed as an entirely new normal form called Boyce Codd normal form(BCNF). There are many more normal forms but they are rarely used in practice. Here are the definitions of the normal forms as I would like to remember and use them! However, a few definitions are in order first!
Functional Dependency:
It is easy to understand normal forms using the notion of a functional dependency(FD). A functional dependency is a relationship between two or more columns of a table. Given a relation R and a set of attributes alpha and beta, a functional dependency between alpha and beta is represented as
and is defined as:
Given any two tuples(rows) t1 and t2 of R:
In the products table above, the functional dependencies that need to be valid are:
prod_id –> prod_name
supplier_id –> supplier_name
supplier_id –> supplier_phone
It is obvious to see that functional dependencies model constraints that we would like our data to satisfy. We would not like two different products with the same product id, would we?
Derived Functional Dependencies
Given a set of FDs it is possible to derive more FDs that are logically implied by the original set of FDs. The complete set of FDs is called the closure of a set of FDs. The closure can be a pretty large set and we have a smaller set that is equivalent to the closure called the canonical cover. For purposes of simplicity we will not get into the details of this in this article.
Keys (candidate and primary)
A table should have one or more set of columns that uniquely identify every row in the table. The column or columns that contain unique(not null and different) values for every row of the table is called a key. A key can be a simple key or a compound key. A simple key contains a single attribute while a compound key is an unordered collection of attributes. A table can have one or more such set of attributes that form a key. Each such key is called a candidate key. One of these candidate keys is designated as a primary key. One of the reasons for designating a primary key is to use it in foreign key relationships. I am using the term key to refer to primary and candidate keys.
A key can also be defined as a functional dependency.
Given a key ‘K’ for a relation R, the following functional dependency is true:
K –> R
where R stands for all the attributes of the relation.
In the products and suppliers example above, we have a compound key and that is the only key on the table. In terms of a functional dependency (FD), we can say that the following FD holds
(product_id,supplier_id,purchase_date) –> (supplier_name, prod_name)
Let us add a new column called row_id that would hold a unique value for each row. The table now looks like this:
| row_id | prod_id | Prod_name | supplier_id | supplier_name | supplier_phone | purchase_date |
| 001 | p001 | monitor | samsung01 | ABC trading | 9188888 | Jan – 1 – 2009 |
| 002 | p001 | monitor | samsung01 | ABC trading | 9188888 | Feb – 1 – 2009 |
| 003 | p002 | printer | samsung01 | ABC trading | 9177777 | Feb – 15 – 2009 |
| 004 | p003 | keyboard | logitech01 | First Comp | 9166666 | Nov – 23 – 2009 |
| 005 | p003 | keyboard | microsoft01 | Digital Life | 9155555 | Mar – 15 – 2010 |
| 006 | p004 | LED Monitor | acer01 | Acer Corp | 9144444 | null |
| 007 | p005 | netbook | acer01 | Acer Corp | 9144444 | null |
We now have two keys for the relation.
Key 1: (row_id)
Key 2: (prod_id, supplier_id, purchase_date)
Key 1 is a simple key while key 2 is a compound key. Key 1 and key 2 are called candidate keys. Let us designate Key 1 as the primary key. (It is a good practice to designate a simple key as a primary key).
Key Column and Partial Key Column:
If a candidate key is a simple key, the column is also called a key column. row_id in the example above is a key column. If the candidate key is a compound key, the columns that constitute the candidate key are all called partial key columns. Strictly, any subset of a key is called a partial key. For key 2 in the example above all of the following are partial keys
prod_id
supplier_id
purchase_date
(supplier_id, purchase_date)
(prod_id,purchase_date)
(prod_id,supplier_id)
A column that is not a key column or a partial key column is called a non-key column. In effect all columns can be classified as key, partial key or non-key columns.
The definition of Normal Forms
It is easy to deduce the normal forms based on the definition of a functional dependency(FD). Since the First Normal Form (1NF) does not depend on FDs, here is the definition of 1NF.
First Normal Form:
The domain of every column should be atomic.
This actually excludes user defined data types. It also excludes clubbing the values of one or more columns as a single column. In a strict sense even storing the full name of a person as (firstname,lastname) in a single column is a violation of the first normal form!
The following table violates first normal form because of columns Name and Dependents. The Name column actually stores three things: the Title, First Name and Last Name. The Dependents column is storing the list of dependents and their ages in a single list! The fact that a cat is not a dependent cannot be handled by the relational model!
| SSN | Name | Dependents |
| 778324673 | Mr. Bill,Bryson | (Sara, 32, James,12) |
| 778324456 | Dr. John, Smith | (Molly, 25, Peter,8, Cat, 3) |
Higher Normal Forms
Second Normal From (2NF), Third Normal Form (3NF) and Boyce Codd Normal Form (BCNF) are defined in terms of the functional dependencies they allow on the relation. It was a brilliant idea of Codd to look at functional dependencies. Why? Look at a functional dependency of the form
alpha –> beta
What this says is that whenever alpha is the same(repeats) for two rows, beta would also be the same(repeat). Redundancy creeps in when we have a table that stores alpha, beta and other columns that do not repeat when alpha repeats. We repeat the value of beta every time alpha repeats. This repetition of beta is data redundancy. To avoid this redundancy, what if I can make sure that alpha never repeats in a given table? I can do that simply by making sure that alpha is a key of the table! This can be done if I create a table that contains only the columns of alpha and beta. This is the main aim of database normalization!
Functional Dependency Templates
Given the FD alpha->beta, what are the possible types that alpha and beta can have?
alpha could be simple or composite
beta could be simple or composite
We can ignore the case where beta is composite because it can be rewritten as multiple FDs where alpha is the same and each element of beta forms the dependent attribute of the FD.
For ex, AB->CD is equivalent to AB->C and AB->D.
Since a column can only be a key, a partial-key or a non-key, the only FDs possible are of the form
Based on the definitions of a FD and a key, 8 and 9 are not possible! Similarly 6 is not possible because, in that case the partial-key would also become a key. From the point of view of avoiding data redundancies, FDs 1,2 and 3 are ideal since we are guaranteed that a key would not repeat within a table. Removing the FDs that are not possible and the FDs 1,2 and 3 since they do not contribute to data redundancies, we now have only the following ‘undesirable’ forms
A table should not have any FDs of the above forms . However, we cannot wish them away! A functional dependency models a real world constraint and our database design must be able to enforce that. 2NF, 3NF and BCNF each preclude one of the above types of FDs existing on a relation. It is also important to note that the normal forms are defined in a an ordered fashion. For a table to satisfy the conditions of 3NF it must be in 1NF and 2NF. Similarly for a table to satisfy BCNF it must be in 1NF, 2NF and 3NF.
What if a table satisfies the conditions for 2NF and violates the condition for 3NF? The table is said to be in 2NF and a decomposition method is defined to convert the table to a set of tables each of which would satisfy the conditions of 3NF. Decomposition algorithms are specified for all three normal forms – 2NF, 3NF and BCNF. It i snow easy to define the three remaining normal forms
Second Normal Form (2NF)
A table is in 2NF if it is in 1NF and it has no FD of the form
partial-key –> non-key
Third Normal Form (3NF)
A table is in 3NF if it is in 2NF and it has no FD of the form
non-key –> non-key
Boyce Codd Normal Form (BCNF)
A table is in BCNF if it is in 3NF and it has no FD of the form
partial-key –> partial-key
As mentioned earlier, if a table violates a normal form, we use a decomposition algorithm to decompose the table into one or more tables. This decomposition algorithm is different for each normal form. I will discuss decomposition and related issues in a separate article.
A parting note though.. On what set of FDs should we check the above conditions? Ideally it should be on the closure of the original set of FDs. The closure can be computed from the original set by repeated application of the Armstrong’s axioms for functional dependencies. Again, I am not discussing Armstrong’s axioms here. The closure could be a large set and there are ways to handle this too. I Will discuss them too in a separate article. The purpose of this article is to help quickly identify if a given FD violates a specified normal form. If that purpose is satisfied, the purpose of writing this is satisfied.
Conclusion
Database normal forms help us in removing data duplication in relational databases. Data duplication is bad because it makes insert, update and delete operations complicated if we want to maintain data consistency. Database normal forms help us identify potential data duplication and help us remove them. Normal forms use the notion of Functional Dependencies which are a kind of dependency between the columns of a table. Each normal form precludes certain FDs on a table. If an FD violates a normal form, we do not throw it away. We decompose the table into more table so that each of the decomposed tables satisfy our desired normal forms. This decomposition has some side effects that are not discussed here.
Posted in Uncategorized | 1 Comment »
The Scientific Database for Relational Addicts
There is a surge of interest in scientific databases. The main theme of most of the recent publications is the special requirements of scientific databases that are not met by current mainstream databases. The argument in favour of building a special type of DBMS are typically:
1) Scientists have lots and lots of data (Peta bytes)
2) They want to use it!
3) They want to use ALL of it at the same time
4) They cannot store their data always in tables
5) They may need to know how their data is stored
6) They are interested in how their data came into being and how it evolves.
7) Scientists cooperate with each other!
Let us look at each of the arguments as all of them are very true and valid requirements. These are breathing in a new wave of research and development in the database field and promises exciting times ahead.
Scientists have lots and lots of data:
Yes! The scale of data we are talking about dwarfs even the largest transactional systems today. Why is this so? Current transactional systems, depend on human activity for record generation. Even a stock exchange which generates terabytes of data on a daily basis is guided by the volume of trades that are either fired manually or through automated trading systems that have a coarse granularity in terms of trades per hour. On the other hand, data generated by scientists is typically from equipment that generate data at much faster rate and in high volumes. Compare the data generated at stock exchange with that generated by an infrared telescope pointed permanently at the sky!
They want to use it!
The lifespan of relational transactional data is usually short though we store the data for a long time due to regulatory requirements. Take for example the data generated for a single purchase order made online. A record may be created for the order in an orders table with corresponding records in the order details, shipping, customer and inventory tables. These records may be queried or updated for the duration when the order is in pending status. Once the item is delivered, the order would be marked closed and is then relegated to a data warehouse for further analysis if the store maintains a data warehouse for analytics. It is very rare for these order records to be queried at a later point in time. The situation is different for scientist who is looking at a database of, say, weather pattern records. Almost all the data that has been captured in the past is relevant for the scientist and might, in fact, be used in validating a single model that the scientist is trying to prove. In this sense, they use the data like a traditional data warehouse, but without the aggregates and the cubes.
They want to use it ALL at the same time
Scientific invention or discovery is a process of experimentation followed by analysis of the results of the experiments. Scientists first capture data (this is their experiment) and then they might want to use all of the data captured in some analysis. Given the time taken for some of the experiments and the associated costs involved, it should be possible to analyze the data as it is being generated. If an anomaly is detected, they would like to change the parameters of their experiment and restart it before waiting for the experiment to complete. This is akin to the software engineering concept of detecting errors early in the development life cycle. In contrast, a banker would rarely look at all the transactions on a given day though such reports are generated and the numbers are tallied. Stream databases can provide a solution to the above problem but they would not be able to solve the rest of the requirements.
They cannot store their data in tables
The relational model was developed at a time when the target audience were business users. A tabular format is the ideal way to look at business data. The PC market really took off with the advent of the spreadsheet application. (Watch Triumph of the Nerds by Robert X. Cringley to know more about the fascinating story of the PC). One may argue that all experiments that we did at school and college involved data that was also tabulated in neat tables with the eventual discrepancy between expected and observed values! In a relational database ALL data can be stored in logical tables. A scientific database, on the other had, may need more than one logical structure to store data. To understand why scientists need structures other than tables, consider two scenarios:
Scenario 1: Spectral Data: The output from, say an NMR scanner(or any other detector for that matter) is best viewed as a graph or a plot. In spectroscopy this is called as a spectral graph. Though the graph is a plot of discreet data values which can be stored in tables , one rarely views the information as numbers. It is almost always seen as a graph. One of the founding principles of relational databases is that the data and presentation of data must be decoupled. The rationale for this is that the database management system must be left with the responsibility of storing the data and application programs could interpret the data in a manner that makes it meaningful to the user. What if I now want to store the application to display the spectral data as a graph along with the data so that the only way to view the data is as a graph? The relational model does not support this on principle. Object oriented databases would support the same but that is a different story that is outside the scope of this article.
Scenario 2: What we view as a table can also be viewed as a vector/matrix by a mathematician. In fact, the foundation for the relational model is based on tuple relational calculus that defines a table as a set of tuples where each tuple is a vector of related data items. Mathematicians are in connivance with scientists from all disciplines and they like, what they call, n-dimensional vectors. The coordinate systems we learn in school are 2 and 3 dimensional. Many naturally occurring phenomena can apparently be studied using n-dimensional algebra. A table is, strictly speaking, a set of vectors. The dimensions of the vector are defined by the columns of the table. It is difficult for computer scientists who have been working with the relational model for a long time to visualize a table that is multidimensional. The multidimensional data models and cubes used for OLAP are a close but incomplete approximation of what scientists want. It is different because in a scientific database they are not storing aggregate data in the different dimensions as in a OLAP cube. The above scenarios are not complete and there are many different ways in which scientists need to store data. The BLOB data type could be used for unstructured data but again, the BLOB data type has its own limitations including the fact that most database systems cannot index a BLOB based column. This presents a serious limitation with respect to performance specially since the amount of data tends to be quite large in scientific databases.
They may need to know how their data is stored!
Scientific data has a structure that is sometimes inherent to the way we look at data. A three dimensional mesh is one such example. Once the relational model captured the business data processing market, it also implicitly controlled and decided how the world stored its data. From the point of view of a database user, data is stored in tables. Internally, the storage takes many exotic forms. Some vendors store table data in disk blocks(Oracle) while there are also others that store each table in its own disk file. This is called Physical data independence in database parlance. Given the complex structure of scientific data, sometimes it becomes inevitable that the user of the database must be aware of the way the data is actually stored in the database. This requirement could break one of the fundamental advantages of the relational model and effort must be put in to try and prevent the same without sacrificing the data processing needs of the scientific community.
They would like to know how their data came into being and how it evolves
In relational databases, data quality is conveniently left to the user. The database management system can make no promises about the quality of data and this has lead to a lucrative market for data quality products. This is serious business in the scientific world. A classic case in point is the recent story about the UN report on climate change being based on the speculation of a scientist in the opinion column of a magazine. The report was based on the premise that global warming was high enough that the Himalayan Glaciers would all melt away by the year 2035. The UN report used this piece of information as fact while it was mere speculation of a scientist made during a telephonic interview! In the case of a typical relational database, we often use extra columns in tables to denote when the row was created, updated and the user that performed the said changes. In cases where more information is needed, we explicitly create audit tables to audit all changes to a particular table. The reason that this is not the default behaviour is because it imposes an overhead for normal transaction processing.
In contrast, in scientific databases, each piece of information must necessarily be annotated with various bits of information. There is also a type of database called ‘curated databases’ which are typically manually annotated by teams of scientists who have domain information about the data. This is a human intensive task and these databases are typically versioned at the database level. For scientific databases what we need is an automatic provenance management system. Provenance includes two aspects. Workflow Provenance describes how derived data has been generated from raw observational data. Dataflow provenance describes how data has moved through a network of databases. Relational pundits may ask why derived data is stored in a database in the first place. In scientific discovery, the inference from raw observational data, though derived, still needs to be stored! Another interesting aspect of scientific databases is that a piece of information used in a computation may itself be pulled from another database. This information needs to be recorded. For example, the table of isotopes lists the spectral peaks for all radioactive isotopes detected so far. This information is used in disciplines that involve spectral studies of radioactive elements. Provenance management is, perhaps, something that could be added to mainstream relational databases too as many other systems could benefit from it as well. One example is version control systems.
Scientists co-operate with each other
You do not expect bank A to share its customer data with bank B. We definitely hope not! The business software has evolved in silos even within the same organization. Technologies like web services, XML, SOAP etc are trying to bridge the gap between systems and make software systems talk to one another. However, it is not uncommon for bio science researcher in the United States to access data that is maintained by a chemistry department in Switzerland and a protein database maintained in India. Sharing of information is fundamental to scientific progress and scientists the world over know this and practice it. In contrast, a bank, pretty much generates all the data they need via their regular operations. Today, it requires considerable human effort to configure and maintain a truly global heterogeneous federated database where data flows seamlessly from one system to another. Federated databases provide fresh challenges in the areas of distributed transaction processing and metadata management.
Lessons from History
There are lessons that we can learn from the history of relational databases. Two important events have happened that are of interest in this context.
The first is the genesis of the relational model. This happened essentially as an outcome of path breaking work by a brilliant computer scientist, E.F, Codd. He essentially conceptualized the relational model and also suggested a set of 12 rules that can be used to determine if a database implementation can be considered relational. The relational model was a complete definition of a method to store and access data. It suited the needs of most organizations and has stood the test of time. The 12 rules of relational database, provided flexibility and also imposed constraints on what can be called a relational database. Considering the time at which these were proposed, they provided a blueprint for users to model their data requirements for the relational model that is valid even today.
The second important event is the development of Enterprise resource planning (ERP) systems. This did not have as much ‘computer science’ research going into it as research into the best practices of each industry and all common business functions of an enterprise such as finance, payroll, staffing etc.. This provided a packaged solution of best industry practices and many organizations were happy to even alter their existing workflows to adopt an ERP system. At the backend of an ERP system is a giant database that hold all the data that is captured in the regular functioning of an organization. On top of this data, there are workflows that model the actual physical processes of the organization. These systems, though packaged, provide a certain degree of customization. For example, every automobile manufacturer would have some processes that are similar while maintaining some unique processes (often patented!) of their own. ERP systems provided organizations worldwide with a business reason to adopt a information system and in effect a database system to store their organizational data. This lead to considerable investment in the database field that fueled innovation in product development and research.
What can be done
What is the relevance of these events to scientific databases? There is a growing number of articles and discussions about the need to evolve the relational model to support these needs. The opinion on whether the relational model should support these needs is divided. What, we however need are
Scientists have many things they want to store in the database. These include raw observational data, graphs, three dimensional meshes, n-dimensional data, formulae etc.. On top of this data they need data provenance. Can we come up with a unifying all encompassing data model that would satisfy the above requirements? An important aspect here is a new data model might need a minor change in the way scientists look and use data.
The second challenge is in making all scientific equipment talk the same language when it comes to exchanging data. The adoption of standards such as HL7 and DICOM has made allowed the possibility of interoperability of medical imaging equipment from different vendors. It is time for all kinds of scientific equipment to adopt a similar approach. Imagine a function generator that is connected to an oscilloscope that is in turn connected to a particle detector where the final observations are directly sent to a network port to be received by a software service that adds the relevant provenance data and pushes the data to a common database!
Traditionally workflows were not part of the database and with good reason. The role of the database is to store and retrieve data. Workflows are built on top the database layer. What we need however, is an ERP like system for scientific databases. The processes will not be common across the field just as the processes in finance are different from payroll. We can, however, attempt modeling exercise for a particular field. I would for example, like to see if we can define experiments in fiber optics as workflows that can be guided by a software system. What we need is for the database system to support the features and functionality of Matlab’s Simulink.
A transaction in a relational database is well defined based on the ACID properties(Atomicity, Consistency, Isolation and Durability).. Is this definition valid for scientific databases? The workloads on scientific databases are different and current literature seems to suggest that most scientific databases tend to have large number of inserts while being low on update and delete operations. If we bring provenance into the picture an update would mean the creation of a new version of the data item in many cases. We might need to redefine the notion of consistency due to the addition of provenance information inside the database.
Finally, would SQL suffice as the default interface to the database. SQL is declarative and it may not be suitable to the more exotic formats needed to scientific databases. The fundamental challenge here is that scientists are sophisticated users who may not be willing to spend considerable effort in learning a new programming language. What we need is a language that is intuitive to the way a scientist looks and ‘feels’ data!
Conclusion
These are exciting times! There is a critical mass of users who have a need that needs to be addressed by the database community! In a rush to solve this new need let us not alter something that has stood the test of time. Let us use the lessons we have learned well and maybe its not a bad idea to reinvent the wheel in some cases! Scientific databases pose novel challenges such as provenance, complex data types and distributed processing. The database community has focused on all these issues in the past. The work on these topics were carried out independently and the need for these features was never as urgent as it is now. Perhaps, it is time to bring together research from various branches of the database community and address the needs of the scientific community as a whole.
References
Web Links
Posted in Uncategorized | Tagged Data Provenance, Database, Relatinal Database, Scientific Database | 2 Comments »