Information-theoretic approach to database normalization

by krishnamoorthyb


There has been recent work revisiting how we define database design quality using normal forms. For a long time now, E.F. Codd's work has been the seminal reference for defining the ‘goodness’ of a relational data model. Generations of data modellers have used the definition of third normal form, and on rare occasions Boyce-Codd normal form, to deliver database designs. These definitions are built on a fundamental concept called a functional dependency. A functional dependency is a way to express a constraint that actually exists in the data. Some discussions treat database normalization as the removal of functional dependencies. This is very misleading: we do not remove functional dependencies. They exist!

While functional dependencies and the classical normal forms have been sufficient for most business needs, it has long been felt that they do not adequately capture all constraints on the data. Furthermore, newer data models like XML cannot reuse the same concepts even though they work on the same data. If they work on the same data, the same constraints exist!

A new approach attempts to use concepts from information theory to redefine the database normal forms while retaining their essence. What this means is that a relation is in 3NF irrespective of whether we use the traditional definition or the new information-theoretic one.

A fundamental concept in information theory is entropy. Entropy is a measure of the uncertainty associated with a random variable. What does this mean for data? Claude Shannon, who founded the field of information theory, asked the question: “What information am I missing if I don't know the value of a random variable?” or, more concretely, “What do I lose when I don't know who will win the toss?” This is exactly what Shannon's definition of entropy measures.
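To make that concrete, here is a small sketch of my own (plain Python, not anything from the paper) that computes Shannon entropy for the coin-toss example:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy H(X) = -sum p(x) * log2 p(x), measured in bits."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A fair coin toss: maximum uncertainty, 1 bit of missing information.
print(entropy(["heads", "tails"]))           # 1.0

# A rigged coin that always lands heads: nothing is missing.
print(entropy(["heads", "heads", "heads"]))  # -0.0, i.e. zero uncertainty
```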

The same notion is extended to the ‘goodness’ of a design. To use this approach, we define our world in terms of data units, and data units come together to form relations. Treat a data unit as a random variable and ask: “If I remove the value of a data unit, what information will I lose?” Compare this with a functional dependency of the form name -> phonenum. If there is one record (sss, 12344), then for another record where name = sss I lose nothing by not knowing the phone number, since I can deduce it from the first record. This measure of entropy is then used to define the various normal forms, which I will write about in detail tomorrow.
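Here is a minimal sketch of that intuition, again my own Python rather than the paper's formalism: the relation is just a list of (name, phonenum) pairs, and the conditional entropy H(phonenum | name) drops to zero exactly when the functional dependency holds.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(pairs):
    """H(Y | X) for a list of (x, y) pairs: the information still missing
    about y once x is known. If x functionally determines y, this is 0."""
    total = len(pairs)
    by_x = defaultdict(list)
    for x, y in pairs:
        by_x[x].append(y)
    h = 0.0
    for ys in by_x.values():
        p_x = len(ys) / total
        counts = Counter(ys)
        # Entropy of y within the group sharing this x value.
        h_y_given_x = -sum((c / len(ys)) * math.log2(c / len(ys))
                           for c in counts.values())
        h += p_x * h_y_given_x
    return h

# name -> phonenum holds: knowing the name leaves nothing to learn about the phone.
good = [("sss", "12344"), ("sss", "12344"), ("ttt", "99999")]
print(conditional_entropy(good))  # 0.0

# The dependency is violated: knowing the name no longer pins down the phone.
bad = [("sss", "12344"), ("sss", "55555")]
print(conditional_entropy(bad))   # 1.0
```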

The concept is very interesting, and it can be extended to XML data as well. Pretty cool….

The original work can be found here -> http://portal.acm.org/citation.cfm?id=1059519&dl=
