Database Normalization as I remember it
Database normalization is a topic that is discussed much more than it is really understood or practiced. Database normalization is the responsibility of a data modeler. This role is often loosely defined in software projects and more often than not the engineer assigned this task rarely finds the time or the reason to justify the effort on database normalization.
Being interested in database design, I have tried to read and understand every book, paper and article that I could get hold of. Reading multiple sources has actually left me confused! The reason for this is that there are many schools of thought with respect to database normalization. Some use the notion of ‘Functional Dependencies’ while others do not. Many discussions use the notion of an attribute and try to define all normal forms in one line – ‘An attribute depends on the key, nothing but the key and the whole key’. The faction that uses functional dependencies is not consistent in itself. Some say what functional dependencies are allowed while others say what are not! Some use the notion of closure of a set of functional dependencies while others don’t. I have difficulty in visualizing a complex web of interconnections which is what happens if you try to understand everyone’s definition of normal forms. If you are a reader and have a good understanding of normal forms, maybe you should not read mine too (if you are like me). With time, I have come to understand that one of the ways to understand a topic is to try and get into the mind of the engineer/scientist who proposed it originally. I read E.F.Codd and also read and re read C.J. Date. I now think I have understood what they meant and I hope my understanding is correct.
In the following passages I use the term table and relation interchangeably. Similarly I use the terms column and attribute interchangeably. They mean the same thing in the context of this article. In the same manner a tuple and a row are identical.
Why normal forms?
Normal forms are a way to develop relational models that avoid unnecessary redundancy of data. The problem with redundant data is NOT the extra storage required. The problem actually is:
- Maintaining duplicated data consistent across insert, delete and update operations
- The inability to represent some information.
I find the second reason rather fascinating! One may argue that the first reason can be offset by extra code but how do we mitigate the second risk?
Consider a relation used to store products and supplier information. The information regarding products and suppliers is stored in a single table as shown below. The table shows the purchase of products from a supplier on different dates.
|p001||monitor||samsung01||9188888||Jan – 1 – 2009|
|p001||monitor||samsung01||9188888||Feb – 1 – 2009|
|p002||printer||samsung01||9177777||Feb – 15 – 2009|
|p003||keyboard||logitech01||9166666||Nov – 23 – 2009|
|p003||keyboard||microsoft01||9155555||Mar – 15 – 2010|
Assume the following are true about the real world.
- A prod-id uniquely identifies a product and it cannot be null.
- A Supplier_id uniquely identifies a supplier and it cannot be null.
- A product can be purchased from more than one supplier as can be seen in the table above for the keyboard.
For the above table, the primary key is (prod_id,supplier_id,purchase_date). This means that the combination of the three columns should be unique for every row in the table. Formal definitions of keys follow later in the article.
Given that this is the complete schema, all of the following scenarios would be a ‘problem’
- Delete of the last row (keyboard from microsoft). If we delete the row, We lose the contact phone number of the supplier as there is no separate table to store supplier information.
- Change the phone number of an existing supplier. To do this, we need to update every row that contains the supplier id. Ideally, we should be storing the phone number only once.
- Add a new column to store the name of the supplier. Again we would be adding the supplier name multiple times, once for each purchase made from the supplier.
- Add a new supplier for a new set of products without actually purchasing from them. If we did manage to add a new supplier (and the name of the supplier) it would look like this:
|p001||monitor||samsung01||ABC trading||9188888||Jan – 1 – 2009|
|p001||monitor||samsung01||ABC trading||9188888||Feb – 1 – 2009|
|p002||printer||samsung01||ABC trading||9177777||Feb – 15 – 2009|
|p003||keyboard||logitech01||First Comp||9166666||Nov – 23 – 2009|
|p003||keyboard||microsoft01||Digital Life||9155555||Mar – 15 – 2010|
|p004||LED Monitor||acer01||Acer Corp||9144444||null|
In trying to add a new supplier, we have added the supplier id and phone number twice – once for each product from the supplier. We have also made the purchase date null. This would not even be allowed in a real scenario as we would want the purchase date to be a non null column. This ‘inability’ to represent the information that acer01 is a supplier of netbooks and LED monitors is the second reason why we need database normal forms alluded to above.
so, redundant data is bad! very bad!
It is sometimes easy to spot data redundancies in design when we are looking at live data as in the example above. What we, however need is a generic method to deduce data redundancies without even looking at sample data. Database normalization is one way to identify potential redundancies in design. Codd originally proposed three normal forms. They are called the First, Second and Third Normal forms. Due to a case not anticipated, the third normal form was revised and proposed as an entirely new normal form called Boyce Codd normal form(BCNF). There are many more normal forms but they are rarely used in practice. Here are the definitions of the normal forms as I would like to remember and use them! However, a few definitions are in order first!
It is easy to understand normal forms using the notion of a functional dependency(FD). A functional dependency is a relationship between two or more columns of a table. Given a relation R and a set of attributes alpha and beta, a functional dependency between alpha and beta is represented as
alpha –> beta
and is defined as:
Given any two tuples(rows) t1 and t2 of R:
if t1(alpha) = t2(alpha) it is also implied that t1(beta) = t2(beta).
In the products table above, the functional dependencies that need to be valid are:
prod_id –> prod_name
supplier_id –> supplier_name
supplier_id –> supplier_phone
It is obvious to see that functional dependencies model constraints that we would like our data to satisfy. We would not like two different products with the same product id, would we?
Derived Functional Dependencies
Given a set of FDs it is possible to derive more FDs that are logically implied by the original set of FDs. The complete set of FDs is called the closure of a set of FDs. The closure can be a pretty large set and we have a smaller set that is equivalent to the closure called the canonical cover. For purposes of simplicity we will not get into the details of this in this article.
Keys (candidate and primary)
A table should have one or more set of columns that uniquely identify every row in the table. The column or columns that contain unique(not null and different) values for every row of the table is called a key. A key can be a simple key or a compound key. A simple key contains a single attribute while a compound key is an unordered collection of attributes. A table can have one or more such set of attributes that form a key. Each such key is called a candidate key. One of these candidate keys is designated as a primary key. One of the reasons for designating a primary key is to use it in foreign key relationships. I am using the term key to refer to primary and candidate keys.
A key can also be defined as a functional dependency.
Given a key ‘K’ for a relation R, the following functional dependency is true:
K –> R
where R stands for all the attributes of the relation.
In the products and suppliers example above, we have a compound key and that is the only key on the table. In terms of a functional dependency (FD), we can say that the following FD holds
(product_id,supplier_id,purchase_date) –> (supplier_name, prod_name)
Let us add a new column called row_id that would hold a unique value for each row. The table now looks like this:
|001||p001||monitor||samsung01||ABC trading||9188888||Jan – 1 – 2009|
|002||p001||monitor||samsung01||ABC trading||9188888||Feb – 1 – 2009|
|003||p002||printer||samsung01||ABC trading||9177777||Feb – 15 – 2009|
|004||p003||keyboard||logitech01||First Comp||9166666||Nov – 23 – 2009|
|005||p003||keyboard||microsoft01||Digital Life||9155555||Mar – 15 – 2010|
|006||p004||LED Monitor||acer01||Acer Corp||9144444||null|
We now have two keys for the relation.
Key 1: (row_id)
Key 2: (prod_id, supplier_id, purchase_date)
Key 1 is a simple key while key 2 is a compound key. Key 1 and key 2 are called candidate keys. Let us designate Key 1 as the primary key. (It is a good practice to designate a simple key as a primary key).
Key Column and Partial Key Column:
If a candidate key is a simple key, the column is also called a key column. row_id in the example above is a key column. If the candidate key is a compound key, the columns that constitute the candidate key are all called partial key columns. Strictly, any subset of a key is called a partial key. For key 2 in the example above all of the following are partial keys
A column that is not a key column or a partial key column is called a non-key column. In effect all columns can be classified as key, partial key or non-key columns.
The definition of Normal Forms
It is easy to deduce the normal forms based on the definition of a functional dependency(FD). Since the First Normal Form (1NF) does not depend on FDs, here is the definition of 1NF.
First Normal Form:
The domain of every column should be atomic.
This actually excludes user defined data types. It also excludes clubbing the values of one or more columns as a single column. In a strict sense even storing the full name of a person as (firstname,lastname) in a single column is a violation of the first normal form!
The following table violates first normal form because of columns Name and Dependents. The Name column actually stores three things: the Title, First Name and Last Name. The Dependents column is storing the list of dependents and their ages in a single list! The fact that a cat is not a dependent cannot be handled by the relational model!
|778324673||Mr. Bill,Bryson||(Sara, 32, James,12)|
|778324456||Dr. John, Smith||(Molly, 25, Peter,8, Cat, 3)|
Higher Normal Forms
Second Normal From (2NF), Third Normal Form (3NF) and Boyce Codd Normal Form (BCNF) are defined in terms of the functional dependencies they allow on the relation. It was a brilliant idea of Codd to look at functional dependencies. Why? Look at a functional dependency of the form
alpha –> beta
What this says is that whenever alpha is the same(repeats) for two rows, beta would also be the same(repeat). Redundancy creeps in when we have a table that stores alpha, beta and other columns that do not repeat when alpha repeats. We repeat the value of beta every time alpha repeats. This repetition of beta is data redundancy. To avoid this redundancy, what if I can make sure that alpha never repeats in a given table? I can do that simply by making sure that alpha is a key of the table! This can be done if I create a table that contains only the columns of alpha and beta. This is the main aim of database normalization!
Functional Dependency Templates
Given the FD alpha->beta, what are the possible types that alpha and beta can have?
alpha could be simple or composite
beta could be simple or composite
We can ignore the case where beta is composite because it can be rewritten as multiple FDs where alpha is the same and each element of beta forms the dependent attribute of the FD.
For ex, AB->CD is equivalent to AB->C and AB->D.
Since a column can only be a key, a partial-key or a non-key, the only FDs possible are of the form
Based on the definitions of a FD and a key, 8 and 9 are not possible! Similarly 6 is not possible because, in that case the partial-key would also become a key. From the point of view of avoiding data redundancies, FDs 1,2 and 3 are ideal since we are guaranteed that a key would not repeat within a table. Removing the FDs that are not possible and the FDs 1,2 and 3 since they do not contribute to data redundancies, we now have only the following ‘undesirable’ forms
- 4. partial-key->non-key
- 5. partial-key->partial-key
- 7. non-key->non-key
A table should not have any FDs of the above forms . However, we cannot wish them away! A functional dependency models a real world constraint and our database design must be able to enforce that. 2NF, 3NF and BCNF each preclude one of the above types of FDs existing on a relation. It is also important to note that the normal forms are defined in a an ordered fashion. For a table to satisfy the conditions of 3NF it must be in 1NF and 2NF. Similarly for a table to satisfy BCNF it must be in 1NF, 2NF and 3NF.
What if a table satisfies the conditions for 2NF and violates the condition for 3NF? The table is said to be in 2NF and a decomposition method is defined to convert the table to a set of tables each of which would satisfy the conditions of 3NF. Decomposition algorithms are specified for all three normal forms – 2NF, 3NF and BCNF. It i snow easy to define the three remaining normal forms
Second Normal Form (2NF)
A table is in 2NF if it is in 1NF and it has no FD of the form
partial-key –> non-key
Third Normal Form (3NF)
A table is in 3NF if it is in 2NF and it has no FD of the form
non-key –> non-key
Boyce Codd Normal Form (BCNF)
A table is in BCNF if it is in 3NF and it has no FD of the form
partial-key –> partial-key
As mentioned earlier, if a table violates a normal form, we use a decomposition algorithm to decompose the table into one or more tables. This decomposition algorithm is different for each normal form. I will discuss decomposition and related issues in a separate article.
A parting note though.. On what set of FDs should we check the above conditions? Ideally it should be on the closure of the original set of FDs. The closure can be computed from the original set by repeated application of the Armstrong’s axioms for functional dependencies. Again, I am not discussing Armstrong’s axioms here. The closure could be a large set and there are ways to handle this too. I Will discuss them too in a separate article. The purpose of this article is to help quickly identify if a given FD violates a specified normal form. If that purpose is satisfied, the purpose of writing this is satisfied.
Database normal forms help us in removing data duplication in relational databases. Data duplication is bad because it makes insert, update and delete operations complicated if we want to maintain data consistency. Database normal forms help us identify potential data duplication and help us remove them. Normal forms use the notion of Functional Dependencies which are a kind of dependency between the columns of a table. Each normal form precludes certain FDs on a table. If an FD violates a normal form, we do not throw it away. We decompose the table into more table so that each of the decomposed tables satisfy our desired normal forms. This decomposition has some side effects that are not discussed here.