Tuesday, May 4, 2010

NormalForm in DBMS

Introduction

Part of the formal mathematical definition of a relational database includes the concept of “Normal Forms.” Each of the Forms are designed to logically address potential problems in referring to and working with information stored in a database. A database is said to be in one of the Normal Forms if it satisfies the rules required by that Form; it also will not suffer from any of the problems addressed by the Form.

Because this will be a rather informal introduction, it will not be precise enough for a purist. If you want stricter definitions, you should read An Introduction to Database Systems by C.J. Date (Addison-Wesley, publisher). Be forewarned: it’s not easy or fun reading, but it is thorough, accurate, and precise. My discussions of the particular normal forms start with Mr. Date’s definitions.

Relational Databases

Permit me to do a bit of hand-waving, and skip a definition of a relational database; instead, I’ll just give a quick description.

A relational database is made of multiple tables; each table visually resembles a small spreadsheet (but there are no cells with calculations). Each column will have the same kind of information (name, address, etc.) for each row; each row will have the specific information for one real-world entity (person). A key column in each table will provide information on how it relates to other tables.

First Normal Form

A relation is in 1NF if and only if all underlying domains contain scalar values only.

1NF is often refered to the atomic rule. “Atom” comes from a Greek word that meant, essentially, the smallest possible piece of something. Before Rutherford did his experiments and made his famous quote about cannonballs bouncing off tissue paper, what we now call atoms were thought to be the smallest possible piece of a given element.

In a database, this means that each column should only be designed to hold one and only one piece of information. Consider the following examples:

Bad Good
Name Address
Ben Goren Post Office Box 964, Tempe, Arizona 85280-0964
Jane Doe 1234 Main Street, Mesa, Arizona 85345
First_Name Last_Name Street_Address City State ZIP
Ben Goren Post Office Box 964 Tempe Arizona 85280-0964
Jane Doe 1234 Main Street Mesa Arizona 85345
In the bad example, it’s impossible to pull an alphabetical list of people by last name; send a form letter that says, “Dear Ben,” in the salutation but address the envelope to “Mr. Goren,” etc. Similarly, you can’t put the envelopes in ZIP code order, which will save signifantly on bulk mail postage rates. You also can’t get a list of everybody who lives in Mesa but not Tempe. In the good example, all those are possible.

The good example above only works if I have a single address. But I don’t: I don’t live in a small box at Fifth and Mill in Tempe; I live in an apartment which only feels like it’s that small. What to do?

Bad
First_Name Last_Name Street_Addresses City State ZIP
Ben Goren Post Office Box 964
1331 West Third Street Apartment 2 Tempe
Tempe Arizona
Arizona 85280-0964
85281
Jane Doe 1234 Main Street Mesa Arizona 85345
Again, we’ve got more than one value in a cell. What if I want to sit right down and write myself a letter? How do I print only one address?

Incidentally, the United States Post Office considers apartnment numbers, suite numbers, etc., to be part of the street address; they should all appear on one line as a single unit.

The most common—but still incorrect—solution is to add columns:

Bad
First_Name Last_Name Mailing_Street_Address Mailing_City Mailing_State Mailing_ZIP Residence_Street_Address Residence_City Residence_State Residence_ZIP
Ben Goren Post Office Box 964 Tempe Arizona 85280-0964 1331 West 3rd Street Apartment 2 Tempe Arizona 85281
Jane Doe 1234 Main Street Mesa Arizona 85345
On its face, this seems to solve the problem—but only for a moment. First, Jane only has one address; providing space for her to have a mailing address is wasteful and will likely hurt database performance. But, more importantly…I don’t just have two addresses, I have three. Because I’m often not home, but my parents are, I use thier address as a shipping address. You could add a third column to the table—but what about the person who also has a business address, a summer home address, a French Riviera address, a girlfriend’s-place-where-he-spends-all-his-time address, and so on? You can keep adding address columns…but it’s cumbersome and prone to breaking.

Notice that the grey part above masks an area with an outline that isn’t a rectangle? If a line drawn around the data in your table isn’t a rectangle, or if the table has holes in it, it’s probably not in 1NF.

Nulls

That last paragraph deserves a bit of further explanation: there are occasions where it’s okay to have holes in your data. For example, it would be quite reasonable to include a column for a person’s middle name, as well as first and last names. However, not everybody—I, for example—has a middle name, In this case, you would store a special value for my middle name in the dabase, called a null. This isn’t the word “null,” but rather something that means, essentially, “the value doesn’t exist or is unknown or unknowable.” If you remember your set theory from algebra, a null in a database is the same thing as the empty set. Just as {0} and {} are different, so are “null” and the null value.

Null values are not only permissible but necessary—I’d be pretty darn upset if you made up a middle name for me—but, when you have them, especially a lot of them, in your database, that should be a signal that you need to make certain that the database is in 1NF.

Keys

Before I can discuss further implications of 1NF, I must first discuss the topic of keys.

Elsewhere in the pedantic discussion of databases is the concept of keys and the uniqueness of rows in a database. The idea is that you can’t have duplicate entries in any table; if you did, which—for example—Ben Goren should you update when my address changes? If you only updated one, you’d now have me listed multiple times with different addresses…and which one is current?

To solve that problem, there’s the idea of keys, especially primary keys. A key is a column or combination of columns which will always be unique. The primary key is the one column or combination of columns which is used to identify a given row. Usually, there’s only one column that qualifies as a key, and therefore becomes the primary key by default.

Very often, a column is created solely for the purpose of serving as the primary key. You’re familiar with the idea of getting an order number when you place an order over the phone; you know that you can give that number back to somebody at the company and they’ll know that you mean the order you placed on Wednesday, not Monday. They’ll also know that it’s your order, not your sister’s who lives at the same address.

The table above will need such a column. A quick perusal of the phone book will reveal scores of “John Smiths.” It’s even concievable that John Smith could have a roommate also named John Smith, so you can’t use the combination of name and address as a key. You can think of some unique way of telling the two apart—even if they were born at the same time, they weren’t born in the exact same place, for example—but such information might not be available. Further, having a small number of columns as your primary key will often provide better performance than having a large number of columns.

Most database programs will have a means of identifying more than one key in a given table. This would be done when you want to enforce a requirement that those keyed columns always have unique values for every row. For example, a database version of the periodic table of the elements would have keys on the name, symbol, atomic number, etc., as those values are unique for each element. In this example, which column(s) you chose for your primary key would be part of the art of database design.

Null values are not permissible in a key column.

Keys are also used to relate information from one table to another—but more on that later.

First Normal Form, Continued

The solution to the problem of multiple addresses is to break the information into two tables:

Good
Table: People
ID* First_Name Last_Name
1 Ben Goren
2 Jane Doe
Table: Addresses
Person*->People:ID Kind* Address City State ZIP
1 Mailing Post Office Box 964 Tempe Arizona 85280-0964
1 Residence 1331 West Third Street Apartment 2 Tempe Arizona 85281
2 Mailing 1234 Main Street Mesa Arizona 85345
*Primary key
In the People table, we now have an artifical ID column that serves as the primary key. We’ll let the database program worry about creating unique values for each person; in my examples, I’ll just number them sequentially.

The Addresses table starts with what’s called a foreign key. The values in the person column have a matching value in the ID column of the People table. If you want to know the name of the person who owns the first address, you look for the person who has ID 1 in the People table; same for the sedcond address. The third address belongs to the person with ID 2.

The Addresses table uses a composite primary key. Because a person might have more than one address, the Person column isn’t enough to guarantee uniqueness. Similarly, more than one person has a mailing address, so we can’t use the Kind column. However, each person has only one of each possible kind of address, so the two columns together can function as the primary key.

Relationships

This pair of tables demonstrates what’s called a “one-to-many relationship.” Each one person may (or may not) have many addresses, but each address belongs to one and only one person. If two people lived at the same address, you’d have two lines that were otherwise identical except for the Person column.

There are two other kinds of relationships: one-to-one and many-to-many. Because each person has only one birthdate, you could create a separate table with the person’s ID (from the People table) and their birthdate. This would be a one-to-one relationship. It would be equally valid to include this information in the People table, but you might choose not to do so for logical or performance reasons (again, this is where the art comes in).

A book might have more than one author, and an author might write more than one book; this is a many-to-many relationsihp. To work with such, you must have not two, but three tables:

Good
Table: Authors
ID* First_Name Last_Name
1 John Doe
2 Jane Smith
3 Judith Brown
4 Bill Gates
Table: Books
ID* Book
1 Database Design
2 Learning Spreadsheets
3 Basic Wordprocessing
4 World Conquest
Table: Authors-Books
Author*->Authors:ID Book*->Books:ID
1 1
2 1
3 1
2 2
3 2
1 3
2 3
4 4
*Primary key
The above trio of tables says that John, Jane, and Judith all collaborated to write Database Design; Jane and Judith together wrote Learning Spreadsheets; John and Jane worked with each other on Basic Wordprocessing; and Bill wrote World Conquest all by himself.

Queries

While database design is often about breaking up information, acutally using the information means putting it back together again. To pull information from a database is to query it; the bit of computer code used to pull the information is called a query.

In a database with a pretty visual interface, such as Microsoft Access, queries can often be created simply by pointing and clicking. Most large-scale databases use a computer language called Structured Query Language, or SQL, for the task.

A simple query against a database with the addressbook tables above might look something like this:

SELECT Last_Name, First_Name FROM People

The result might look like this:

Goren Ben
Doe Jane

The power of relational databases comes in joining two tables together in a query. The following bit of code lists every address for every person in the database and prints the results in alphabetical order by last name:

SELECT People.First_Name, People.Last_Name, Addresses.Address, Addresses.City, Addresses.State, Addresses.ZIP FROM People, Addresses WHERE People.ID = Addresses.Person ORDER BY People.Last_Name

The result will look like the following:

Ben Goren Post Office Box 964 Tempe Arizona 85280-0964
Ben Goren 1331 West Third Street Apartment 2 Tempe Arizona 85281
Jane Doe 1234 Main Street Mesa Arizona 85345

Notice the two lines for me? This isn’t in any normal form, but that’s okay—this is what the database is generating, not what the database is storing. This is exactly the way you want things to work: the database stores the information in an efficient manner (multiple tables), but, for humans, it presents it to you on demand in any manner you want, including redunant ones.

Much of SQL is beyond the scope of this document; it can do amazing things, but you don’t need to know how to do all those amazing things to understand how a database works. I’ll leave you with one last example, which will select only those books that Judith Brown wrote in the above author/books example:

SELECT Authors.First_Name, Authors.Last_Name, Books.Book FROM Authors, Books WHERE Authors.First_Name = "Judith" AND Authors.Last_Name = "Brown" AND Authors.ID = Authors-Books.Author

The result:

Judith Brown Database Design
Judith Brown Learning Spreadsheets

Second Normal Form

(definition asusming only one candidate key, which is thus the primary key): A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is irreducibly dependent on the primary key.

Once you have your database in 1NF, you’ve finished the hard part. The other normal forms are important, but not as easy to break or address problems that aren’t as common. An important thing to remember: if the database isn’t in 1NF, it can’t, no way no how, be in 2NF either. A database that is in a higher normal form is also in all of the lower normal forms.

The point of 2NF is to avoid the following:

Bad
ID First_Name Last_Name Spouses_Birthday
3 John Smith 02/07/65
4 Sue Jones 07/26/76
The problem is that the last column relates not to John Smith, but to another person entirely. John has only one spouse, who only has one birthday…but that’s too tenuous a relationship to embed into a single table. Rather, you would want to structure your tables as follows:

Good
Table: People
ID* First_Name Last_Name Birthday
3 John Smith 08/01/62
4 Sue Jones 10/12/75
5 Mary Smith 02/07/65
Table: Spouses
Husband*->People:ID Wife*->People:ID
3 5
*Primary key
This arrangement allows you to know not only John’s wife’s birthday, but everything else you might want to know about her. If they should get divorced or if they re-marry, you only have to change the Spouses table to reflect the changes.

As a side note: the Spouses table, as it’s set up, permits polygamy and polyandry. There are many societies around the globe where this happens regularly, so you might want to leave things this way. If you’re sure that nobody in the database will have more than one spouse, you should specify that each column is a candidate key; that way, only one person can appear in each column. Further constraints and logic could ensure that only men got listed as husbands and women as wives, but that starts to get complex….

Third Normal Form

(definition asusming only one candidate key, which is thus the primary key): A relation is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key.

3NF takes the process started in 2NF to its logical conclusion. Columns in a table should have information that is intrinsically a part of the nature of the primary key.

Consider, for example, a company’s ordering and shipping database. If only complete orders ever get shipped, you’d want to include the shipment date with the orders table. If partial orders can be shipped as items arrive, the shipment date belongs with the item detais table. Which of the two choices puts the database in 3NF depends on the underlying business model.

Fourth Normal Form

Relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the (nontrivial) MVD A->>B is satisfied, then all attributes of R are also functionally dependent on A.

The process of reducing interdependencies between columns (not tables) continues with 4NF. Assuming every teacher uses the same books when teaching the same course, the following table is not in 4NF:

Bad
Course Teacher Textbook
Physics Jones Basic Mechanics
Physics Jones Principles of Optics
Physics Smith Basic Mechanics
Physics Smith Principles of Optics
Math Jones Basic Mechanics
Math Jones Vector Analysis
Math Jones Trigonometry
Looking at it intuitively, there’s a lot of duplication of information. If the school decided to drop Basic Mechanics in favor of Introduction to Mechanics, you’d have a lot of work to update everything.

The solution, again, is to break the table in two:

Good
Table: Course-Teacher
Course Teacher
Physics Jones
Physics Smith
Math Jones
Table: Course-Text
Course Text
Physics Basic Mechanics
Physics Principles of Optics
Math Basic Mechanics
Math Vector Analysis
Math Trigonometry
Naturally, you’d be using foreign keys for most everything—but I’m sure you don’t want to be overwhelmed with an entirely new, large database for one example.

Breaking up the table allows you to worry just about which teacher is teaching which coures, and which books are available to each course.

If different teachers use different books when teaching different courses, then the underlying assumption (a few paragraphs ago) is invalid, and so is this solution. It all depends on what’s going on in the “real world.”

Fifth Normal Form

A relation R is in 5NF—also called projection-join normal form (PJ/NF)—if and only if every join dependency in R is implied by the candidate keys of R.

You might notice that the solution to almost every normalization problem involves breaking up the offending table into two or more parts. The logical conclusion to this process is to break everything up into the smallest possible parts—in other words, you’d have tables with nothing other than primary keys and foreign keys. (That’s an oversimplification, but not much of one.) While 5NF solves nearly all theoretical problems, it introduces many practical ones. It’s good to know that it’s an option to solving problems, but you should think very carefully before putting an entire database in 5NF. Sometimes, 5NF is appropriate for portions of a database…but usually only for those who really know what they’re doing.

Le Fin

While 5NF may be an extreme, it does suggest a good rule of thumb to ensure your database doesn’t suffer from these kinds of problems: “When in doubt, break it up.”

If your database isn’t in 1NF—and almost every database I’ve seen an amateur design hasn’t been—you’re in for plenty of trouble and headaches. You’re likely to spend as much time maintaining the database as you will actually using it. A database that isn’t in 1NF might actually be worse than a card catalog: my experience has shown me that people who don’t know enough about database design to at least put the database in 1NF also don’t know how to back up and archive data; if you don’t keep good backups, your computer data is much more fragile and prone to loss and corruption than paper records kept in a labeled filing cabinet.

Once you get over the hurdle of 1NF, using the same basic kind of intuitive thinking is likely to help you put the database in 4NF with relative ease. With many simple databases, the process of putting them in 1NF incidentally also puts them in 4NF—but don’t rely on that generalization to think you don’t need to use a critical eye before declaring a database design finished.

For that matter, very few databases are ever done in their design. The world changes, and so does the data about it. Assumptions that once were valid are now invalid; things we take for granted were impossible a generation ago. After the database designer is done, the database administrator must keep a keen watch over the database and make sure that everything keeps working smoothly, tweaking things as time passes.

No comments:

Post a Comment