Tuesday, May 4, 2010

NormalForm in DBMS

Introduction

Part of the formal mathematical definition of a relational database includes the concept of “Normal Forms.” Each of the Forms are designed to logically address potential problems in referring to and working with information stored in a database. A database is said to be in one of the Normal Forms if it satisfies the rules required by that Form; it also will not suffer from any of the problems addressed by the Form.

Because this will be a rather informal introduction, it will not be precise enough for a purist. If you want stricter definitions, you should read An Introduction to Database Systems by C.J. Date (Addison-Wesley, publisher). Be forewarned: it’s not easy or fun reading, but it is thorough, accurate, and precise. My discussions of the particular normal forms start with Mr. Date’s definitions.

Relational Databases

Permit me to do a bit of hand-waving, and skip a definition of a relational database; instead, I’ll just give a quick description.

A relational database is made of multiple tables; each table visually resembles a small spreadsheet (but there are no cells with calculations). Each column will have the same kind of information (name, address, etc.) for each row; each row will have the specific information for one real-world entity (person). A key column in each table will provide information on how it relates to other tables.

First Normal Form

A relation is in 1NF if and only if all underlying domains contain scalar values only.

1NF is often refered to the atomic rule. “Atom” comes from a Greek word that meant, essentially, the smallest possible piece of something. Before Rutherford did his experiments and made his famous quote about cannonballs bouncing off tissue paper, what we now call atoms were thought to be the smallest possible piece of a given element.

In a database, this means that each column should only be designed to hold one and only one piece of information. Consider the following examples:

Bad Good
Name Address
Ben Goren Post Office Box 964, Tempe, Arizona 85280-0964
Jane Doe 1234 Main Street, Mesa, Arizona 85345
First_Name Last_Name Street_Address City State ZIP
Ben Goren Post Office Box 964 Tempe Arizona 85280-0964
Jane Doe 1234 Main Street Mesa Arizona 85345
In the bad example, it’s impossible to pull an alphabetical list of people by last name; send a form letter that says, “Dear Ben,” in the salutation but address the envelope to “Mr. Goren,” etc. Similarly, you can’t put the envelopes in ZIP code order, which will save signifantly on bulk mail postage rates. You also can’t get a list of everybody who lives in Mesa but not Tempe. In the good example, all those are possible.

The good example above only works if I have a single address. But I don’t: I don’t live in a small box at Fifth and Mill in Tempe; I live in an apartment which only feels like it’s that small. What to do?

Bad
First_Name Last_Name Street_Addresses City State ZIP
Ben Goren Post Office Box 964
1331 West Third Street Apartment 2 Tempe
Tempe Arizona
Arizona 85280-0964
85281
Jane Doe 1234 Main Street Mesa Arizona 85345
Again, we’ve got more than one value in a cell. What if I want to sit right down and write myself a letter? How do I print only one address?

Incidentally, the United States Post Office considers apartnment numbers, suite numbers, etc., to be part of the street address; they should all appear on one line as a single unit.

The most common—but still incorrect—solution is to add columns:

Bad
First_Name Last_Name Mailing_Street_Address Mailing_City Mailing_State Mailing_ZIP Residence_Street_Address Residence_City Residence_State Residence_ZIP
Ben Goren Post Office Box 964 Tempe Arizona 85280-0964 1331 West 3rd Street Apartment 2 Tempe Arizona 85281
Jane Doe 1234 Main Street Mesa Arizona 85345
On its face, this seems to solve the problem—but only for a moment. First, Jane only has one address; providing space for her to have a mailing address is wasteful and will likely hurt database performance. But, more importantly…I don’t just have two addresses, I have three. Because I’m often not home, but my parents are, I use thier address as a shipping address. You could add a third column to the table—but what about the person who also has a business address, a summer home address, a French Riviera address, a girlfriend’s-place-where-he-spends-all-his-time address, and so on? You can keep adding address columns…but it’s cumbersome and prone to breaking.

Notice that the grey part above masks an area with an outline that isn’t a rectangle? If a line drawn around the data in your table isn’t a rectangle, or if the table has holes in it, it’s probably not in 1NF.

Nulls

That last paragraph deserves a bit of further explanation: there are occasions where it’s okay to have holes in your data. For example, it would be quite reasonable to include a column for a person’s middle name, as well as first and last names. However, not everybody—I, for example—has a middle name, In this case, you would store a special value for my middle name in the dabase, called a null. This isn’t the word “null,” but rather something that means, essentially, “the value doesn’t exist or is unknown or unknowable.” If you remember your set theory from algebra, a null in a database is the same thing as the empty set. Just as {0} and {} are different, so are “null” and the null value.

Null values are not only permissible but necessary—I’d be pretty darn upset if you made up a middle name for me—but, when you have them, especially a lot of them, in your database, that should be a signal that you need to make certain that the database is in 1NF.

Keys

Before I can discuss further implications of 1NF, I must first discuss the topic of keys.

Elsewhere in the pedantic discussion of databases is the concept of keys and the uniqueness of rows in a database. The idea is that you can’t have duplicate entries in any table; if you did, which—for example—Ben Goren should you update when my address changes? If you only updated one, you’d now have me listed multiple times with different addresses…and which one is current?

To solve that problem, there’s the idea of keys, especially primary keys. A key is a column or combination of columns which will always be unique. The primary key is the one column or combination of columns which is used to identify a given row. Usually, there’s only one column that qualifies as a key, and therefore becomes the primary key by default.

Very often, a column is created solely for the purpose of serving as the primary key. You’re familiar with the idea of getting an order number when you place an order over the phone; you know that you can give that number back to somebody at the company and they’ll know that you mean the order you placed on Wednesday, not Monday. They’ll also know that it’s your order, not your sister’s who lives at the same address.

The table above will need such a column. A quick perusal of the phone book will reveal scores of “John Smiths.” It’s even concievable that John Smith could have a roommate also named John Smith, so you can’t use the combination of name and address as a key. You can think of some unique way of telling the two apart—even if they were born at the same time, they weren’t born in the exact same place, for example—but such information might not be available. Further, having a small number of columns as your primary key will often provide better performance than having a large number of columns.

Most database programs will have a means of identifying more than one key in a given table. This would be done when you want to enforce a requirement that those keyed columns always have unique values for every row. For example, a database version of the periodic table of the elements would have keys on the name, symbol, atomic number, etc., as those values are unique for each element. In this example, which column(s) you chose for your primary key would be part of the art of database design.

Null values are not permissible in a key column.

Keys are also used to relate information from one table to another—but more on that later.

First Normal Form, Continued

The solution to the problem of multiple addresses is to break the information into two tables:

Good
Table: People
ID* First_Name Last_Name
1 Ben Goren
2 Jane Doe
Table: Addresses
Person*->People:ID Kind* Address City State ZIP
1 Mailing Post Office Box 964 Tempe Arizona 85280-0964
1 Residence 1331 West Third Street Apartment 2 Tempe Arizona 85281
2 Mailing 1234 Main Street Mesa Arizona 85345
*Primary key
In the People table, we now have an artifical ID column that serves as the primary key. We’ll let the database program worry about creating unique values for each person; in my examples, I’ll just number them sequentially.

The Addresses table starts with what’s called a foreign key. The values in the person column have a matching value in the ID column of the People table. If you want to know the name of the person who owns the first address, you look for the person who has ID 1 in the People table; same for the sedcond address. The third address belongs to the person with ID 2.

The Addresses table uses a composite primary key. Because a person might have more than one address, the Person column isn’t enough to guarantee uniqueness. Similarly, more than one person has a mailing address, so we can’t use the Kind column. However, each person has only one of each possible kind of address, so the two columns together can function as the primary key.

Relationships

This pair of tables demonstrates what’s called a “one-to-many relationship.” Each one person may (or may not) have many addresses, but each address belongs to one and only one person. If two people lived at the same address, you’d have two lines that were otherwise identical except for the Person column.

There are two other kinds of relationships: one-to-one and many-to-many. Because each person has only one birthdate, you could create a separate table with the person’s ID (from the People table) and their birthdate. This would be a one-to-one relationship. It would be equally valid to include this information in the People table, but you might choose not to do so for logical or performance reasons (again, this is where the art comes in).

A book might have more than one author, and an author might write more than one book; this is a many-to-many relationsihp. To work with such, you must have not two, but three tables:

Good
Table: Authors
ID* First_Name Last_Name
1 John Doe
2 Jane Smith
3 Judith Brown
4 Bill Gates
Table: Books
ID* Book
1 Database Design
2 Learning Spreadsheets
3 Basic Wordprocessing
4 World Conquest
Table: Authors-Books
Author*->Authors:ID Book*->Books:ID
1 1
2 1
3 1
2 2
3 2
1 3
2 3
4 4
*Primary key
The above trio of tables says that John, Jane, and Judith all collaborated to write Database Design; Jane and Judith together wrote Learning Spreadsheets; John and Jane worked with each other on Basic Wordprocessing; and Bill wrote World Conquest all by himself.

Queries

While database design is often about breaking up information, acutally using the information means putting it back together again. To pull information from a database is to query it; the bit of computer code used to pull the information is called a query.

In a database with a pretty visual interface, such as Microsoft Access, queries can often be created simply by pointing and clicking. Most large-scale databases use a computer language called Structured Query Language, or SQL, for the task.

A simple query against a database with the addressbook tables above might look something like this:

SELECT Last_Name, First_Name FROM People

The result might look like this:

Goren Ben
Doe Jane

The power of relational databases comes in joining two tables together in a query. The following bit of code lists every address for every person in the database and prints the results in alphabetical order by last name:

SELECT People.First_Name, People.Last_Name, Addresses.Address, Addresses.City, Addresses.State, Addresses.ZIP FROM People, Addresses WHERE People.ID = Addresses.Person ORDER BY People.Last_Name

The result will look like the following:

Ben Goren Post Office Box 964 Tempe Arizona 85280-0964
Ben Goren 1331 West Third Street Apartment 2 Tempe Arizona 85281
Jane Doe 1234 Main Street Mesa Arizona 85345

Notice the two lines for me? This isn’t in any normal form, but that’s okay—this is what the database is generating, not what the database is storing. This is exactly the way you want things to work: the database stores the information in an efficient manner (multiple tables), but, for humans, it presents it to you on demand in any manner you want, including redunant ones.

Much of SQL is beyond the scope of this document; it can do amazing things, but you don’t need to know how to do all those amazing things to understand how a database works. I’ll leave you with one last example, which will select only those books that Judith Brown wrote in the above author/books example:

SELECT Authors.First_Name, Authors.Last_Name, Books.Book FROM Authors, Books WHERE Authors.First_Name = "Judith" AND Authors.Last_Name = "Brown" AND Authors.ID = Authors-Books.Author

The result:

Judith Brown Database Design
Judith Brown Learning Spreadsheets

Second Normal Form

(definition asusming only one candidate key, which is thus the primary key): A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is irreducibly dependent on the primary key.

Once you have your database in 1NF, you’ve finished the hard part. The other normal forms are important, but not as easy to break or address problems that aren’t as common. An important thing to remember: if the database isn’t in 1NF, it can’t, no way no how, be in 2NF either. A database that is in a higher normal form is also in all of the lower normal forms.

The point of 2NF is to avoid the following:

Bad
ID First_Name Last_Name Spouses_Birthday
3 John Smith 02/07/65
4 Sue Jones 07/26/76
The problem is that the last column relates not to John Smith, but to another person entirely. John has only one spouse, who only has one birthday…but that’s too tenuous a relationship to embed into a single table. Rather, you would want to structure your tables as follows:

Good
Table: People
ID* First_Name Last_Name Birthday
3 John Smith 08/01/62
4 Sue Jones 10/12/75
5 Mary Smith 02/07/65
Table: Spouses
Husband*->People:ID Wife*->People:ID
3 5
*Primary key
This arrangement allows you to know not only John’s wife’s birthday, but everything else you might want to know about her. If they should get divorced or if they re-marry, you only have to change the Spouses table to reflect the changes.

As a side note: the Spouses table, as it’s set up, permits polygamy and polyandry. There are many societies around the globe where this happens regularly, so you might want to leave things this way. If you’re sure that nobody in the database will have more than one spouse, you should specify that each column is a candidate key; that way, only one person can appear in each column. Further constraints and logic could ensure that only men got listed as husbands and women as wives, but that starts to get complex….

Third Normal Form

(definition asusming only one candidate key, which is thus the primary key): A relation is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key.

3NF takes the process started in 2NF to its logical conclusion. Columns in a table should have information that is intrinsically a part of the nature of the primary key.

Consider, for example, a company’s ordering and shipping database. If only complete orders ever get shipped, you’d want to include the shipment date with the orders table. If partial orders can be shipped as items arrive, the shipment date belongs with the item detais table. Which of the two choices puts the database in 3NF depends on the underlying business model.

Fourth Normal Form

Relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the (nontrivial) MVD A->>B is satisfied, then all attributes of R are also functionally dependent on A.

The process of reducing interdependencies between columns (not tables) continues with 4NF. Assuming every teacher uses the same books when teaching the same course, the following table is not in 4NF:

Bad
Course Teacher Textbook
Physics Jones Basic Mechanics
Physics Jones Principles of Optics
Physics Smith Basic Mechanics
Physics Smith Principles of Optics
Math Jones Basic Mechanics
Math Jones Vector Analysis
Math Jones Trigonometry
Looking at it intuitively, there’s a lot of duplication of information. If the school decided to drop Basic Mechanics in favor of Introduction to Mechanics, you’d have a lot of work to update everything.

The solution, again, is to break the table in two:

Good
Table: Course-Teacher
Course Teacher
Physics Jones
Physics Smith
Math Jones
Table: Course-Text
Course Text
Physics Basic Mechanics
Physics Principles of Optics
Math Basic Mechanics
Math Vector Analysis
Math Trigonometry
Naturally, you’d be using foreign keys for most everything—but I’m sure you don’t want to be overwhelmed with an entirely new, large database for one example.

Breaking up the table allows you to worry just about which teacher is teaching which coures, and which books are available to each course.

If different teachers use different books when teaching different courses, then the underlying assumption (a few paragraphs ago) is invalid, and so is this solution. It all depends on what’s going on in the “real world.”

Fifth Normal Form

A relation R is in 5NF—also called projection-join normal form (PJ/NF)—if and only if every join dependency in R is implied by the candidate keys of R.

You might notice that the solution to almost every normalization problem involves breaking up the offending table into two or more parts. The logical conclusion to this process is to break everything up into the smallest possible parts—in other words, you’d have tables with nothing other than primary keys and foreign keys. (That’s an oversimplification, but not much of one.) While 5NF solves nearly all theoretical problems, it introduces many practical ones. It’s good to know that it’s an option to solving problems, but you should think very carefully before putting an entire database in 5NF. Sometimes, 5NF is appropriate for portions of a database…but usually only for those who really know what they’re doing.

Le Fin

While 5NF may be an extreme, it does suggest a good rule of thumb to ensure your database doesn’t suffer from these kinds of problems: “When in doubt, break it up.”

If your database isn’t in 1NF—and almost every database I’ve seen an amateur design hasn’t been—you’re in for plenty of trouble and headaches. You’re likely to spend as much time maintaining the database as you will actually using it. A database that isn’t in 1NF might actually be worse than a card catalog: my experience has shown me that people who don’t know enough about database design to at least put the database in 1NF also don’t know how to back up and archive data; if you don’t keep good backups, your computer data is much more fragile and prone to loss and corruption than paper records kept in a labeled filing cabinet.

Once you get over the hurdle of 1NF, using the same basic kind of intuitive thinking is likely to help you put the database in 4NF with relative ease. With many simple databases, the process of putting them in 1NF incidentally also puts them in 4NF—but don’t rely on that generalization to think you don’t need to use a critical eye before declaring a database design finished.

For that matter, very few databases are ever done in their design. The world changes, and so does the data about it. Assumptions that once were valid are now invalid; things we take for granted were impossible a generation ago. After the database designer is done, the database administrator must keep a keen watch over the database and make sure that everything keeps working smoothly, tweaking things as time passes.

NormalForm in DBMS

A Simple Guide to Five Normal Forms in Relational Database Theory

> 1 INTRODUCTION . . . 2
> 2 FIRST NORMAL FORM . . . 2
> 3 SECOND AND THIRD NORMAL FORMS . . . 2
>> 3.1 Second Normal Form . . . 2
>> 3.2 Third Normal Form . . . 3
>> 3.3 Functional Dependencies . . . 4
> 4 FOURTH AND FIFTH NORMAL FORMS . . . 5
>> 4.1 Fourth Normal Form . . . 6
>>> 4.1.1 Independence . . . 8
>>> 4.1.2 Multivalued Dependencies . . . 9
>> 4.2 Fifth Normal Form . . . 9
> 5 UNAVOIDABLE REDUNDANCIES . . . 12
> 6 INTER-RECORD REDUNDANCY . . . 13
> 7 CONCLUSION . . . 13
> 8 ACKNOWLEDGMENT . . . 14
> 9 REFERENCES . . . 14

1 INTRODUCTION

The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. Our presentation conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date [4].

The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.

2 FIRST NORMAL FORM

First normal form [1] deals with the "shape" of a record type.

Under first normal form, all occurrences of a record type must contain the same number of fields.

First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.

3 SECOND AND THIRD NORMAL FORMS

Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.

Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form.

We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee. Thus the phrase "Y is a fact about X" signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.

3.1 Second Normal Form

Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:

---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------
The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:

The warehouse address is repeated in every record that refers to a part stored in that warehouse.
If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.
Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.
If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse's address.
To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records:

------------------------------- ---------------------------------
| PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |
====================----------- =============--------------------
When a data design is changed in this way, replacing unnormalized records with normalized records, the process is referred to as normalization. The term "normalization" is sometimes used relative to a particular normal form. Thus a set of records may be normalized with respect to second normal form but not with respect to third.

The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the unnormalized form, the application searches one record type. With the normalized design, the application has to search two record types, and connect the appropriate pairs.

3.2 Third Normal Form

Third normal form is violated when a non-key field is a fact about another non-key field, as in

------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------
The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form:

The department's location is repeated in the record of every employee assigned to that department.
If the location of the department changes, every such record must be updated.
Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department.
If a department has no employees, there may be no record in which to keep the department's location.
To satisfy third normal form, the record shown above should be decomposed into the two records:

------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else.

3.3 Functional Dependencies

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can't be two records having the same X value.

There is a slight technical difference between functional dependencies and single-valued facts as we have presented them. Functional dependencies only exist when the things involved have unique and singular identifiers (representations). For example, suppose a person's address is a single-valued fact, i.e., a person has only one address. If we don't provide unique identifiers for people, then there will not be a functional dependency in the data:

----------------------------------------------
| PERSON | ADDRESS |
-------------+--------------------------------
| John Smith | 123 Main St., New York |
| John Smith | 321 Center St., San Francisco |
----------------------------------------------
Although each person has a unique address, a given name can appear with several different addresses. Hence we do not have a functional dependency corresponding to our single-valued fact.

Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case the same person appears to be living at two different addresses, again precluding a functional dependency.

---------------------------------------
| PERSON | ADDRESS |
-------------+-------------------------
| John Smith | 123 Main St., New York |
| John Smith | 123 Main Street, NYC |
---------------------------------------
We are not defending the use of non-unique or non-singular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.

For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.

----------------------------------------------------------
| EMPLOYEE | FATHER | FATHER'S-ADDRESS |
|============------------+-------------------------------|
| Art Smith | John Smith | 123 Main St., New York |
| Bob Smith | John Smith | 123 Main Street, NYC |
| Cal Smith | John Smith | 321 Center St., San Francisco |
----------------------------------------------------------
However, in formal terms, there is no functional dependency here between FATHER'S-ADDRESS and FATHER, and hence no violation of third normal form.

4 FOURTH AND FIFTH NORMAL FORMS

Fourth [5] and fifth [6] normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). By "many-to-many" we mean that an employee may have several skills, and a skill may belong to several employees.

Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.

In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.

4.1 Fourth Normal Form

Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.

The term "independent" will be discussed after considering an example.

Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================
Instead, they should be represented in the two records

-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================
Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.

The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record:

(1) A disjoint format, in which a record contains either a skill or a language, but not both:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | |
| Smith | type | |
| Smith | | French |
| Smith | | German |
| Smith | | Greek |
-------------------------------
This is not much different from maintaining two separate record types. (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.)

(2) A random mix, with three variations:

(a) Minimal number of records, with repetitions:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
(b) Minimal number of records, with null values:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | | Greek |
-------------------------------
(c) Unrestricted:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | |
| Smith | | German |
| Smith | type | Greek |
-------------------------------
(3) A "cross-product" form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | cook | German |
| Smith | cook | Greek |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy:

If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent.
Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.
Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn't leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn't also been deleted.
Fourth normal form minimizes such update problems.

4.1.1 Independence

We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example. The two many-to-many relationships, employee:skill and employee:language, are "independent" in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That's precisely why all the maintenance policies mentioned earlier can be allowed.

In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct:

-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
Thus the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, then it is acceptable to represent them in a single record.

4.1.2 Multivalued Dependencies

For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the "cross-product" maintenance policy mentioned above. That is, for our example, every one of an employee's skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no "information" in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent.

We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:

An employee uses certain skills on certain projects.
An employee uses certain languages on certain projects.
If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where "EP" represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.

4.2 Fifth Normal Form

Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.

We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields:

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | GM | truck |
-----------------------------
This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we need the combination of three fields to know which combinations are valid and which are not.

But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
-----------------------------
In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:

------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car |
| Smith | GM | | Ford | truck | | Smith | truck |
| Jones | Ford | | GM | car | | Jones | car |
------------------- | GM | truck | -------------------
---------------------
These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.

Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.

Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form.

One advantage of fifth normal form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized form it may be repeated many times.

It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences. This is not apparent when there are only a few facts to record, as in the example shown above. The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the unnormalized file increases in a multiplicative fashion. For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x+y new records to the normalized form, but xy new records to the unnormalized form.

It should be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks. But we can't determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all.

The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will imply something untrue.

-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
| Jones | Ford | truck |
| Brown | Ford | car |
| Brown | GM | car |
| Brown | Totota | car |
| Brown | Totota | bus |
-----------------------------
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car | Fifth
| Smith | GM | | Ford | truck | | Smith | truck | Normal
| Jones | Ford | | GM | car | | Jones | car | Form
| Brown | Ford | | GM | truck | | Jones | truck |
| Brown | GM | | Toyota | car | | Brown | car |
| Brown | Toyota | | Toyota | bus | | Brown | bus |
------------------- --------------------- -------------------
Observe that:

Jones sells cars and GM makes cars, but Jones does not represent GM.
Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
Brown represents Ford and Brown sells buses, but Ford does not make buses.
Fourth and fifth normal forms both deal with combinations of multivalued facts. One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multivalued facts, it only recognizes them in pairwise groups. We can best explain this in terms of the normalization process implied by fourth normal form. If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of these violating fourth normal form is again decomposed into two records, and so on until the resulting records are all in fourth normal form. At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.

In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case.

5 UNAVOIDABLE REDUNDANCIES

Normalization certainly doesn't remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent. In the example shown Section 4.1.1, it seems unavoidable that we record the fact that "Smith can type" several times. Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that "Smith sells cars" several times.

6 INTER-RECORD REDUNDANCY

The normal forms discussed here deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the "ultimate" normal form with respect to such redundanciesæ.

Other redundancies can occur across multiple record types. For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy:

------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
-----------------------
| EMPLOYEE | LOCATION |
============-----------
In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy.

Inter-record redundancy has been recognized for some time [1], and has recently been addressed in terms of normal forms and normalization [8].

7 CONCLUSION

While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed:

Single-valued vs. multi-valued facts.
Dependency on the entire key.
Independent vs. dependent facts.
The presence of mutual constraints.
The presence of non-unique or non-singular representations.
And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications.

8 ACKNOWLEDGMENTS

I am very grateful to Ted Codd and Ron Fagin for reading earlier drafts and making valuable comments, and especially to Chris Date for helping clarify some key points.