Introduction
Part of the formal mathematical definition of a relational database includes the concept of “Normal Forms.” Each of the Forms are designed to logically address potential problems in referring to and working with information stored in a database. A database is said to be in one of the Normal Forms if it satisfies the rules required by that Form; it also will not suffer from any of the problems addressed by the Form.
Because this will be a rather informal introduction, it will not be precise enough for a purist. If you want stricter definitions, you should read An Introduction to Database Systems by C.J. Date (Addison-Wesley, publisher). Be forewarned: it’s not easy or fun reading, but it is thorough, accurate, and precise. My discussions of the particular normal forms start with Mr. Date’s definitions.
Relational Databases
Permit me to do a bit of hand-waving, and skip a definition of a relational database; instead, I’ll just give a quick description.
A relational database is made of multiple tables; each table visually resembles a small spreadsheet (but there are no cells with calculations). Each column will have the same kind of information (name, address, etc.) for each row; each row will have the specific information for one real-world entity (person). A key column in each table will provide information on how it relates to other tables.
First Normal Form
A relation is in 1NF if and only if all underlying domains contain scalar values only.
1NF is often refered to the atomic rule. “Atom” comes from a Greek word that meant, essentially, the smallest possible piece of something. Before Rutherford did his experiments and made his famous quote about cannonballs bouncing off tissue paper, what we now call atoms were thought to be the smallest possible piece of a given element.
In a database, this means that each column should only be designed to hold one and only one piece of information. Consider the following examples:
Bad Good
Name Address
Ben Goren Post Office Box 964, Tempe, Arizona 85280-0964
Jane Doe 1234 Main Street, Mesa, Arizona 85345
First_Name Last_Name Street_Address City State ZIP
Ben Goren Post Office Box 964 Tempe Arizona 85280-0964
Jane Doe 1234 Main Street Mesa Arizona 85345
In the bad example, it’s impossible to pull an alphabetical list of people by last name; send a form letter that says, “Dear Ben,” in the salutation but address the envelope to “Mr. Goren,” etc. Similarly, you can’t put the envelopes in ZIP code order, which will save signifantly on bulk mail postage rates. You also can’t get a list of everybody who lives in Mesa but not Tempe. In the good example, all those are possible.
The good example above only works if I have a single address. But I don’t: I don’t live in a small box at Fifth and Mill in Tempe; I live in an apartment which only feels like it’s that small. What to do?
Bad
First_Name Last_Name Street_Addresses City State ZIP
Ben Goren Post Office Box 964
1331 West Third Street Apartment 2 Tempe
Tempe Arizona
Arizona 85280-0964
85281
Jane Doe 1234 Main Street Mesa Arizona 85345
Again, we’ve got more than one value in a cell. What if I want to sit right down and write myself a letter? How do I print only one address?
Incidentally, the United States Post Office considers apartnment numbers, suite numbers, etc., to be part of the street address; they should all appear on one line as a single unit.
The most common—but still incorrect—solution is to add columns:
Bad
First_Name Last_Name Mailing_Street_Address Mailing_City Mailing_State Mailing_ZIP Residence_Street_Address Residence_City Residence_State Residence_ZIP
Ben Goren Post Office Box 964 Tempe Arizona 85280-0964 1331 West 3rd Street Apartment 2 Tempe Arizona 85281
Jane Doe 1234 Main Street Mesa Arizona 85345
On its face, this seems to solve the problem—but only for a moment. First, Jane only has one address; providing space for her to have a mailing address is wasteful and will likely hurt database performance. But, more importantly…I don’t just have two addresses, I have three. Because I’m often not home, but my parents are, I use thier address as a shipping address. You could add a third column to the table—but what about the person who also has a business address, a summer home address, a French Riviera address, a girlfriend’s-place-where-he-spends-all-his-time address, and so on? You can keep adding address columns…but it’s cumbersome and prone to breaking.
Notice that the grey part above masks an area with an outline that isn’t a rectangle? If a line drawn around the data in your table isn’t a rectangle, or if the table has holes in it, it’s probably not in 1NF.
Nulls
That last paragraph deserves a bit of further explanation: there are occasions where it’s okay to have holes in your data. For example, it would be quite reasonable to include a column for a person’s middle name, as well as first and last names. However, not everybody—I, for example—has a middle name, In this case, you would store a special value for my middle name in the dabase, called a null. This isn’t the word “null,” but rather something that means, essentially, “the value doesn’t exist or is unknown or unknowable.” If you remember your set theory from algebra, a null in a database is the same thing as the empty set. Just as {0} and {} are different, so are “null” and the null value.
Null values are not only permissible but necessary—I’d be pretty darn upset if you made up a middle name for me—but, when you have them, especially a lot of them, in your database, that should be a signal that you need to make certain that the database is in 1NF.
Keys
Before I can discuss further implications of 1NF, I must first discuss the topic of keys.
Elsewhere in the pedantic discussion of databases is the concept of keys and the uniqueness of rows in a database. The idea is that you can’t have duplicate entries in any table; if you did, which—for example—Ben Goren should you update when my address changes? If you only updated one, you’d now have me listed multiple times with different addresses…and which one is current?
To solve that problem, there’s the idea of keys, especially primary keys. A key is a column or combination of columns which will always be unique. The primary key is the one column or combination of columns which is used to identify a given row. Usually, there’s only one column that qualifies as a key, and therefore becomes the primary key by default.
Very often, a column is created solely for the purpose of serving as the primary key. You’re familiar with the idea of getting an order number when you place an order over the phone; you know that you can give that number back to somebody at the company and they’ll know that you mean the order you placed on Wednesday, not Monday. They’ll also know that it’s your order, not your sister’s who lives at the same address.
The table above will need such a column. A quick perusal of the phone book will reveal scores of “John Smiths.” It’s even concievable that John Smith could have a roommate also named John Smith, so you can’t use the combination of name and address as a key. You can think of some unique way of telling the two apart—even if they were born at the same time, they weren’t born in the exact same place, for example—but such information might not be available. Further, having a small number of columns as your primary key will often provide better performance than having a large number of columns.
Most database programs will have a means of identifying more than one key in a given table. This would be done when you want to enforce a requirement that those keyed columns always have unique values for every row. For example, a database version of the periodic table of the elements would have keys on the name, symbol, atomic number, etc., as those values are unique for each element. In this example, which column(s) you chose for your primary key would be part of the art of database design.
Null values are not permissible in a key column.
Keys are also used to relate information from one table to another—but more on that later.
First Normal Form, Continued
The solution to the problem of multiple addresses is to break the information into two tables:
Good
Table: People
ID* First_Name Last_Name
1 Ben Goren
2 Jane Doe
Table: Addresses
Person*->People:ID Kind* Address City State ZIP
1 Mailing Post Office Box 964 Tempe Arizona 85280-0964
1 Residence 1331 West Third Street Apartment 2 Tempe Arizona 85281
2 Mailing 1234 Main Street Mesa Arizona 85345
*Primary key
In the People table, we now have an artifical ID column that serves as the primary key. We’ll let the database program worry about creating unique values for each person; in my examples, I’ll just number them sequentially.
The Addresses table starts with what’s called a foreign key. The values in the person column have a matching value in the ID column of the People table. If you want to know the name of the person who owns the first address, you look for the person who has ID 1 in the People table; same for the sedcond address. The third address belongs to the person with ID 2.
The Addresses table uses a composite primary key. Because a person might have more than one address, the Person column isn’t enough to guarantee uniqueness. Similarly, more than one person has a mailing address, so we can’t use the Kind column. However, each person has only one of each possible kind of address, so the two columns together can function as the primary key.
Relationships
This pair of tables demonstrates what’s called a “one-to-many relationship.” Each one person may (or may not) have many addresses, but each address belongs to one and only one person. If two people lived at the same address, you’d have two lines that were otherwise identical except for the Person column.
There are two other kinds of relationships: one-to-one and many-to-many. Because each person has only one birthdate, you could create a separate table with the person’s ID (from the People table) and their birthdate. This would be a one-to-one relationship. It would be equally valid to include this information in the People table, but you might choose not to do so for logical or performance reasons (again, this is where the art comes in).
A book might have more than one author, and an author might write more than one book; this is a many-to-many relationsihp. To work with such, you must have not two, but three tables:
Good
Table: Authors
ID* First_Name Last_Name
1 John Doe
2 Jane Smith
3 Judith Brown
4 Bill Gates
Table: Books
ID* Book
1 Database Design
2 Learning Spreadsheets
3 Basic Wordprocessing
4 World Conquest
Table: Authors-Books
Author*->Authors:ID Book*->Books:ID
1 1
2 1
3 1
2 2
3 2
1 3
2 3
4 4
*Primary key
The above trio of tables says that John, Jane, and Judith all collaborated to write Database Design; Jane and Judith together wrote Learning Spreadsheets; John and Jane worked with each other on Basic Wordprocessing; and Bill wrote World Conquest all by himself.
Queries
While database design is often about breaking up information, acutally using the information means putting it back together again. To pull information from a database is to query it; the bit of computer code used to pull the information is called a query.
In a database with a pretty visual interface, such as Microsoft Access, queries can often be created simply by pointing and clicking. Most large-scale databases use a computer language called Structured Query Language, or SQL, for the task.
A simple query against a database with the addressbook tables above might look something like this:
SELECT Last_Name, First_Name FROM People
The result might look like this:
Goren Ben
Doe Jane
The power of relational databases comes in joining two tables together in a query. The following bit of code lists every address for every person in the database and prints the results in alphabetical order by last name:
SELECT People.First_Name, People.Last_Name, Addresses.Address, Addresses.City, Addresses.State, Addresses.ZIP FROM People, Addresses WHERE People.ID = Addresses.Person ORDER BY People.Last_Name
The result will look like the following:
Ben Goren Post Office Box 964 Tempe Arizona 85280-0964
Ben Goren 1331 West Third Street Apartment 2 Tempe Arizona 85281
Jane Doe 1234 Main Street Mesa Arizona 85345
Notice the two lines for me? This isn’t in any normal form, but that’s okay—this is what the database is generating, not what the database is storing. This is exactly the way you want things to work: the database stores the information in an efficient manner (multiple tables), but, for humans, it presents it to you on demand in any manner you want, including redunant ones.
Much of SQL is beyond the scope of this document; it can do amazing things, but you don’t need to know how to do all those amazing things to understand how a database works. I’ll leave you with one last example, which will select only those books that Judith Brown wrote in the above author/books example:
SELECT Authors.First_Name, Authors.Last_Name, Books.Book FROM Authors, Books WHERE Authors.First_Name = "Judith" AND Authors.Last_Name = "Brown" AND Authors.ID = Authors-Books.Author
The result:
Judith Brown Database Design
Judith Brown Learning Spreadsheets
Second Normal Form
(definition asusming only one candidate key, which is thus the primary key): A relation is in 2NF if and only if it is in 1NF and every nonkey attribute is irreducibly dependent on the primary key.
Once you have your database in 1NF, you’ve finished the hard part. The other normal forms are important, but not as easy to break or address problems that aren’t as common. An important thing to remember: if the database isn’t in 1NF, it can’t, no way no how, be in 2NF either. A database that is in a higher normal form is also in all of the lower normal forms.
The point of 2NF is to avoid the following:
Bad
ID First_Name Last_Name Spouses_Birthday
3 John Smith 02/07/65
4 Sue Jones 07/26/76
The problem is that the last column relates not to John Smith, but to another person entirely. John has only one spouse, who only has one birthday…but that’s too tenuous a relationship to embed into a single table. Rather, you would want to structure your tables as follows:
Good
Table: People
ID* First_Name Last_Name Birthday
3 John Smith 08/01/62
4 Sue Jones 10/12/75
5 Mary Smith 02/07/65
Table: Spouses
Husband*->People:ID Wife*->People:ID
3 5
*Primary key
This arrangement allows you to know not only John’s wife’s birthday, but everything else you might want to know about her. If they should get divorced or if they re-marry, you only have to change the Spouses table to reflect the changes.
As a side note: the Spouses table, as it’s set up, permits polygamy and polyandry. There are many societies around the globe where this happens regularly, so you might want to leave things this way. If you’re sure that nobody in the database will have more than one spouse, you should specify that each column is a candidate key; that way, only one person can appear in each column. Further constraints and logic could ensure that only men got listed as husbands and women as wives, but that starts to get complex….
Third Normal Form
(definition asusming only one candidate key, which is thus the primary key): A relation is in 3NF if and only if it is in 2NF and every nonkey attribute is nontransitively dependent on the primary key.
3NF takes the process started in 2NF to its logical conclusion. Columns in a table should have information that is intrinsically a part of the nature of the primary key.
Consider, for example, a company’s ordering and shipping database. If only complete orders ever get shipped, you’d want to include the shipment date with the orders table. If partial orders can be shipped as items arrive, the shipment date belongs with the item detais table. Which of the two choices puts the database in 3NF depends on the underlying business model.
Fourth Normal Form
Relation R is in 4NF if and only if, whenever there exist subsets A and B of the attributes of R such that the (nontrivial) MVD A->>B is satisfied, then all attributes of R are also functionally dependent on A.
The process of reducing interdependencies between columns (not tables) continues with 4NF. Assuming every teacher uses the same books when teaching the same course, the following table is not in 4NF:
Bad
Course Teacher Textbook
Physics Jones Basic Mechanics
Physics Jones Principles of Optics
Physics Smith Basic Mechanics
Physics Smith Principles of Optics
Math Jones Basic Mechanics
Math Jones Vector Analysis
Math Jones Trigonometry
Looking at it intuitively, there’s a lot of duplication of information. If the school decided to drop Basic Mechanics in favor of Introduction to Mechanics, you’d have a lot of work to update everything.
The solution, again, is to break the table in two:
Good
Table: Course-Teacher
Course Teacher
Physics Jones
Physics Smith
Math Jones
Table: Course-Text
Course Text
Physics Basic Mechanics
Physics Principles of Optics
Math Basic Mechanics
Math Vector Analysis
Math Trigonometry
Naturally, you’d be using foreign keys for most everything—but I’m sure you don’t want to be overwhelmed with an entirely new, large database for one example.
Breaking up the table allows you to worry just about which teacher is teaching which coures, and which books are available to each course.
If different teachers use different books when teaching different courses, then the underlying assumption (a few paragraphs ago) is invalid, and so is this solution. It all depends on what’s going on in the “real world.”
Fifth Normal Form
A relation R is in 5NF—also called projection-join normal form (PJ/NF)—if and only if every join dependency in R is implied by the candidate keys of R.
You might notice that the solution to almost every normalization problem involves breaking up the offending table into two or more parts. The logical conclusion to this process is to break everything up into the smallest possible parts—in other words, you’d have tables with nothing other than primary keys and foreign keys. (That’s an oversimplification, but not much of one.) While 5NF solves nearly all theoretical problems, it introduces many practical ones. It’s good to know that it’s an option to solving problems, but you should think very carefully before putting an entire database in 5NF. Sometimes, 5NF is appropriate for portions of a database…but usually only for those who really know what they’re doing.
Le Fin
While 5NF may be an extreme, it does suggest a good rule of thumb to ensure your database doesn’t suffer from these kinds of problems: “When in doubt, break it up.”
If your database isn’t in 1NF—and almost every database I’ve seen an amateur design hasn’t been—you’re in for plenty of trouble and headaches. You’re likely to spend as much time maintaining the database as you will actually using it. A database that isn’t in 1NF might actually be worse than a card catalog: my experience has shown me that people who don’t know enough about database design to at least put the database in 1NF also don’t know how to back up and archive data; if you don’t keep good backups, your computer data is much more fragile and prone to loss and corruption than paper records kept in a labeled filing cabinet.
Once you get over the hurdle of 1NF, using the same basic kind of intuitive thinking is likely to help you put the database in 4NF with relative ease. With many simple databases, the process of putting them in 1NF incidentally also puts them in 4NF—but don’t rely on that generalization to think you don’t need to use a critical eye before declaring a database design finished.
For that matter, very few databases are ever done in their design. The world changes, and so does the data about it. Assumptions that once were valid are now invalid; things we take for granted were impossible a generation ago. After the database designer is done, the database administrator must keep a keen watch over the database and make sure that everything keeps working smoothly, tweaking things as time passes.
Tuesday, May 4, 2010
NormalForm in DBMS
A Simple Guide to Five Normal Forms in Relational Database Theory
> 1 INTRODUCTION . . . 2
> 2 FIRST NORMAL FORM . . . 2
> 3 SECOND AND THIRD NORMAL FORMS . . . 2
>> 3.1 Second Normal Form . . . 2
>> 3.2 Third Normal Form . . . 3
>> 3.3 Functional Dependencies . . . 4
> 4 FOURTH AND FIFTH NORMAL FORMS . . . 5
>> 4.1 Fourth Normal Form . . . 6
>>> 4.1.1 Independence . . . 8
>>> 4.1.2 Multivalued Dependencies . . . 9
>> 4.2 Fifth Normal Form . . . 9
> 5 UNAVOIDABLE REDUNDANCIES . . . 12
> 6 INTER-RECORD REDUNDANCY . . . 13
> 7 CONCLUSION . . . 13
> 8 ACKNOWLEDGMENT . . . 14
> 9 REFERENCES . . . 14
1 INTRODUCTION
The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. Our presentation conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date [4].
The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.
2 FIRST NORMAL FORM
First normal form [1] deals with the "shape" of a record type.
Under first normal form, all occurrences of a record type must contain the same number of fields.
First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.
3 SECOND AND THIRD NORMAL FORMS
Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.
Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form.
We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee. Thus the phrase "Y is a fact about X" signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.
3.1 Second Normal Form
Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:
---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------
The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:
The warehouse address is repeated in every record that refers to a part stored in that warehouse.
If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.
Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.
If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse's address.
To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records:
------------------------------- ---------------------------------
| PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |
====================----------- =============--------------------
When a data design is changed in this way, replacing unnormalized records with normalized records, the process is referred to as normalization. The term "normalization" is sometimes used relative to a particular normal form. Thus a set of records may be normalized with respect to second normal form but not with respect to third.
The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the unnormalized form, the application searches one record type. With the normalized design, the application has to search two record types, and connect the appropriate pairs.
3.2 Third Normal Form
Third normal form is violated when a non-key field is a fact about another non-key field, as in
------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------
The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form:
The department's location is repeated in the record of every employee assigned to that department.
If the location of the department changes, every such record must be updated.
Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department.
If a department has no employees, there may be no record in which to keep the department's location.
To satisfy third normal form, the record shown above should be decomposed into the two records:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else.
3.3 Functional Dependencies
In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can't be two records having the same X value.
There is a slight technical difference between functional dependencies and single-valued facts as we have presented them. Functional dependencies only exist when the things involved have unique and singular identifiers (representations). For example, suppose a person's address is a single-valued fact, i.e., a person has only one address. If we don't provide unique identifiers for people, then there will not be a functional dependency in the data:
----------------------------------------------
| PERSON | ADDRESS |
-------------+--------------------------------
| John Smith | 123 Main St., New York |
| John Smith | 321 Center St., San Francisco |
----------------------------------------------
Although each person has a unique address, a given name can appear with several different addresses. Hence we do not have a functional dependency corresponding to our single-valued fact.
Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case the same person appears to be living at two different addresses, again precluding a functional dependency.
---------------------------------------
| PERSON | ADDRESS |
-------------+-------------------------
| John Smith | 123 Main St., New York |
| John Smith | 123 Main Street, NYC |
---------------------------------------
We are not defending the use of non-unique or non-singular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.
For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.
----------------------------------------------------------
| EMPLOYEE | FATHER | FATHER'S-ADDRESS |
|============------------+-------------------------------|
| Art Smith | John Smith | 123 Main St., New York |
| Bob Smith | John Smith | 123 Main Street, NYC |
| Cal Smith | John Smith | 321 Center St., San Francisco |
----------------------------------------------------------
However, in formal terms, there is no functional dependency here between FATHER'S-ADDRESS and FATHER, and hence no violation of third normal form.
4 FOURTH AND FIFTH NORMAL FORMS
Fourth [5] and fifth [6] normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). By "many-to-many" we mean that an employee may have several skills, and a skill may belong to several employees.
Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.
In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.
4.1 Fourth Normal Form
Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.
The term "independent" will be discussed after considering an example.
Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================
Instead, they should be represented in the two records
-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================
Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.
The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record:
(1) A disjoint format, in which a record contains either a skill or a language, but not both:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | |
| Smith | type | |
| Smith | | French |
| Smith | | German |
| Smith | | Greek |
-------------------------------
This is not much different from maintaining two separate record types. (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.)
(2) A random mix, with three variations:
(a) Minimal number of records, with repetitions:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
(b) Minimal number of records, with null values:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | | Greek |
-------------------------------
(c) Unrestricted:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | |
| Smith | | German |
| Smith | type | Greek |
-------------------------------
(3) A "cross-product" form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | cook | German |
| Smith | cook | Greek |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy:
If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent.
Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.
Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn't leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn't also been deleted.
Fourth normal form minimizes such update problems.
4.1.1 Independence
We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example. The two many-to-many relationships, employee:skill and employee:language, are "independent" in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That's precisely why all the maintenance policies mentioned earlier can be allowed.
In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
Thus the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, then it is acceptable to represent them in a single record.
4.1.2 Multivalued Dependencies
For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the "cross-product" maintenance policy mentioned above. That is, for our example, every one of an employee's skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no "information" in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent.
We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:
An employee uses certain skills on certain projects.
An employee uses certain languages on certain projects.
If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where "EP" represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.
4.2 Fifth Normal Form
Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.
We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields:
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | GM | truck |
-----------------------------
This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we need the combination of three fields to know which combinations are valid and which are not.
But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
-----------------------------
In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car |
| Smith | GM | | Ford | truck | | Smith | truck |
| Jones | Ford | | GM | car | | Jones | car |
------------------- | GM | truck | -------------------
---------------------
These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.
Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.
Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form.
One advantage of fifth normal form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized form it may be repeated many times.
It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences. This is not apparent when there are only a few facts to record, as in the example shown above. The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the unnormalized file increases in a multiplicative fashion. For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x+y new records to the normalized form, but xy new records to the unnormalized form.
It should be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks. But we can't determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all.
The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will imply something untrue.
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
| Jones | Ford | truck |
| Brown | Ford | car |
| Brown | GM | car |
| Brown | Totota | car |
| Brown | Totota | bus |
-----------------------------
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car | Fifth
| Smith | GM | | Ford | truck | | Smith | truck | Normal
| Jones | Ford | | GM | car | | Jones | car | Form
| Brown | Ford | | GM | truck | | Jones | truck |
| Brown | GM | | Toyota | car | | Brown | car |
| Brown | Toyota | | Toyota | bus | | Brown | bus |
------------------- --------------------- -------------------
Observe that:
Jones sells cars and GM makes cars, but Jones does not represent GM.
Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
Brown represents Ford and Brown sells buses, but Ford does not make buses.
Fourth and fifth normal forms both deal with combinations of multivalued facts. One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multivalued facts, it only recognizes them in pairwise groups. We can best explain this in terms of the normalization process implied by fourth normal form. If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of these violating fourth normal form is again decomposed into two records, and so on until the resulting records are all in fourth normal form. At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.
In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case.
5 UNAVOIDABLE REDUNDANCIES
Normalization certainly doesn't remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent. In the example shown Section 4.1.1, it seems unavoidable that we record the fact that "Smith can type" several times. Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that "Smith sells cars" several times.
6 INTER-RECORD REDUNDANCY
The normal forms discussed here deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the "ultimate" normal form with respect to such redundanciesæ.
Other redundancies can occur across multiple record types. For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
-----------------------
| EMPLOYEE | LOCATION |
============-----------
In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy.
Inter-record redundancy has been recognized for some time [1], and has recently been addressed in terms of normal forms and normalization [8].
7 CONCLUSION
While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed:
Single-valued vs. multi-valued facts.
Dependency on the entire key.
Independent vs. dependent facts.
The presence of mutual constraints.
The presence of non-unique or non-singular representations.
And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications.
8 ACKNOWLEDGMENTS
I am very grateful to Ted Codd and Ron Fagin for reading earlier drafts and making valuable comments, and especially to Chris Date for helping clarify some key points.
> 1 INTRODUCTION . . . 2
> 2 FIRST NORMAL FORM . . . 2
> 3 SECOND AND THIRD NORMAL FORMS . . . 2
>> 3.1 Second Normal Form . . . 2
>> 3.2 Third Normal Form . . . 3
>> 3.3 Functional Dependencies . . . 4
> 4 FOURTH AND FIFTH NORMAL FORMS . . . 5
>> 4.1 Fourth Normal Form . . . 6
>>> 4.1.1 Independence . . . 8
>>> 4.1.2 Multivalued Dependencies . . . 9
>> 4.2 Fifth Normal Form . . . 9
> 5 UNAVOIDABLE REDUNDANCIES . . . 12
> 6 INTER-RECORD REDUNDANCY . . . 13
> 7 CONCLUSION . . . 13
> 8 ACKNOWLEDGMENT . . . 14
> 9 REFERENCES . . . 14
1 INTRODUCTION
The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. Our presentation conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date [4].
The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.
2 FIRST NORMAL FORM
First normal form [1] deals with the "shape" of a record type.
Under first normal form, all occurrences of a record type must contain the same number of fields.
First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.
3 SECOND AND THIRD NORMAL FORMS
Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.
Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form.
We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee. Thus the phrase "Y is a fact about X" signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.
3.1 Second Normal Form
Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:
---------------------------------------------------
| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |
====================-------------------------------
The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:
The warehouse address is repeated in every record that refers to a part stored in that warehouse.
If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.
Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.
If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse's address.
To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records:
------------------------------- ---------------------------------
| PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |
====================----------- =============--------------------
When a data design is changed in this way, replacing unnormalized records with normalized records, the process is referred to as normalization. The term "normalization" is sometimes used relative to a particular normal form. Thus a set of records may be normalized with respect to second normal form but not with respect to third.
The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the unnormalized form, the application searches one record type. With the normalized design, the application has to search two record types, and connect the appropriate pairs.
3.2 Third Normal Form
Third normal form is violated when a non-key field is a fact about another non-key field, as in
------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------
The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form:
The department's location is repeated in the record of every employee assigned to that department.
If the location of the department changes, every such record must be updated.
Because of the redundancy, the data might become inconsistent, with different records showing different locations for the same department.
If a department has no employees, there may be no record in which to keep the department's location.
To satisfy third normal form, the record shown above should be decomposed into the two records:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else.
3.3 Functional Dependencies
In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can't be two records having the same X value.
There is a slight technical difference between functional dependencies and single-valued facts as we have presented them. Functional dependencies only exist when the things involved have unique and singular identifiers (representations). For example, suppose a person's address is a single-valued fact, i.e., a person has only one address. If we don't provide unique identifiers for people, then there will not be a functional dependency in the data:
----------------------------------------------
| PERSON | ADDRESS |
-------------+--------------------------------
| John Smith | 123 Main St., New York |
| John Smith | 321 Center St., San Francisco |
----------------------------------------------
Although each person has a unique address, a given name can appear with several different addresses. Hence we do not have a functional dependency corresponding to our single-valued fact.
Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case the same person appears to be living at two different addresses, again precluding a functional dependency.
---------------------------------------
| PERSON | ADDRESS |
-------------+-------------------------
| John Smith | 123 Main St., New York |
| John Smith | 123 Main Street, NYC |
---------------------------------------
We are not defending the use of non-unique or non-singular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.
For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.
----------------------------------------------------------
| EMPLOYEE | FATHER | FATHER'S-ADDRESS |
|============------------+-------------------------------|
| Art Smith | John Smith | 123 Main St., New York |
| Bob Smith | John Smith | 123 Main Street, NYC |
| Cal Smith | John Smith | 321 Center St., San Francisco |
----------------------------------------------------------
However, in formal terms, there is no functional dependency here between FATHER'S-ADDRESS and FATHER, and hence no violation of third normal form.
4 FOURTH AND FIFTH NORMAL FORMS
Fourth [5] and fifth [6] normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). By "many-to-many" we mean that an employee may have several skills, and a skill may belong to several employees.
Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.
In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.
4.1 Fourth Normal Form
Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.
The term "independent" will be discussed after considering an example.
Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================
Instead, they should be represented in the two records
-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================
Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.
The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record:
(1) A disjoint format, in which a record contains either a skill or a language, but not both:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | |
| Smith | type | |
| Smith | | French |
| Smith | | German |
| Smith | | Greek |
-------------------------------
This is not much different from maintaining two separate record types. (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.)
(2) A random mix, with three variations:
(a) Minimal number of records, with repetitions:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
(b) Minimal number of records, with null values:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | German |
| Smith | | Greek |
-------------------------------
(c) Unrestricted:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | |
| Smith | | German |
| Smith | type | Greek |
-------------------------------
(3) A "cross-product" form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | cook | German |
| Smith | cook | Greek |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy:
If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent.
Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.
Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn't leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn't also been deleted.
Fourth normal form minimizes such update problems.
4.1.1 Independence
We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example. The two many-to-many relationships, employee:skill and employee:language, are "independent" in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That's precisely why all the maintenance policies mentioned earlier can be allowed.
In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct:
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
|----------+-------+----------|
| Smith | cook | French |
| Smith | type | French |
| Smith | type | German |
| Smith | type | Greek |
-------------------------------
Thus the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, then it is acceptable to represent them in a single record.
4.1.2 Multivalued Dependencies
For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the "cross-product" maintenance policy mentioned above. That is, for our example, every one of an employee's skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no "information" in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent.
We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:
An employee uses certain skills on certain projects.
An employee uses certain languages on certain projects.
If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where "EP" represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.
4.2 Fifth Normal Form
Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.
We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields:
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | GM | truck |
-----------------------------
This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we need the combination of three fields to know which combinations are valid and which are not.
But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
-----------------------------
In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car |
| Smith | GM | | Ford | truck | | Smith | truck |
| Jones | Ford | | GM | car | | Jones | car |
------------------- | GM | truck | -------------------
---------------------
These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.
Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.
Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form.
One advantage of fifth normal form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith sells cars is recorded only once; in the unnormalized form it may be repeated many times.
It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences. This is not apparent when there are only a few facts to record, as in the example shown above. The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the unnormalized file increases in a multiplicative fashion. For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x+y new records to the normalized form, but xy new records to the unnormalized form.
It should be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks. But we can't determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all.
The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will imply something untrue.
-----------------------------
| AGENT | COMPANY | PRODUCT |
|-------+---------+---------|
| Smith | Ford | car |
| Smith | Ford | truck |
| Smith | GM | car |
| Smith | GM | truck |
| Jones | Ford | car |
| Jones | Ford | truck |
| Brown | Ford | car |
| Brown | GM | car |
| Brown | Totota | car |
| Brown | Totota | bus |
-----------------------------
------------------- --------------------- -------------------
| AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT |
|-------+---------| |---------+---------| |-------+---------|
| Smith | Ford | | Ford | car | | Smith | car | Fifth
| Smith | GM | | Ford | truck | | Smith | truck | Normal
| Jones | Ford | | GM | car | | Jones | car | Form
| Brown | Ford | | GM | truck | | Jones | truck |
| Brown | GM | | Toyota | car | | Brown | car |
| Brown | Toyota | | Toyota | bus | | Brown | bus |
------------------- --------------------- -------------------
Observe that:
Jones sells cars and GM makes cars, but Jones does not represent GM.
Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.
Brown represents Ford and Brown sells buses, but Ford does not make buses.
Fourth and fifth normal forms both deal with combinations of multivalued facts. One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multivalued facts, it only recognizes them in pairwise groups. We can best explain this in terms of the normalization process implied by fourth normal form. If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of these violating fourth normal form is again decomposed into two records, and so on until the resulting records are all in fourth normal form. At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.
In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case.
5 UNAVOIDABLE REDUNDANCIES
Normalization certainly doesn't remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent. In the example shown Section 4.1.1, it seems unavoidable that we record the fact that "Smith can type" several times. Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that "Smith sells cars" several times.
6 INTER-RECORD REDUNDANCY
The normal forms discussed here deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the "ultimate" normal form with respect to such redundanciesæ.
Other redundancies can occur across multiple record types. For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
-----------------------
| EMPLOYEE | LOCATION |
============-----------
In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy.
Inter-record redundancy has been recognized for some time [1], and has recently been addressed in terms of normal forms and normalization [8].
7 CONCLUSION
While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed:
Single-valued vs. multi-valued facts.
Dependency on the entire key.
Independent vs. dependent facts.
The presence of mutual constraints.
The presence of non-unique or non-singular representations.
And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications.
8 ACKNOWLEDGMENTS
I am very grateful to Ted Codd and Ron Fagin for reading earlier drafts and making valuable comments, and especially to Chris Date for helping clarify some key points.
Wednesday, April 28, 2010
Russell's Paradox
Russell's Paradox
© Copyright 2000, Jim Loy
Let you tell me a famous story:
There was once a barber. Some say that he lived in Seville. Wherever he lived, all of the men in this town either shaved themselves or were shaved by the barber. And the barber only shaved the men who did not shave themselves.That is a nice story. But it raises the question: Did the barber shave himself? Let's say that he did shave himself. But we see from the story that he shaved only the men in town who did not shave themselves. Therefore, he did not shave himself. But we again see in the story that every man in town either shaved himself or was shaved by the barber. So he did shave himself. We have a contradiction. What does that mean?
Maybe it means that the barber lived outside of town. That would be a loophole, except that the story says that he did live in the town, maybe in Seville. Maybe it means that the barber was a woman. Another loophole, except that the story calls the barber "he." So that doesn't work. Maybe there were men who neither shaved themselves nor were shaved by the barber. Nope, the story says, "All of the men in this town either shaved themselves or were shaved by the barber." Maybe there were men who shaved themselves AND were shaved by the barber. After all, "either ... or" is a little ambiguous. But the story goes on to say, "The barber only shaved the men who did not shave themselves." So that doesn't work either. Often, when the above story is told, one of these last two loopholes is left open. So I had to be careful, when I wrote down the story.
Now we come to a really serious attempt to solve the above puzzle: Maybe there was no barber like the one described in the story. But the story said, "There was once a barber..." So there really was a barber like that, unless the story is a lie! That is the answer, isn't it? The story is a lie. Sorry about that. I told the story of a barber who could not possibly exist. I had good motives. But I guess I told a lie.
In logic, some statements are true (Jim is nearsighted), some are false (Jim eats squash). And a collection of statements, such as our story, is either consistent or inconsistent. The following pair of statements is inconsistent:
- Jim likes vanilla ice cream with Smuckers Plum Jam on it!
- Jim does not like vanilla ice cream with Smuckers Plum Jam on it.
The above story about the barber is the popular version of Russell's Paradox. The story was originally told by Bertrand Russell. And of course it has a simple solution. It is inconsistent. But the story is not really that simple. The story is a retelling of a problem in set theory.
In set theory, we have sets, collections of objects. These objects may be real physical objects (marbles) or not (cartoon characters, thoughts, or numbers). When we deal with a set, we normally write it down with brackets: {A, B, C}. That set contains three letters, A, B, and C. The set {B,C} is a subset of {A, B, C}. There is a special set with no elements, the empty set {} or ø, as the set of humans bigger than the earth, or the set of odd numbers divisible by two. Some sets contain infinitely many elements, as the set of all even numbers.
A set can contain sets. The set {{A, B, C}, {x, y}} contains two sets {A, B, C} and {x, y}. It also contains the empty set, by the way. All sets contain the empty set. We can define the set of all sets. This set contains {A, B, C} and {{A, B, C}, {x, y}} and every other possible set. Some sets contain themselves. The set of all red marbles does not contain itself, because it contains no sets at all, only marbles. Let's say that S is a set which contains S and {A, B}. Then this is S: {S, {A, B}}. It contains two sets, itself and {A, B}. The set of all sets obviously contains itself. Well, let's construct a very interesting set, the set of all sets which do not contain themselves.
There is something wrong here. Does "the set of all sets which do not contain themselves" sound like "the barber who shaves all men who do not shave themselves?" The story of the barber was inconsistent. The set of all sets which do not contain themselves is inconsistent for the same reason. Does the set of all sets which do not contain themselves actually contain itself, or not? If it contains itself, then it cannot contain itself. If it does not contain itself, then it must contain itself. It is inconsistent.
But where did we go wrong? Let's make some lists. A list is a special kind of set. But we know what a list is. A list may be clearer in our minds than a set. We cannot actually physically make infinite lists. But we can certainly define some of them, like the list of all even numbers. So we can deal with infinite lists. We can also make lists of lists. Here is such a list:
- My shopping list
- My email address list
- David Letterman's list of Top Ten Whatevers
Well, maybe there are no infinite lists. There are infinite sets, for example, the set of all even numbers. And that is a list: the list of all even numbers. The concept of an infinite list is actually fairly simple.
So we have an inconsistent set. That is not all. We made no mistakes when we constructed the set of all sets which do not contain themselves. And that means that set theory is inconsistent. And that means that logic is inconsistent. And that means that all of mathematics, including algebra and geometry, is inconsistent.
It doesn't invalidate mathematics or logic or set theory. The Pythagorean theorem is still true. But there is some doubt. Kurt Godel (Gödel) proved that Number Theory (and by identical arguments, every branch of mathematics) is inconsistent. He converted Russell's Paradox, the set version, into a statement in Number Theory, and showed that Number Theory is inconsistent. This had huge repercussions in the world of mathematics. All of this leads to the following problem:
- There are things that are true in mathematics (based on basic assumptions).
- There are things that are false.
- There are things that are true that can never be proved.
- There are things that are false that can never be disproved.
Monday, April 26, 2010
UGC NET
MODEL UGC -NET PAPER I
This sample paper in for Paper I of the UGC NET Exam which is common for all streams.1. Which one of the following is the main objective of teaching?
(A) To give information related to the syllabus.
(B) To develop thinking power of students.
(C) To dictate notes to students.
(D) To prepare students to pass the examination.
2. Which one of the following is a good method of teaching?
(A) Lecture and Dictation
(B) Seminar and Project
(C) Seminar and Dictation
(D) Dictation and Assignment
3. Teacher uses teaching aids for
(A) Making teaching interesting
(B) Making teaching within understanding level of students
(C) Making students attentive.
(D) The sake of its use.
4. Effectiveness of teaching depends on
(A) Qualification of teacher
(B) Personality of teacher
(C) Handwriting of teacher
(D) Subject understanding of teacher
5. Which of the following is not characteristic of a good question paper?
(A) Objectivity
(B) Subjectivity
(C) No use of vague words
(D) Reliable.
6. A researcher is generally expected to:
(A) Study the existing literature in a field
(B) Generate new principles and theories
(C) Synthesize the idea given by others
(D) Evaluate the findings of a study
7. One of the essential characteristics of research is:
(A) Replicability
(B) Generalizability
(C) Usability
(D) Objectivity
8. The Government of India conducts Census after every 10 years. The method of research used in this process is:
(A) Case Study
(B) Developmental
(C) Survey
(D) Experimental
9. An academic association assembled at one place to discuss the progress of its work and future plans. Such an assembly is known as a
(A) Conference
(B) Seminar
(C) Workshop
(D) Symposium
10. An investigator studied the census date for a given area and prepared a write-up based on them. Such a write-up is called
(A) Research paper
(B) Article
(C) Thesis
(D) Research report
Read the following passage and answer the Question Nos. 11 to 15
The constitution guarantees every citizen the fundamental right to equality. Yet after 50 years of independence, just one perusal of the female infant mortality figures, the literacy rates and the employment opportunities for women is sufficient evidence that discrimination exists. Almost predictably, this gender, bias is evident in our political system as well. In the 13th Lok Sabha, there were only 43 women MPs out of total of 543; it is not a surprising figure, for never has women's representation in Parliament been more than 10 per cent.
Historically, the manifestos of major political have always encouraged women's participation. It has been merely a charade. So, women's organizations, denied a place on merit, opted for the last resort; a reservation of seats for women in parliament and State Assemblies. Parties, which look at everything with a vote bank in mind, seemed to endorse this. Alas, this too was a mirage.
But there is another aspect also. At a time when caste is the trump card, some politicians want the bill to include further quotas fro women from among minorities and backward castes. There is more to it. A survey shows that there is a general antipathy towards the bill. It is actually a classic case of doublespeak: in public, politicians were endorsing women's reservation but in the backrooms of Parliament, they were busy sabotaging it. The reasons are clear: Men just don't want to vacate their seats of power.
11. The problem raised in the passage reflects badly on our
(A) Political system
(B) Social behaviour
(C) Individual behaviour
(D) Behaviour of a group of people
12. According to the passage, political parties have mostly in mind
(A) Economic prosperity
(B) Vote bank
(C) People' welfare
(D) Patriotism
13. "Trump Card" means
(A) Trying to move a dead horse
(B) Playing the card cautiously
(C) Sabotaging all the moves by others
(D) Making the final jolt for success
14. The sentence "Men just don't want to vacate their seats of power" implies
(A) Lust for power
(B) Desire to serve the nation
(C) Conviction in one's own political abilities
(D) Political corruption
15. What is the percentage of women in the Lok Sabha
(A) 10
(B) 7. 91
(C) 43
(D) 9. 1
16. Informal communication network within the organization is knows as
(A) Interpersonal communication
(B) Intrapersonal Communication
(C) Mass Communication
(D) Grapevine Communication
17. TV Channel launched fro covering only Engineering and Technology subject is known as
(A) Gyan Darshan
(B) Vyas
(C) Eklavya
(D) Kisan
18. In which state the maximum number of periodicals are brought out for public information:
(A) Uttar Pradesh
(B) Tamil Nadu
(C) Kerala
(D) Punjab
19. The main objective of public broadcasting system i. e Prasar Bharti is
(A) Inform, Entertainment & Education
(B) Entertain, Information & Interaction
(C) Educate, Interact & entertain
(D) Entertainment only
20. The competerrcy of an effective communicator can be judged on the basis of:
(A) Personality of communicator
(B) Experience in the field
(C) Interactivity with target audience
(D) Meeting the needs of target audience.
21. Which one of the following belongs to the category of homogeneous date:
(A) Multi-storeyed houses in a colony
(B) Trees in a garden
(C) Vehicular traffic on a highway
(D) Student population in a class
22. In which of the following ways a theory is not different from a belief?
(A) Antecedent - consequent
(B) Acceptability
(C) Verifiability
(D) Demonstratability
23. The state - "Honesty is the best policy" is
(A) A fact
(B) An value
(C) An opinion
(D) A value judgement
24. Which one is like pillar, pole and standard?
(A) Beam
(B) Plank
(C) Shaft
(D) Timber
25. Following incomplete series is presented. Find out the number which should come at the place of question mark which will complete the series: 4, 16, 36, 64, ?
(A) 300
(B) 200
(C) 100
(D) 150
26. The following question is based on the diagram given below. If the two big circles represent animals living on soil and those living in water, and the small circle stands for the animals who both live on soil and in water, which figure represents the relationships among them.
27. Of the following statement, there are two statements both of which cannot be true but both can be false. Which are these two statements?
(i) All machines make noise
(ii) Some machines are noisy
(iii) No machine makes noise
(iv) Some machines are not noisy
(A) (i) and (ii)
(B) (iii) and (iv)
(C) (i) and (iii)
(D) (ii) and (iv)
28. In the following question a statement is followed by two assumptions.
(i) and (ii) . An assumption is something supposed or taken for granted.
Consider the statement and the following assumptions and decide which of the following assumptions is implicit in the statement.
Statement: We need not worry about errors but must try to learn from our errors.
Assumptions:
(i) Errors may take place when we are carrying out certain work.
(ii) We are capable of benefiting from the past and improve our chances of error-free work.
(A) Only assumption (i) is implicit
(B) Only assumption (ii) is implicit
(C) Either assumption (i) or (ii) is implicit
(D) Both the assumptions are implicit
29. The question below is followed by two arguments numbered (i) and (ii) Decide which of the arguments is 'strong' and which is 'weak'. Choose the correct answer from the given below Should the press exercise some self-restraint?
(i) Yes, they should not publish new items which may incite the readers to indulge in wrong practices.
(ii) No. it is the responsibility of the press to present the truth irrespective of the consequences.
(A) Only the argument (i) is strong
(B) Only the argument (ii) is strong
(C) Neither argument (i) nor (ii) is strong
(D) Both the arguments (i) and (ii) are strong
30. Study the argument and the inference drawn from that argument. Given below carefully.
Argument: Anything that goes up definitely falls down. Helicopter goes up. Inference: So the helicopter will definitely fall down.
What in your opinion is the inference drawn from the argument?
(A) Valid
(B) Invalid
(C) Doubtful
(D) Long drawn one
Four students W, X, Y, Z appeared in four papers, I, II, III and IV in a test. Their scores out of 100 are given below.
Students | Papers | |||
I | II | III | IV | |
W | 60 | 81 | 45 | 55 |
X | 59 | 43 | 51 | A |
Y | 74 | A | 71 | 65 |
Z | 72 | 76 | A | 68 |
Where 'A' stands for absent |
Read the above table and answer below mentioned Questions 31 to 35
31. Which candidate has secured between 60-65% marks in aggregate
(A) W
(B) X
(C) Y
(D) Z
32. Who has obtained the lowest average in aggregate.
(A) W
(B) X
(C) Y
(D) Z
33. Who has obtained the highest average
(A) W
(B) X
(C) Y
(D) Z
34. In which paper the lowest marks were obtained by thecandiates
(A) I
(B) II
(C) III
(D) IV
35. Which candidate has secured the highest percentage in the papers appeared
(A) W
(B) X
(C) Y
(D) Z
36. ICT stands for
(A) Information common technology
(B) Information & communication technology
(C) Information and computer technology
(D) Inter connected technology
37. Computer Can
(A) Process both quantitative and qualitative information
(B) Store huge information
(C) Process information and fast accurately
(D) All the above.
38. Satellite Communication works through
(A) Rader
(B) Transponder
(C) Receptor
(D) Transmitter
39. A Computer is that machine which works more like a human brain. This definition of computer is
(A) Correct
(B) Incorrect
(C) Partially correct
(D) None of the above.
40. Information and communication technology includes
(A) E-mail
(B) Internet
(C) Education television
(D) All the above.
41. It is believed that our globe is warming progressively. This global warming will eventually result in.
(A) Increase in availability of usable land.
(B) Uniformity of climate at equator and poles.
(C) Fall in the sea level
(D) melting of polar ice.
42. In which parts of India ground water is affected with arsenic contamination?
(A) Haryana
(B) Andhra Pradesh
(C) Sikkim
(D) West Bengal
43. Sunderban in Hooghly delta is known for
(A) Grasslands
(B) Conifers
(C) Mangroves
(D) Arid forests
44. Sardar Sarover dam is located on the river
(A) Ganga
(B) Godavari
(C) Mahanadi
(D) Narmada
45. Which one of the following trees has medicinal value?
(A) Pine
(B) Teak
(C) Neem
(D) Oak
46. Which one of the following is not considered a part of technical education in India:
(A) Medical
(B) Management
(C) Pharmaceutical
(D) Aeronautical
47. Which of the following is a Central university
(A) Mumbai University
(B) Calcutta University
(C) Delhi University
(D) Madras University
48. Identify the main Principle on which the Parliamentary System Operates
(A) Responsibility of Executive to Legislature
(B) Supremacy of Parliament
(C) Supremacy of Judiciary
(D) Theory of Separation of Power
49. The reservation of seats for women in the Panchayat Raj Institutions is:
(A) 30 % of the total seats
(B) 33 % of the total seate
(C) 33% of the total population
(D) In Proportion to their population
50. Match list I with list II and select the correct answer from the code given below:
LIST ( Institutions)
LIST II( Locations)
1. Indian Veterinary Research Institute
(i) Pune
2. Institute of Armament Technology
(ii) Izat Nagar
3. Indian Institute of Science
(iii) Delhi
4. National Institute for Educational Pannesi
(iv) Bangalore and Administrators
(A) 1(ii), 2(i), 3(iv), 4(iii)
(B) 1(ii), 2(iv), 3(ii), 4(iii)
(C) 1(ii), 2(iii), 3(i), 4(iv)
(D) 1(iv), 2(iii), 3(ii), 4(i)
Source: Sample Paper based on questions provided by UGC Model Paper.
Answer Key:
1. | B | 2. | B | 3. | B | 4. | D | 5. | B | 6. | B | 7. | D | 8. | C | 9. | A | 10. | B |
11. | B | 12. | B | 13. | D | 14. | A | 15. | B | 16. | D | 17. | C | 18. | C | 19. | A | 20. | D |
21. | A | 22. | B | 23. | D | 24. | A | 25. | A | 26. | D | 27. | C | 28. | D | 29. | A | 30. | D |
31. | A | 32. | B | 33. | A | 34. | B | 35. | D | 36. | B | 37. | D | 38. | B | 39. | A | 40. | D |
41. | D | 42. | D | 43. | C | 44. | D | 45. | C | 46. | A | 47. | C | 48. | A | 49. | B | 50. | A |
Thursday, April 22, 2010
Deadlock(in brief)
| |
Operating System Question and answer
Q: What are the basic functions of an operating system?
A: Operating system controls and coordinates the use of the hardware among the various applications programs for various uses. Operating system acts as resource allocator and manager. Since there are many possibly conflicting requests for resources the operating system must decide which requests are allocated resources to operating the computer system efficiently and fairly. Also operating system is control program which controls the user programs to prevent errors and improper use of the computer. It is especially concerned with the operation and control of I/O devices.
Q: Why paging is used?
A: Paging is solution to external fragmentation problem which is to permit the logical address space of a process to be noncontiguous, thus allowing a process to be allocating physical memory wherever the latter is available.
While running DOS on a PC, which command would be used to duplicate the entire diskette? diskcopy
Q: What resources are used when a thread created? How do they differ from those when a process is created?
A: When a thread is created the threads does not require any new resources to execute the thread shares the resources like memory of the process to which they belong to. The benefit of code sharing is that it allows an application to have several different threads of activity all within the same address space. Whereas if a new process creation is very heavyweight because it always requires new address space to be created and even if they share the memory then the inter process communication is expensive when compared to the communication between the threads.
Q: What is virtual memory?
A: Virtual memory is hardware technique where the system appears to have more memory that it actually does. This is done by time-sharing, the physical memory and storage parts of the memory one disk when they are not actively being used.
Q: What is Throughput, Turnaround time, waiting time and Response time?
A: Throughput – number of processes that complete their execution per time unit. Turnaround time – amount of time to execute a particular process. Waiting time – amount of time a process has been waiting in the ready queue. Response time – amount of time it takes from when a request was submitted until the first response is produced, not output (for time-sharing environment).
Q: What is the state of the processor, when a process is waiting for some event to occur?
A: Waiting state
Q: What is the important aspect of a real-time system or Mission Critical Systems?
A: A real time operating system has well defined fixed time constraints. Process must be done within the defined constraints or the system will fail. An example is the operating system for a flight control computer or an advanced jet airplane. Often used as a control device in a dedicated application such as controlling scientific experiments, medical imaging systems, industrial control systems, and some display systems. Real-Time systems may be either hard or soft real-time. Hard real-time: Secondary storage limited or absent, data stored in short term memory, or read-only memory (ROM), Conflicts with time-sharing systems, not supported by general-purpose operating systems. Soft real-time: Limited utility in industrial control of robotics, Useful in applications (multimedia, virtual reality) requiring advanced operating-system features.
Q: What is the difference between Hard and Soft real-time systems?
A: A hard real-time system guarantees that critical tasks complete on time. This goal requires that all delays in the system be bounded from the retrieval of the stored data to the time that it takes the operating system to finish any request made of it. A soft real time system where a critical real-time task gets priority over other tasks and retains that priority until it completes. As in hard real time systems kernel delays need to be bounded
Q: What is the cause of thrashing? How does the system detect thrashing? Once it detects thrashing, what can the system do to eliminate this problem?
A: Thrashing is caused by under allocation of the minimum number of pages required by a process, forcing it to continuously page fault. The system can detect thrashing by evaluating the level of CPU utilization as compared to the level of multiprogramming. It can be eliminated by reducing the level of multiprogramming.
Q: What is multi tasking, multi programming, multi threading?
A: Multi programming: Multiprogramming is the technique of running several programs at a time using timesharing. It allows a computer to do several things at the same time. Multiprogramming creates logical parallelism. The concept of multiprogramming is that the operating system keeps several jobs in memory simultaneously. The operating system selects a job from the job pool and starts executing a job, when that job needs to wait for any i/o operations the CPU is switched to another job. So the main idea here is that the CPU is never idle.
Multi tasking: Multitasking is the logical extension of multiprogramming .The concept of multitasking is quite similar to multiprogramming but difference is that the switching between jobs occurs so frequently that the users can interact with each program while it is running. This concept is also known as time-sharing systems. A time-shared operating system uses CPU scheduling and multiprogramming to provide each user with a small portion of time-shared system.
Multi threading: An application typically is implemented as a separate process with several threads of control. In some situations a single application may be required to perform several similar tasks for example a web server accepts client requests for web pages, images, sound, and so forth. A busy web server may have several of clients concurrently accessing it. If the web server ran as a traditional single-threaded process, it would be able to service only one client at a time. The amount of time that a client might have to wait for its request to be serviced could be enormous. So it is efficient to have one process that contains multiple threads to serve the same purpose. This approach would multithread the web-server process, the server would create a separate thread that would listen for client requests when a request was made rather than creating another process it would create another thread to service the request. To get the advantages like responsiveness, Resource sharing economy and utilization of multiprocessor architectures multithreading concept can be used.
Q: What is hard disk and what is its purpose?
A: Hard disk is the secondary storage device, which holds the data in bulk, and it holds the data on the magnetic medium of the disk.Hard disks have a hard platter that holds the magnetic medium, the magnetic medium can be easily erased and rewritten, and a typical desktop machine will have a hard disk with a capacity of between 10 and 40 gigabytes. Data is stored onto the disk in the form of files.
Q: What is fragmentation? Different types of fragmentation?
A: Fragmentation occurs in a dynamic memory allocation system when many of the free blocks are too small to satisfy any request.
External Fragmentation: External Fragmentation happens when a dynamic memory allocation algorithm allocates some memory and a small piece is left over that cannot be effectively used. If too much external fragmentation occurs, the amount of usable memory is drastically reduced. Total memory space exists to satisfy a request, but it is not contiguous.
Internal Fragmentation: Internal fragmentation is the space wasted inside of allocated memory blocks because of restriction on the allowed sizes of allocated blocks. Allocated memory may be slightly larger than requested memory; this size difference is memory internal to a partition, but not being used
Q: What is DRAM? In which form does it store data?
A: DRAM is not the best, but it’s cheap, does the job, and is available almost everywhere you look. DRAM data resides in a cell made of a capacitor and a transistor. The capacitor tends to lose data unless it’s recharged every couple of milliseconds, and this recharging tends to slow down the performance of DRAM compared to speedier RAM types.
Q: What is Dispatcher?
A: Dispatcher module gives control of the CPU to the process selected by the short-term scheduler; this involves: Switching context, Switching to user mode, Jumping to the proper location in the user program to restart that program, dispatch latency – time it takes for the dispatcher to stop one process and start another running.
Q: What is CPU Scheduler?
A: Selects from among the processes in memory that are ready to execute, and allocates the CPU to one of them. CPU scheduling decisions may take place when a process: 1.Switches from running to waiting state. 2.Switches from running to ready state. 3.Switches from waiting to ready. 4.Terminates. Scheduling under 1 and 4 is non-preemptive. All other scheduling is preemptive.
Q: What is Context Switch?
A: Switching the CPU to another process requires saving the state of the old process and loading the saved state for the new process. This task is known as a context switch. Context-switch time is pure overhead, because the system does no useful work while switching. Its speed varies from machine to machine, depending on the memory speed, the number of registers which must be copied, the existed of special instructions(such as a single instruction to load or store all registers).
Q: What is cache memory?
A: Cache memory is random access memory (RAM) that a computer microprocessor can access more quickly than it can access regular RAM. As the microprocessor processes data, it looks first in the cache memory and if it finds the data there (from a previous reading of data), it does not have to do the more time-consuming reading of data from larger memory.
Q: What is a Safe State and what is its use in deadlock avoidance?
A: When a process requests an available resource, system must decide if immediate allocation leaves the system in a safe state. System is in safe state if there exists a safe sequence of all processes. Deadlock Avoidance: ensure that a system will never enter an unsafe state.
Q: What is a Real-Time System?
A: A real time process is a process that must respond to the events within a certain time period. A real time operating system is an operating system that can run real time processes successfully
A: Operating system controls and coordinates the use of the hardware among the various applications programs for various uses. Operating system acts as resource allocator and manager. Since there are many possibly conflicting requests for resources the operating system must decide which requests are allocated resources to operating the computer system efficiently and fairly. Also operating system is control program which controls the user programs to prevent errors and improper use of the computer. It is especially concerned with the operation and control of I/O devices.
Q: Why paging is used?
A: Paging is solution to external fragmentation problem which is to permit the logical address space of a process to be noncontiguous, thus allowing a process to be allocating physical memory wherever the latter is available.
While running DOS on a PC, which command would be used to duplicate the entire diskette? diskcopy
Q: What resources are used when a thread created? How do they differ from those when a process is created?
A: When a thread is created the threads does not require any new resources to execute the thread shares the resources like memory of the process to which they belong to. The benefit of code sharing is that it allows an application to have several different threads of activity all within the same address space. Whereas if a new process creation is very heavyweight because it always requires new address space to be created and even if they share the memory then the inter process communication is expensive when compared to the communication between the threads.
Q: What is virtual memory?
A: Virtual memory is hardware technique where the system appears to have more memory that it actually does. This is done by time-sharing, the physical memory and storage parts of the memory one disk when they are not actively being used.
Q: What is Throughput, Turnaround time, waiting time and Response time?
A: Throughput – number of processes that complete their execution per time unit. Turnaround time – amount of time to execute a particular process. Waiting time – amount of time a process has been waiting in the ready queue. Response time – amount of time it takes from when a request was submitted until the first response is produced, not output (for time-sharing environment).
Q: What is the state of the processor, when a process is waiting for some event to occur?
A: Waiting state
Q: What is the important aspect of a real-time system or Mission Critical Systems?
A: A real time operating system has well defined fixed time constraints. Process must be done within the defined constraints or the system will fail. An example is the operating system for a flight control computer or an advanced jet airplane. Often used as a control device in a dedicated application such as controlling scientific experiments, medical imaging systems, industrial control systems, and some display systems. Real-Time systems may be either hard or soft real-time. Hard real-time: Secondary storage limited or absent, data stored in short term memory, or read-only memory (ROM), Conflicts with time-sharing systems, not supported by general-purpose operating systems. Soft real-time: Limited utility in industrial control of robotics, Useful in applications (multimedia, virtual reality) requiring advanced operating-system features.
Q: What is the difference between Hard and Soft real-time systems?
A: A hard real-time system guarantees that critical tasks complete on time. This goal requires that all delays in the system be bounded from the retrieval of the stored data to the time that it takes the operating system to finish any request made of it. A soft real time system where a critical real-time task gets priority over other tasks and retains that priority until it completes. As in hard real time systems kernel delays need to be bounded
Q: What is the cause of thrashing? How does the system detect thrashing? Once it detects thrashing, what can the system do to eliminate this problem?
A: Thrashing is caused by under allocation of the minimum number of pages required by a process, forcing it to continuously page fault. The system can detect thrashing by evaluating the level of CPU utilization as compared to the level of multiprogramming. It can be eliminated by reducing the level of multiprogramming.
Q: What is multi tasking, multi programming, multi threading?
A: Multi programming: Multiprogramming is the technique of running several programs at a time using timesharing. It allows a computer to do several things at the same time. Multiprogramming creates logical parallelism. The concept of multiprogramming is that the operating system keeps several jobs in memory simultaneously. The operating system selects a job from the job pool and starts executing a job, when that job needs to wait for any i/o operations the CPU is switched to another job. So the main idea here is that the CPU is never idle.
Multi tasking: Multitasking is the logical extension of multiprogramming .The concept of multitasking is quite similar to multiprogramming but difference is that the switching between jobs occurs so frequently that the users can interact with each program while it is running. This concept is also known as time-sharing systems. A time-shared operating system uses CPU scheduling and multiprogramming to provide each user with a small portion of time-shared system.
Multi threading: An application typically is implemented as a separate process with several threads of control. In some situations a single application may be required to perform several similar tasks for example a web server accepts client requests for web pages, images, sound, and so forth. A busy web server may have several of clients concurrently accessing it. If the web server ran as a traditional single-threaded process, it would be able to service only one client at a time. The amount of time that a client might have to wait for its request to be serviced could be enormous. So it is efficient to have one process that contains multiple threads to serve the same purpose. This approach would multithread the web-server process, the server would create a separate thread that would listen for client requests when a request was made rather than creating another process it would create another thread to service the request. To get the advantages like responsiveness, Resource sharing economy and utilization of multiprocessor architectures multithreading concept can be used.
Q: What is hard disk and what is its purpose?
A: Hard disk is the secondary storage device, which holds the data in bulk, and it holds the data on the magnetic medium of the disk.Hard disks have a hard platter that holds the magnetic medium, the magnetic medium can be easily erased and rewritten, and a typical desktop machine will have a hard disk with a capacity of between 10 and 40 gigabytes. Data is stored onto the disk in the form of files.
Q: What is fragmentation? Different types of fragmentation?
A: Fragmentation occurs in a dynamic memory allocation system when many of the free blocks are too small to satisfy any request.
External Fragmentation: External Fragmentation happens when a dynamic memory allocation algorithm allocates some memory and a small piece is left over that cannot be effectively used. If too much external fragmentation occurs, the amount of usable memory is drastically reduced. Total memory space exists to satisfy a request, but it is not contiguous.
Internal Fragmentation: Internal fragmentation is the space wasted inside of allocated memory blocks because of restriction on the allowed sizes of allocated blocks. Allocated memory may be slightly larger than requested memory; this size difference is memory internal to a partition, but not being used
Q: What is DRAM? In which form does it store data?
A: DRAM is not the best, but it’s cheap, does the job, and is available almost everywhere you look. DRAM data resides in a cell made of a capacitor and a transistor. The capacitor tends to lose data unless it’s recharged every couple of milliseconds, and this recharging tends to slow down the performance of DRAM compared to speedier RAM types.
Q: What is Dispatcher?
A: Dispatcher module gives control of the CPU to the process selected by the short-term scheduler; this involves: Switching context, Switching to user mode, Jumping to the proper location in the user program to restart that program, dispatch latency – time it takes for the dispatcher to stop one process and start another running.
Q: What is CPU Scheduler?
A: Selects from among the processes in memory that are ready to execute, and allocates the CPU to one of them. CPU scheduling decisions may take place when a process: 1.Switches from running to waiting state. 2.Switches from running to ready state. 3.Switches from waiting to ready. 4.Terminates. Scheduling under 1 and 4 is non-preemptive. All other scheduling is preemptive.
Q: What is Context Switch?
A: Switching the CPU to another process requires saving the state of the old process and loading the saved state for the new process. This task is known as a context switch. Context-switch time is pure overhead, because the system does no useful work while switching. Its speed varies from machine to machine, depending on the memory speed, the number of registers which must be copied, the existed of special instructions(such as a single instruction to load or store all registers).
Q: What is cache memory?
A: Cache memory is random access memory (RAM) that a computer microprocessor can access more quickly than it can access regular RAM. As the microprocessor processes data, it looks first in the cache memory and if it finds the data there (from a previous reading of data), it does not have to do the more time-consuming reading of data from larger memory.
Q: What is a Safe State and what is its use in deadlock avoidance?
A: When a process requests an available resource, system must decide if immediate allocation leaves the system in a safe state. System is in safe state if there exists a safe sequence of all processes. Deadlock Avoidance: ensure that a system will never enter an unsafe state.
Q: What is a Real-Time System?
A: A real time process is a process that must respond to the events within a certain time period. A real time operating system is an operating system that can run real time processes successfully
Sunday, April 18, 2010
Friday, April 16, 2010
Subscribe to:
Posts (Atom)