Normalization Flashcards
4 Rules of Normalization
1) No redundancy of facts
2) No Cluttering of facts
3) Must preserve information
4) Must preserve functional dependencies
Problems with Updating Non Normalized Relations
If a fact is stored redundantly, every row that duplicates it must be updated to stay consistent; missing even one row leaves the relation inconsistent.
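A minimal Python sketch of this (usernames and salaries are made up) showing how a missed update leaves duplicated facts inconsistent:

```python
# Hypothetical denormalized rows: alice's salary is stored once per interest.
rows = [
    {"username": "alice", "interest": "chess",  "salary": 62000},
    {"username": "alice", "interest": "hiking", "salary": 62000},
]

# Update only the first row and forget the second...
rows[0]["salary"] = 70000

# ...and the relation now holds two conflicting salaries for alice.
print({r["salary"] for r in rows if r["username"] == "alice"})  # {70000, 62000}
```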
Problems with Deletion from Non Normalized Relations
Deleting a row can also destroy facts an application still expects from other attributes of that row. IE.
A User born in 1978 has Salary 62k. If an application needs the birth year corresponding to the 62k salary, deleting that User also deletes that fact.
Problems with Inserting into Non Normalized Relations
If we have many columns that are non normalized, we may have facts (functional dependencies) we’d like to record but cannot without creating NULL values elsewhere.
IE. Users have Username, Birth Year, and Salary properties. If I just want to represent that a certain Birth Year relates to a certain Salary, I can’t do that without a NULL Username value.
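A sketch of that situation with hypothetical column names; the only way to record the Birth Year/Salary fact on its own is a row whose Username is NULL (None):

```python
# Denormalized Users relation: Username, BirthYear, Salary all in one table.
users = [
    {"username": "alice", "birth_year": 1978, "salary": 62000},
]

# We only want to state "birth year 1985 corresponds to salary 55k", but the
# schema forces a NULL username -- impossible if Username is the primary key.
users.append({"username": None, "birth_year": 1985, "salary": 55000})
print(users[-1])
```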
Problems with Redundancy in Non Normalized Relations
Repeating certain pieces of information to avoid the multivalue problem can cause inconsistency.
Information Loss Problem
If we decompose a relation into two relations, we could get too many rows back when we recombine them with a join. This causes an inability to distinguish fact from fiction in the db.
IE. User has Email, Interest, SinceAge, BirthYear, Salary, Current City.
Decompose into one relation with CurrentCity and Salary, and another with Email, Interest, SinceAge, BirthYear, and CurrentCity.
However, joining these back together can produce spurious rows, because CurrentCity is not a key in either relation and cannot tell matching rows apart.
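A small Python sketch of this lossy decomposition (attribute names from the example above, data made up). The join on CurrentCity, which is not a key in either projection, returns spurious rows:

```python
# Original facts: two users in the same city with different salaries.
original = [
    {"email": "a@x.com", "current_city": "Oslo", "salary": 62000},
    {"email": "b@x.com", "current_city": "Oslo", "salary": 80000},
]

# Project into (CurrentCity, Salary) and (Email, CurrentCity).
city_salary = [{"current_city": r["current_city"], "salary": r["salary"]} for r in original]
email_city = [{"email": r["email"], "current_city": r["current_city"]} for r in original]

# Natural join on CurrentCity.
rejoined = [
    {**e, "salary": cs["salary"]}
    for e in email_city
    for cs in city_salary
    if e["current_city"] == cs["current_city"]
]

print(len(rejoined))  # 4 rows instead of 2: a@x.com now also appears with 80k
```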
Dependency Loss
When splitting a non normalized relation into multiple relations, we may lose the ability to enforce functional dependencies in the process. This indicates the normalization was done incorrectly.
Functional Dependencies
A property or set of properties that determine another property or set of properties.
Email uniquely determines City and BirthYear
Email + Interest uniquely determine SinceAge.
These dependencies must remain enforceable across the resulting set of relations; otherwise the solution isn’t properly normalized.
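One way to see whether a functional dependency holds in sample data is to check that no two rows agree on the left-hand side while disagreeing on the right-hand side. A minimal sketch with hypothetical rows matching the Email/Interest example:

```python
def fd_holds(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds in the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        if key in seen and seen[key] != val:
            return False  # same LHS values, different RHS values -> violated
        seen[key] = val
    return True

rows = [
    {"email": "a@x.com", "interest": "chess",  "since_age": 7,  "birth_year": 1978},
    {"email": "a@x.com", "interest": "hiking", "since_age": 20, "birth_year": 1978},
]
print(fd_holds(rows, ["email"], ["birth_year"]))             # True
print(fd_holds(rows, ["email", "interest"], ["since_age"]))  # True
print(fd_holds(rows, ["email"], ["since_age"]))              # False
```

Note that such a check only shows the FD is not violated by this particular sample; an FD is a statement about every valid instance of the relation.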
Fully Functional Dependencies
A functional dependency X –> Y where Y is functionally dependent on X and Y is not functionally dependent on any proper subset of X.
ABC –> D is fully functional if none of A –> D, B –> D, C –> D, AB –> D, AC –> D, BC –> D hold. Only ABC together determines D.
Keys and Functional Dependencies
Using keys to enforce uniqueness helps to enforce functional dependencies (x–>y)
Non Normal Form
(NF², non-first normal form) no normalization applied
1st Normal Form
(1NF) all attributes are atomic (no composite values)
2nd Normal Form
(2NF) 1NF + every nonkey attribute is fully functionally dependent on the key.
3rd Normal Form
(3NF) 2NF + every nonkey attribute is non-transitively dependent on the key
Boyce Codd Normal Form
(BCNF) 3NF + every determinant is a candidate key.
ie. For each FD X –> Y, X must be a Super Key
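A sketch of the BCNF test, assuming the candidate keys are already known: a determinant is a superkey exactly when it contains some candidate key. The relation and FDs below are the R(A B C D) example used on later cards:

```python
def is_bcnf(fds, candidate_keys):
    """BCNF: every determinant (FD left-hand side) must contain a candidate key."""
    return all(any(ck <= lhs for ck in candidate_keys) for lhs, _rhs in fds)

# R(A B C D) with AB -> C and C -> D; the only candidate key is AB.
fds = [({"A", "B"}, {"C"}), ({"C"}, {"D"})]
print(is_bcnf(fds, [{"A", "B"}]))  # False: determinant C is not a superkey
```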
Determinant
a set of attributes on which some other attribute is fully functionally dependent.
Kent and Diehr quote
All attributes must depend on the key (1NF), the whole key (2NF), and nothing but the key (3NF) so help me Codd!
Armstrong’s Reflexivity Rule
if Y is a subset of X, then X –> Y
ex. Email, Interest –> Interest
Armstrong’s Augmentation Rule
If X –> Y, then WX –> WY
ex. If Email –> BirthYear then
Email, Interest –> BirthYear, Interest
Armstrong’s Transitivity Rule
If X –> Y and Y –> Z then X –> Z
Email –> BirthYear and BirthYear –> Salary
then Email –> Salary
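A toy sketch of the transitivity rule applied mechanically to the FDs above (the set-of-pairs representation is just an illustration, not a standard API):

```python
# FDs written as (lhs, rhs) pairs of frozensets of attribute names.
base = {
    (frozenset({"Email"}), frozenset({"BirthYear"})),
    (frozenset({"BirthYear"}), frozenset({"Salary"})),
}

def transitivity(fds):
    """One pass of Armstrong's transitivity rule: X -> Y and Y -> Z give X -> Z."""
    derived = set(fds)
    for x, y in fds:
        for y2, z in fds:
            if y == y2:
                derived.add((x, z))
    return derived

for lhs, rhs in transitivity(base) - base:
    print(set(lhs), "->", set(rhs))  # {'Email'} -> {'Salary'}
```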
How to guarantee lossless joins
The join field must be a key in at least one of the relations
How do we guarantee preservation of FDs?
The meaning implied by the remaining functional dependencies must be the same as the meaning implied by the original set; Armstrong’s rules can be used to check that the original FDs are still derivable.
Normalization Guideline 1
Design relation schema so that it’s easy to explain its meaning.
Normalization Guideline 2
Design the base relation schema so that no insertion, deletion, or modification anomalies are present.
Insertion Anomalies
Where a poorly designed table with multiple entities in multiple rows creates a situation where inserting a row for a single entity forces an implementer to fill in data for a non-pertinent entity sharing the row, or to insert NULL values for those columns.
Inserting NULLs for PKs is impossible, so this has to be considered.
Deletion Anomalies
Where a poorly designed table with multiple entities in multiple rows creates a situation where deleting a row to eliminate one entity may completely delete another entity sharing the row.
Modification Anomalies
Where a poorly designed table with multiple entities in multiple rows creates a situation where updating an entity forces an implementer to update all rows that entity exists in.
Failing to update all of those rows could create consistency issues later on.
Normalization Guideline 3
Avoid placing attributes in a base relation whose values may frequently be NULL.
NULL Values in Tuples
NULL values waste space and can complicate things at the logical level.
NULL Values can be interpreted in different ways
1) Attribute doesn’t apply to tuple
2) Attribute value is unknown for this tuple
3) Value is known but absent from the tuple
Normalization Guideline 4
Design relation schemas so that they can be joined with equality conditions on attributes that are appropriately related.
Closure of an Attribute
The set of attributes determined by the attribute in question.
Denoted with a superscript plus sign (X+) or Cl(X), where X is the attribute in question.
ie. R( A B C D)
A –> B
B –> D
C –> B
A+ = {A B D}
Cl ( A ) = {A B D}
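The closure can be computed mechanically: start from the attribute set and keep adding the right-hand side of any FD whose left-hand side is already covered. A minimal sketch over the R(A B C D) example above:

```python
def closure(attrs, fds):
    """Closure of attrs under fds, where fds is a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [({"A"}, {"B"}), ({"B"}, {"D"}), ({"C"}, {"B"})]
print(closure({"A"}, fds))  # {'A', 'B', 'D'}
print(closure({"C"}, fds))  # {'C', 'B', 'D'} -- so neither A nor C determines everything
```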
Candidate Key
A minimal set of attributes that determines every other attribute in the relation (no proper subset of it is also a key).
(Pro Tip: when looking at a set of FDs, any attribute that never appears on the right-hand side of an FD must be part of every candidate key.)
Prime Attribute
an attribute belonging to a candidate key.
ie R( A B C D )
AB –> C
C –> D
A and B are prime attributes
Non-Prime Attribute
an attribute in a relation not belonging to a candidate key
ie R( A B C D )
AB –> C
C –> D
C and D are non-prime attributes
Super Key
A set of attributes for which no two distinct tuples can have the same values (ie the combination of values for the SK attributes is never repeated and identifies distinct rows).