DB Design - CL1 Flashcards
Relational terminology: Entities
What Is a Database?
Database terminology is almost as slippery as the term “object-oriented
programming.” The word “database” can be used to describe everything from a
single set of data, such as a telephone list, to a complex set of tools, such as SQL
Server, and a whole lot in between. This lack of precision isn’t a bad thing,
necessarily—it’s just the nature of language—but it’s not particularly useful for
our purposes, so I’ll try to be a bit more precise here. Figure 1-1 shows the
relationships between the terms discussed below.
Although relational databases don’t have real-world analogies, most are intended
to model some aspect of the real world. I’ll call that bit of the real world the
problem space. The problem space, by its nature, is messy and complex—if it
weren’t, we wouldn’t need to build a model of it. But it is critical to the success
of your project to limit the database system you’re designing to a specific, well-defined set of objects and interactions; only by doing so can you make sensible
decisions about the scope of your system.
I’ll use the term data model to mean the conceptual description of the problem
space. This includes the definition of entities, their attributes (a Customer, for
example, is an entity, and it might have the attributes Name and Address), and
the entity constraints (such as, the CustomerName cannot be empty). The data
model also includes a description of the relationships between entities and any
constraints on those relationships—for example, managers are not allowed to
have more than five individuals reporting to them. It does not include any
reference to the physical layout of the system.
The definition of the physical layout—the tables and views that will be
implemented—is the database schema or just schema. It’s the translation of the
conceptual model into a physical representation that can be implemented using a
database management system. Note that the schema is still conceptual, not
physical. The schema is nothing more than the data model expressed in the terms
that you will use to describe it to the database engine—tables and triggers and
such creatures. One of the benefits of using a database engine is that you don’t
ever have to deal with the physical implementation; you can largely ignore B-trees and leaf nodes.
Once you’ve explained to the database engine what you want the data to look
like, using either code or an interactive environment such as Microsoft Access,
the engine will create some physical objects (usually, but not always, on a hard
disk someplace) and you’ll store data in them. The combination of structure and
data is what I’ll refer to as a database. This database includes the physical tables;
the defined views, queries, and stored procedures; and the rules the engine will
enforce to protect the data.
The term “database” does not include the application, which consists of the
forms and reports with which your users will interact, nor does it include any of
the bits and pieces—things such as middleware or Microsoft Transaction Server
—used to stick the front and back ends together. The term “database” also
excludes the database engine. Thus, an Access .mdb file is a database, while
Microsoft Jet is a database engine. Actually, an .mdb file can contain other
things besides the database—forms and reports, for example—but that’s a topic
we’ll discuss later.
To describe all these components—the application, the database, the database
engine, and the middleware—I’ll use the term database system. All of the
software and data that goes into making a production system is part of the
database system.
The Relational Model
The relational model is based on a collection of mathematical principles drawn
primarily from set theory and predicate logic. These principles were first applied
to the field of data modeling in the late 1960s by Dr. E. F. Codd, then a
researcher at IBM, and first published in 1970. The relational model defines the
way data can be represented (data structure), the way data can be protected (data
integrity), and the operations that can be performed on data (data manipulation).
The relational model is not the only method available for storing and
manipulating data. Alternatives include the hierarchical, network, and star
models. Each of these models has its advocates, and each has its advantages for
certain kinds of tasks. The relational model is not particularly well-suited for
handling hierarchical data, for instance, a problem the star model was
specifically designed to address. But because of its efficiency and flexibility, the
relational model is by far the most popular database technique and is the one
discussed in this book. Both the Microsoft Jet database engine and Microsoft
SQL Server implement the relational model.
In general terms, relational database systems have the following characteristics:
- All data is conceptually represented as an orderly arrangement of data into
rows and columns, called a relation.
- All values are scalar. That is, at any given row/column position in the
relation there is one and only one value.
- All operations are performed on an entire relation and result in an entire
relation, a concept known as closure.
If you’ve worked with Microsoft Access databases at all, you’ll recognize a
“relation” as a “recordset” or, in SQL Server terms, as a “result set.” Dr. Codd,
when formulating the relational model, chose the term “relation” because it was
comparatively free of connotations, unlike, for example, the word “table.” It’s a
common misconception that the relational model is so called because
relationships are established between tables. In fact, the name is derived from the
relations on which it’s based.
Notice that the model requires only that data be conceptually represented as a
relation; it doesn’t specify how the data should be physically implemented. This
separation of the conceptual and physical representations, although it seems
obvious now, was a major innovation 30 years ago when database programming
generally meant writing machine code to physically manipulate the data storage
devices.
In fact, relations need not have a physical representation at all. A given recordset
might map to an actual physical table someplace on a disk, but it can just as well
be based on columns drawn from half a dozen different tables, with a few
calculated fields—which aren’t physically stored anywhere—thrown in for good
measure. A relation is a relation provided that it’s arranged in row and column
format and its values are scalar. Its existence is completely independent of any
physical representation.
The principle of closure—that both base tables and the results of operations are
represented conceptually as relations—enables the results of one operation to be
used as the input to another operation. Thus, with both the Jet database engine
and SQL Server we can use the results of one query as the basis for another. This
provides database designers with functionality similar to a subroutine in
procedural development: the ability to encapsulate complex or commonly
performed operations and reuse them wherever necessary.
For example, you might have created a query called FullNameQuery that
concatenates the various attributes representing an individual’s name into a
single calculated field called FullName. You can create a second query using
FullNameQuery as a source that uses the calculated FullName field just like any
field that’s actually present in the base table. There is no need to recalculate the
name.
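In SQL Server terms, a saved query like FullNameQuery is a view. The following is a minimal sketch of the closure principle at work; the Customers table and its FirstName and LastName columns are assumptions made purely for the illustration, not objects defined elsewhere in this chapter.

    CREATE VIEW FullNameQuery AS
        SELECT CustomerID,
               FirstName + ' ' + LastName AS FullName
        FROM Customers;
    GO

    -- The calculated FullName field can be used as though it were stored.
    SELECT FullName
    FROM FullNameQuery
    WHERE FullName LIKE 'A%';

The second query neither knows nor cares that FullName is calculated; the view is simply another relation.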
The requirement that all values in a relation be scalar can be somewhat
treacherous. The concept of “one value” is necessarily subjective, based as it is
on the semantics of the data model. To give a common example, a “Name” might
be a single value in one model, but another environment might require that that
value be split into “Title”, “Given Name”, and “Surname”, and another might
require the addition of “Middle Name” or “Title of Courtesy”. None of these is
more or less correct in absolute terms; it depends on the use to which the data
will be put.
Entities
It’s difficult to provide a precise formal definition of the term entity, but the
concept is intuitively quite straightforward: an entity is anything about which the
system needs to store information.
When you begin to design your data model, compiling an initial list of entities
isn’t difficult. When you (or your clients) talk about the problem space, most of
the nouns and verbs used will be candidate entities. “Customers buy products.
Employees sell products. Suppliers sell us products.” The nouns “Customers,”
“Products,” “Employees,” and “Suppliers” are all clearly entities.
The events represented by the verbs “buy” and “sell” are also entities, but a
couple of traps exist here. First, the verb “sell” is used to represent two distinct
events: the sale of a product to a customer and the purchase of a product by the
organization. That’s fairly obvious in this example, but it’s an easy trap to fall
into, particularly if you’re not familiar with the problem space.
The second gotcha is the inverse of the first: two different verbs (“buy” in the
first sentence and “sell” in the second) are used to describe the same event, the
purchase of a product by a customer. Again, this isn’t necessarily obvious unless
you’re familiar with the problem space. This problem is often trickier to track
down than the first. If a client is using different verbs to describe what appears to
be the same event, they might in fact be describing different kinds of events. If
the client is a tailor, for example, “customer buys suit” and “customer orders
suit” might both result in the sale of a suit, but in the first case it’s a prêt-à-porter
sale and in the second it’s bespoke. These are very different processes that might
need to be modeled differently.
In addition to interviewing clients to establish a list of entities, it’s also useful to
review any documents that exist in the problem space. Input forms, reports, and
procedures manuals are all good sources of candidate entities. You must be
careful with documents, however. Printed documents have a great deal of inertia:
input forms particularly are expensive to print and frequently don’t keep up with
changes to policies and procedures. If you stumble across an entity that’s never
come up in an interview, don’t assume the client just forgot to mention it.
Chances are that it’s a legacy item that’s no longer pertinent to the organization.
You’ll need to check.
Once you’ve developed an initial list of candidate entities, you need to review
them for completeness and consistency. Again, you need to find duplicates and
distinct entities that are masquerading as the same entity. A useful tool in this
process is the concept of entity subtypes. To return to our example of the tailor,
“prêt-à-porter” and “bespoke” both represent the purchase of an item of clothing,
but they’re different kinds of purchases. In other words, Sale and Order are both
subtypes of the entity Purchase.
Attributes that are common to both types of Purchase are assigned to the
supertype, in this case, Purchase, and attributes specific to a subtype—
PretAPorter or Bespoke in this instance—are factored out to that subtype. This
allows both kinds of events to be treated as generic Purchases when that is
appropriate (as when calculating total sales) or as specific kinds of Purchases (as
when comparing subtypes).
Sometimes you might discover that the entity subtypes don’t actually have
distinct attributes, in which case it’s more convenient to make TypeOfSale (or
TypeOfCustomer, or TypeOfWhatever) an attribute of the supertype rather than
modeling the subtypes as distinct entities. In our tailoring example, for bespoke sales you might need to know the cloth and color selected, whereas for
prêt-à-porter sales you might need to track the garment manufacturer. In this
case, you would use subtypes to model these entities. If, however, you only need
to know that a sale was bespoke or prêt-à-porter, a TypeOfSale attribute would
be simpler to implement.
Subtypes are usually mutually exclusive, but this is by no means always the
case. Consider an employee database. All employees have certain attributes in
common (hire date, department, and telephone extension), but only some will be
salespeople (with specific attributes for commission rate and target) and only a
few will join the company softball team. There’s nothing preventing a
salesperson from playing softball, however.
Most entities model objects or events in the physical world: customers, products,
or sales calls. These are concrete entities. Entities can also model abstract
concepts. The most common example of an abstract entity is one that models the
relationship between other entities—for example, the fact that a certain sales
representative is responsible for a certain client or that a certain student is
enrolled in a certain class.
Sometimes all you need to model is the fact that a relationship exists. Other
times you’ll want to store additional information about the relationships, such as
the date on which it was established or some characteristic of the relationship.
The relationship between cougars and coyotes is competitive, that between
cougars and rabbits is predatory, and it’s useful to know this if you’re planning an
open-range zoo.
Whether relationships that do not have attributes ought to be modeled as separate
entities is a matter of some discussion. I don’t think anything is gained by doing
so, and it complicates the process of deriving a database schema from the data
model. However, understand that relationships are as important as entities are in
the data model.
Relational terminology: Attributes
Your system will need to keep track of certain facts about each entity. These
facts are referred to as the entity’s attributes. If your system includes Customer
entities, for example, you’ll probably want to know the names and addresses of
the customers and perhaps the businesses they’re in. If you’re modeling an event
such as a Service Call, you’ll probably want to know who the customer was, who
made the call, when it was made, and whether the problem was resolved.
Determining the attributes to be included in your model is a semantic process.
You must make your decisions based on what the data means and how it will be
used. Let’s look at one common example: an address. Do you model the address
as a single entity (the Address) or as a set of entities (HouseNumber, Street, City,
State, ZipCode)? Most designers (myself included) would tend to automatically
break the address up into a set of attributes on the general principle that
structured data is easier to manipulate, but this is not necessarily correct and
certainly not straightforward.
Let’s take, for instance, a local amateur musical society. It will want to store the
addresses of its members in order to print mailing labels. Since all the members
live in the same city, there is no reason to ever look at an address as anything
other than a blob: a single, multiline chunk of text that gets spat out on demand.
But what about a mail-order company that does all its business on the Internet?
For sales tax purposes, the company needs to know the states in which its
customers reside. While it’s possible to extract the state from the single text field
used by the musical society, it isn’t easy; so it makes sense in this case to at least
model the state as a separate attribute. What about the rest of the address?
Should it be composed of multiple attributes, and if so, what are they? Be aware
that while addresses in the United States conform to a fairly standard pattern,
modeling them is probably not as simple as it appears.
You might think that a set of attributes {HouseNumber, Street, City, State,
ZipCode} might be adequate. But then you need to deal with apartment numbers
and post office boxes and APO addresses. What do you do with an address to be
sent in care of someone else? And of course the world is getting smaller but not
less complex, so what happens when you get your first customer outside the
United States? Not only do you need to know the country and adjust the zip
code, but the arrangement of the attributes might need to change. In most of
Europe, for example, the house number follows the street name. That’s not too
bad; it’s easy enough to map that when you’re entering data. But how many of your users would know that in the address 4/32 Griffen Avenue, Bondi Beach, Australia, 4/32 means Apartment 4, Number 32?
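To make the trade-off concrete, here are the two extremes sketched as table definitions. The table and column names are hypothetical, and neither structure is correct in absolute terms; each is appropriate only to its own problem space.

    -- Musical society: the address is just a blob to be printed on a label.
    CREATE TABLE Members (
        MemberID INT PRIMARY KEY,
        FullName VARCHAR(100),
        Address  VARCHAR(255)     -- a single, multiline chunk of text
    );

    -- Mail-order company: State (and perhaps Country) must be queryable.
    CREATE TABLE Customers (
        CustomerID  INT PRIMARY KEY,
        FullName    VARCHAR(100),
        StreetLine1 VARCHAR(100),
        StreetLine2 VARCHAR(100), -- apartment, PO box, "care of", and so on
        City        VARCHAR(50),
        State       VARCHAR(20),
        PostalCode  VARCHAR(10),
        Country     VARCHAR(50)
    );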
The point here is not so much that addresses are hard to model, although they
are, but rather that you can’t make any assumptions about how you should model
any specific kind of data. The complex schema that you develop for handling
international mail order is completely inappropriate for the local musical society.
Matisse is reputed to have said that a painting was finished when nothing could
be either added or subtracted. Entity design is a bit like that. How do you know
when you’ve reached that point? The unfortunate answer is that you can never
know for certain. At the current state of technology, there isn’t any way to
develop a provably correct database design. You can prove that some designs
have flaws, but you can’t prove that any given design doesn’t. You can’t, if you
will, prove your innocence. How do you tackle this problem? There are no rules,
but there are some strategies.
The first strategy is: start with the result and don’t make the design any more
complex than it needs to be.
What questions does your database have to answer? In our first example, the
musical society, the only question was “Where do I mail a letter to this person?”,
so a single-attribute model was sufficient. The second example, the mail order
company, also had to answer “In what state does this person live?”, so we needed
a different structure to provide the results.
You need to be careful, of course, that you try to provide the flexibility to handle
not just the questions your users are asking now but also the ones you can
foresee them asking in the future. I’d be willing to bet, for instance, that within a
year of implementing the musical society system the society will come back
asking you to sort the addresses by zip code so that they can qualify for bulk
mail discounts.
You should also be on the lookout for questions the users would ask if they only
knew they could, particularly if you’re automating a manual system. Imagine
asking a head librarian how many of the four million books in the collection
were published in Chicago before 1900. He or she would point you to the card
file and tell you to have fun. Yet this is trivial information to request from a well-designed database system.
One of the hallmarks of good designers is the thoroughness and creativity with
which they solicit potential questions. Inexperienced analysts are frequently
heard to remark that the users don’t know what they want. Of course they don’t;
it’s your job to help them discover what they want.
There’s a trap here, however. Often, the trade-off for flexibility is increased
complexity. As we saw with the address examples, the more ways you want to
slice and dice the data the more exceptions you have to handle, and there comes
a point of diminishing returns.
This leads me to strategy two: find the exceptions. There are two sides to this
strategy: first that it is important to identify all the exceptions, and second that
you must design the system to handle as many exceptions as you can without
confusing users. To illustrate what this means, let’s walk through another
example: personal names.
If your system will be used to produce correspondence, it’s crucial that you get
the name right. (Case in point: any unsolicited mail arriving at my house
addressed to Mr. R. M. Riordan doesn’t even get opened.) Most names are pretty
straightforward. Ms. Jane Q. Public consists of the Title, FirstName,
MiddleInitial, and LastName, right? Wrong. (You saw this coming, didn’t you?)
In the first place, FirstName and LastName are culturally specific. It’s more
correct to use GivenName and Surname. Next, what happens to Sir James
Peddington Smythe, Lord Dunstable? Is Peddington Smythe his Surname or is
Peddington his MiddleName, and what do you do about the “Lord Dunstable”
part? And the singer Sting? Is that a GivenName or a Surname? And what will
happen to The Artist Formerly Known as Prince? Do you really care?
That last question isn’t as flippant as it sounds. A letter addressed to Sir James
Peddington Smythe probably won’t offend anyone. But the gentleman in
question is not Sir Smythe; he’s Sir James, or maybe Lord Dunstable.
Realistically, though, how many of your clients are lords of a realm? The local
musical society is not going to thank you for giving them a membership system
with a screen like the one in Figure 1-4.
So be aware that there’s a trade-off between flexibility and complexity. While it’s
important to catch as many exceptions as possible, it’s perfectly reasonable to
eliminate some of them as too unlikely to be worth the cost of dealing with them.
Distinguishing between entities and attributes is sometimes difficult. Again,
addresses are a good example, and again, your decision must be based on the
problem space. Some designers advocate the creation of a single address entity
used to store all the addresses modeled by the system. From an implementation
viewpoint, this approach has certain advantages in terms of encapsulation and
code reuse. From a design viewpoint, I have some reservations.
It’s unlikely that addresses for employees and customers will be used in the same
way. Mass mailings to employees, for example, are more likely to be done via
internal mail than the postal service. This being the case, the rules and
requirements are different. That awful data entry screen shown in Figure 1-4
might very well be justified for customer addresses, but by using a single address
entity you’re forced to use it for employees as well, where it’s unlikely to be
either necessary or appreciated.
Domains
You might recall from the beginning of this chapter that a relation heading
contains an AttributeName:DomainName pair for each attribute. I said then that
a domain definition specifies the kind of data represented by the attribute. More
particularly, a domain is the set of all possible values that an attribute may
validly contain.
Domains are often confused with data types, but this is inaccurate. Data type is a
physical concept while domain is a logical one. “Number” is a data type, and
“Age” is a domain. To give another example, “StreetName” and “Surname”
might both be represented as text fields, but they are obviously different kinds of
text fields; they belong to different domains.
Domain is also a broader concept than data type in that a domain definition
includes a more specific description of the valid data. Take, for example, the
domain DegreeAwarded, which represents the degrees awarded by a university.
In the database schema, this attribute might be defined as Text[3], but it’s not just
any three-character string, it’s a member of the set {BA, BS, MA, MS, PhD,
LLD, MD}.
Of course, not all domains can be defined by simply listing their values. Age, for
example, contains a hundred or so values if we’re talking about people, but tens
of thousands if we’re talking about museum exhibits. In such instances it’s useful
to define the domain in terms of the rules which can be used to determine the
membership of any specific value in the set of all valid values. For example,
PersonAge could be defined as “an integer in the range 0 to 120,” whereas
ExhibitAge might simply be “an integer equal to or greater than 0.”
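Although neither engine supports domains directly, the rules that define a domain can be approximated with CHECK constraints. The sketch below shows one possible rendering of the DegreeAwarded, PersonAge, and ExhibitAge domains; the table names are hypothetical.

    CREATE TABLE Graduates (
        GraduateID    INT PRIMARY KEY,
        DegreeAwarded CHAR(3)
            CHECK (DegreeAwarded IN ('BA', 'BS', 'MA', 'MS', 'PhD', 'LLD', 'MD'))
    );

    CREATE TABLE People (
        PersonID  INT PRIMARY KEY,
        PersonAge INT CHECK (PersonAge BETWEEN 0 AND 120)   -- "0 to 120"
    );

    CREATE TABLE Exhibits (
        ExhibitID  INT PRIMARY KEY,
        ExhibitAge INT CHECK (ExhibitAge >= 0)              -- "0 or greater"
    );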
“Ah,” I can hear you saying to yourself. “Domain is the combination of the data
type and the validation rule.” Well, if you think of them this way you won’t go
too far wrong. But remember, validation rules are part of the data integrity, not
part of the data description. For example, the validation rule for a zip code might
refer to the State attribute, whereas the domain of ZipCode is “a five-digit string.”
Note that all these definitions make some reference to the kind of data stored
(number or string). This looks a whole lot like a data type, but it isn’t really. Data
types, as I’ve said, are physical; they’re defined and implemented in terms of the
database engine. It would be a mistake, when developing the data model, to define a domain as varchar(30) or a Long Integer; these are engine-specific descriptions.
For any two domains, if it makes sense to compare attributes defined on them
(and, by extension, to perform relational operations such as joins, which we’ll
discuss in Chapter 5), then the two domains are said to be type-compatible. For
example, given the two relations in Figure 1-5, it would be perfectly sensible to
link them on EmployeeID = SalespersonID—you might do this to obtain a list of
invoices for a given employee, for example. The domains EmployeeID and
SalespersonID are type-compatible. Trying to combine the relations on
EmployeeID = OrderDate will probably not result in a meaningful answer, even
if the two domains were defined on the same data type.
Unfortunately, neither the Jet database engine nor SQL Server provides strong
intrinsic support for domains, beyond data types. And even within data types,
neither engine performs strong checking; both will quietly convert data behind
the scenes. For example, if you’ve defined EmployeeID as a long integer in the
Employees table and InvoiceTotal as a currency value in the Invoices table in the
Microsoft Access Northwind sample database, you can create a query linking the
two tables with the criteria WHERE EmployeeID = InvoiceTotal, and Microsoft
Jet will quite happily give you a list of all employees who have an EmployeeID
that matches the total value of an invoice. The two attributes are not type-compatible, but the Jet database engine doesn’t know that.
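Both comparisons look equally respectable in SQL, which is exactly the problem; nothing in the engine’s type checking distinguishes the sensible join from the meaningless one. The column names follow the example above, with SalespersonID assumed to be a column of the Invoices table.

    -- Sensible: EmployeeID and SalespersonID are type-compatible domains.
    SELECT Invoices.*
    FROM Employees
    INNER JOIN Invoices ON Employees.EmployeeID = Invoices.SalespersonID;

    -- Legal but meaningless: the engine quietly converts and compares.
    SELECT Employees.*
    FROM Employees
    INNER JOIN Invoices ON Employees.EmployeeID = Invoices.InvoiceTotal;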
So why bother with domains at all? Because, as we’ll see in Part 2, they’re
extremely useful design tools. “Are these two attributes interchangeable?” “Are
there any rules that apply to one but don’t apply to the other?” These are
important questions when you’re designing a data model, and domain analysis
helps you think about them.
Relational terminology: Records (Tuples)
Figure 1-3 shows a relation with the formal names of the basic components
marked. Those of you who know something about relational design will
recognize that the relation is not in normal form. That’s okay; it still qualifies as
a relation because it’s arranged in row and column format and its values are
scalar.
Figure 1-3. The components of a relation.
The entire structure is, as we’ve said, a relation. Each row of data is a tuple
(rhymes with “couple”). Actually, each row is an n-tuple, but the “n-” is usually
dropped. The number of tuples in a relation determines its cardinality. In this
case, the relation has a cardinality of 11. Each column in the tuple is called an
attribute. The number of attributes in a relation determines its degree. The
example relation has a degree of 3.
The relation is divided into two sections, the heading and the body. The tuples
make up the body, while the heading is composed of, well, the heading. Note
that the label for each attribute is composed of two terms separated by a colon—
for example, UnitPrice:Currency. The first part of the label is the name of the
attribute, while the second part is its domain. The domain of an attribute is the
“kind” of data it represents—in this case, currency. A domain is not the same as a
data type. We’ll be discussing this issue in detail in the next section. The
specification of domain is often dropped from the heading.
The body of the relation consists of an unordered set of zero or more tuples.
There are some important concepts here. First, the relation is unordered. Record
numbers do not apply to relations. Relations have no intrinsic order. Second, a
relation with no tuples (an empty relation) still qualifies as a relation. Third, a
relation is a set. The items in a set are, by definition, uniquely identifiable.
Therefore, for a table to qualify as a relation, each record must be uniquely
identifiable and the table must contain no duplicate records.
If you’ve read the Access or SQL Server documentation, you might be
wondering why you’ve never seen any of these words before. They’re the formal
terminology used in the technical literature, not the terms used by Microsoft. I’ve
included them just so you won’t be embarrassed at cocktail parties (at least not
about n-tuples of third degree, anyway). Remember that relations are purely
conceptual; once instantiated in the database, they become recordsets in
Microsoft Jet and result sets in SQL Server, while for both Microsoft Jet and
SQL Server an attribute becomes a field and a tuple becomes a record. The
correspondences are pretty much one-to-one, but remember that relations are
conceptual and recordsets and result sets are physical.
Relationships (One-to-One, One-to-Many)
Relationships
In addition to the attributes of each entity, a data model must specify the
relationships between entities. At the conceptual level, relationships are simply
associations between entities. The statement “Customers buy products” indicates
that a relationship exists between the entities Customers and Products. The
entities involved in a relationship are called its participants. The number of
participants is the degree of the relationship. (The degree of a relationship is
similar to, but not the same as, the degree of a relation, which is the number of
attributes.)
The vast majority of relationships are binary, like the “Customers buy products”
example, but this is not a requirement. Ternary relationships, those with three
participants, are also common. Given the binary relationships “Employees sell
products” and “Customers buy products,” there is an implicit ternary relationship
“Employees sell products to customers.” However, specifying the two binary
relationships does not allow us to identify which employees sold which products
to which customers; only a ternary relationship can do that.
A special case of a binary relationship is an entity that participates in a
relationship with itself. This is often called the bill of materials relationship and
is most often used to represent hierarchical structures. A common example is the
relationship between employees and managers: any given employee might both
be a manager and have a manager.
The relationship between any two entities can be one-to-one, one-to-many, or
many-to-many. One-to-one relationships are rare, most often being used between
supertype and subtype entities. To return to our earlier example, the relationship
between an employee and that employee’s salesperson details is one-to-one.
One-to-many relationships are probably the most common type. An invoice
includes many products. A salesperson creates many invoices. These are both
examples of one-to-many relationships.
Although not as common as one-to-many relationships, many-to-many
relationships are also not unusual and examples abound. Customers buy many
products, and products are bought by many customers. Teachers teach many
students, and students are taught by many teachers. Many-to-many relationships
can’t be directly implemented in the relational model, but their indirect
implementation is quite straightforward, as we’ll see in Chapter 3.
The participation of any given entity in a relationship can be partial or total. If it
is not possible for an entity to exist unless it participates in the relationship, the
participation is total; otherwise, it is partial. For example, Salesperson details
can’t logically exist unless there is a corresponding Employee. The reverse is not
true. An employee might be something other than a salesperson, so an Employee
record can exist without a corresponding Salesperson record. Thus, the
participation of Employee in the relationship is partial, while the participation of
Salesperson is total.
The trick here is to ensure that the specification of partial or total participation
holds true for all instances of the entity for all time. It’s not unknown for
companies to change suppliers for a product, for example. If the participation of
Products in the “Suppliers provide products” relation has been defined as total, it
won’t be possible to delete the current supplier without deleting the other product
details.
Entity Relationship Diagrams
The Entity Relationship model, which describes data in terms of entities,
attributes, and relations, was introduced by Peter Pin Shan Chen in 1976.
At the
same time, he proposed a method of diagramming called Entity Relationship
(E/R) diagrams, which has become widely accepted. E/R diagrams use
rectangles to describe entities, ellipses for attributes, and diamonds to represent
relationships, as shown in Figure 1-6.
The nature of the relationship between entities (one-to-one, one-to-many, or
many-to-many) is represented in various ways. A number of people use 1 and M
or 1 and ∞ (representing infinity) to represent one and many. I use the “crow’s
foot” technique shown in Figure 1-6, which I find more expressive.
The great advantage of E/R diagrams is that they’re easy to draw and understand.
In practice, though, I usually diagram the attributes separately, since they exist at
a different level of detail.
Generally, one is either thinking about the entities in the model and the
relationships between them or thinking about the attributes of a given entity, but
rarely thinking about both at the same time.
Some basic terminology applies to relationships. The entities that are related are
called participants, and the number of participants in a relationship is its degree.
The vast majority of relationships are binary, having two participants, but unary
relationships (an entity that is related to itself) are also common, while ternary
relationships (with three participants) are not unknown. The examples
throughout most of this chapter are binary. We’ll examine unary and ternary
relationships as special cases later in the chapter.
The participation of an entity in a relationship can be classified as total
participation or partial participation, depending on whether the entity can exist
without participating in the relationship. For example, given the two entities
Customer and Order, the participation of Customer in the relationship is partial
since Customer details might be entered before the customer has placed any
orders. Order, on the other hand, has total participation, since an order can’t be
placed except by a Customer.
The same principle is sometimes used to classify the entities themselves as either
weak (having total participation) or regular (having partial participation). Weak
entities can exist only in relationships with other entities, while regular entities
can exist in isolation. This classification is part of the Entity Relationship (E/R)
diagramming method as originally described by Chen.
Of the three ways to classify a relationship—total or partial, optional or
mandatory, and in terms of weak or regular entities—I have always found
optionality to be the most useful. However, explicitly expressing the effect of the
relationship on the entity can be helpful in large or complex systems.
It’s occasionally useful to divide relationships into “IsA” and “HasA” types. The
concept is straightforward: entity A either IsA B or HasA B. For example, an
Employee IsA SoftballTeam member; the same Employee HasA(n) Address. Of
course, “is” and “has” are not always very good English terms to describe a
relationship. An Employee doesn’t “have” a SalesOrder, he “creates” one; but as
it’s clearly not the case that the Employee is a SalesOrder, the intellectual stretch
isn’t too great.
The classification of participation in a relationship is also an indication of the
optionality of the relationship: whether or not an entity is required to participate
in a given relationship. This is a rather tricky area because implementation by
the database engines doesn’t match the problem domain, as we’ll see when we
discuss the implementation of data integrity in Chapter 4.
The maximum number of instances of one entity that can be associated with an
instance of another entity is what I refer to as the cardinality of a relationship.
(Note that both degree and cardinality have slightly different meanings when
applied to relationships than when applied to relations.) There are three generic
flavors of cardinality: one-to-one, one-to-many, and many-to-many.
I use the notation shown in Figure 3-2 to indicate the cardinality and optionality
of relationships. I find the crow’s foot notation (introduced in Chapter 1) the
most expressive and simplest to explain to clients. Obviously, alternative
techniques are available, and you must use the one that works best for you.
Modeling Relationships
Once you’ve determined that a relationship exists, you must model it by
including attributes from one relation (the primary relation) in the other (the
foreign relation), as shown in Figure 3-3.
You’ll notice some differences between this diagram and the formal E/R diagram
shown in Figure 1-6 (Chapter 1). First, the attributes are not shown as separate
objects. At this level of design, you’re primarily interested in the relationships
between entities, not their composition. I find showing the attributes to be
distracting, and it clutters up the diagram.
Second, the relationships are not labeled. I find labeling unnecessary, and
because the description of the relationship changes depending on which direction
you’re reading it (teachers teach students, but students learn from teachers),
labels can sometimes be confusing. However, although I don’t label the
relationships, I do sometimes indicate the attribute that will be used to
implement the relation in the database schema. This can be useful, for example,
if the primary entity has more than one candidate key and you want to show
explicitly which one is to be used.
As I said, I’ve found this style of diagram useful when working with clients, and
it’s easy to draw by hand or by using a diagramming tool such as Visio
Professional or Micrografx Flowcharter 7. But Microsoft Access, Microsoft SQL
Server, and Microsoft Visual Basic all provide diagramming tools as well, and
you may decide to use them in preference to, or in addition to, the technique I’ve
shown.
The advantage of using either the Access relationships window (for a Jet
database engine .mdb file) or database diagrams (for a database implemented
using SQL Server) is that the diagrams become part of the database and
automatically reflect changes to it. Unfortunately, this is also the greatest
disadvantage of these tools. You can’t build abstract diagrams; you must first
build the physical tables. There is always a danger in jumping into anything that
even looks like implementation too early in the design process, before the
implementation model has been finalized.
In my own work, I often use both abstract diagrams and ones that are embedded
in the database. I create abstract diagrams in the early stages of the design
process, and use one of the Microsoft tools once the conceptual design has been
finalized and I’m documenting the physical database schema.
Of course, you don’t just copy any attributes from the primary relation to the
foreign relation; you must choose attributes that uniquely identify the primary
entity. In other words, you add the attributes that make up the candidate key in
the primary relation to the foreign relation. Not surprisingly, the duplicated
attributes become known as the foreign key in the foreign relation. In the
example shown in Figure 3-3, OrderID—the candidate key of the Orders relation
—has been added to the OrderDetails relation. Orders is the primary relation,
and OrderDetails is the foreign relation.
NOTE
The candidate key/foreign key pair that models the relationship need
not be the primary key of the primary table; any candidate key will
serve. You should use the candidate key that makes the most sense
semantically.
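Expressed as SQL table definitions (a sketch only, with the column lists abbreviated), the candidate key of the primary relation reappears as a foreign key in the foreign relation:

    CREATE TABLE Orders (
        OrderID   INT PRIMARY KEY,
        OrderDate DATETIME
    );

    CREATE TABLE OrderDetails (
        OrderID   INT NOT NULL REFERENCES Orders (OrderID),  -- the foreign key
        ProductID INT NOT NULL,
        Quantity  INT,
        PRIMARY KEY (OrderID, ProductID)
    );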
The choice of primary and foreign relations isn’t arbitrary. It is determined first
by the cardinality of the relationship and second—when there is any doubt—by
the semantics of the data model. For example, given two relations that have a
one-to-many relationship, the relation on the one side is always the primary
relation, while the relation on the many side is always the foreign relation. That
is, a candidate key from the relation on the one side is added (as a foreign key) to
the relation on the many side. We’ll be looking at this issue as we examine each
type of relationship in the rest of this chapter.
Sometimes you’ll want to model not only the fact that a relationship exists, but
also certain properties of the relationship—its duration or its commencement
date, for example. In this case, it’s useful to create an abstract relation
representing the relationship, as shown by the Positions relation in Figure 3-4.
This technique complicates the data model somewhat, and one might be tempted
to simply include the relationship attributes in one of the participating relations.
However, if there are a lot of attributes or a lot of relations with attributes, this
can get unwieldy. More importantly, a distinct relationship entity allows you to
track the history of a relationship. The model shown in Figure 3-4, for example,
allows you to determine an individual’s employment history, which would not
have been possible had Position been made an attribute of the Employees
relation.
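A possible rendering of the abstract relationship entity, assuming (purely for illustration) that Positions links Employees to JobTitles and that both of those tables already exist:

    CREATE TABLE Positions (
        EmployeeID INT NOT NULL REFERENCES Employees (EmployeeID),
        JobTitleID INT NOT NULL REFERENCES JobTitles (JobTitleID),
        StartDate  DATETIME NOT NULL,
        EndDate    DATETIME,             -- NULL while the position is current
        PRIMARY KEY (EmployeeID, JobTitleID, StartDate)
    );

Because each change of position adds a row rather than overwriting an attribute, the employment history is preserved.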
Abstract relationship entities are also useful when you need to track the way a
relationship changes over time. Figure 3-5 is an example of a State Transition
diagram describing the possible legal changes in an individual’s marital status.
State Transition diagrams are not difficult to understand. Each vertical line
indicates a valid change in state. For example, an individual can go from married
to divorced and vice versa, but not from divorced to never married. Now, if all
you need to model is an individual’s current marital state, you don’t need to
implement an abstract relationship entity to make sure only valid changes are
made. But if you ever need to know that John and Mary Smith were married in
1953 and divorced in 1972, and that Mary remarried in 1975 but was widowed
in 1986, then you’ll need an abstract relationship entity to track that.
One-to-One Relationships
Perhaps the simplest type of relationship is the one-to-one relationship. If it’s
true that any instance of entity X can be associated with only one instance of
entity Y, then the relationship is one-to-one. Most IsA relationships will be one-to-one, but otherwise, examples of one-to-one relationships are fairly rare in the
problem domain. When choosing a one-to-one relationship between entities, you
need to be sure that the relationship is either true for all time or, if it does
change, that you don’t care about past values. For example, say you’re modeling
the office space in a building. Assuming that there is one person to an office,
there’s a one-to-one relationship, as shown in Figure 3-6.
But the relationship between an employee and an office is true only at a specific
moment in time. Over time, different employees will be assigned to the office.
(The arrangement of offices in the building might change as well, but that’s a
different problem.) If you use the one-to-one relationship shown in Figure 3-6,
you will have a simple, clean model of the building, but you’ll have no way of
determining the history of occupancy. You might not care. If you’re building a
system for a mail room, you need to know where to send Jane Doe’s mail today,
not where it would have been sent three months ago. But if you’re designing a
system for a property manager, you can’t lose this historical information—the
system will be asked to determine, for example, how often tenants change.
Although one-to-one relationships are rare in the real world, they’re very
common and useful abstract concepts. They’re most often used to either reduce
the number of attributes in a relation or model subclasses of entities. There is a
physical limitation of 255 fields per table if you’re using the Jet database engine,
and 250 fields per table if you’re using SQL Server. I’m suspicious—very
suspicious—of any data model that exceeds these limitations. But I have
occasionally seen systems, usually in science and medicine, where the entities
had more than 255 genuine attributes. In these cases, you have no choice but to
create a new relation with some arbitrary subset of attributes and to create a one-to-one relationship between it and the original, controlling relation.
Another problem domain that often appears to require that the physical
limitations on table size be exceeded is the modeling of tests and questionnaires.
Given a test with an arbitrary number of questions, you might be tempted to
model an individual’s responses as shown in Figure 3-7.
This structure is easy to implement, but it is not generally the best solution. The
answer attributes are a repeating group, and the relationship is therefore not in
first normal form. A better model is shown in Figure 3-8.
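A sketch of the normalized structure, with hypothetical names: one row per answer rather than one answer column per question, so a test can have any number of questions without changing the schema.

    CREATE TABLE Respondents (
        RespondentID INT PRIMARY KEY,
        Name         VARCHAR(100)
    );

    CREATE TABLE Responses (
        RespondentID   INT NOT NULL REFERENCES Respondents (RespondentID),
        QuestionNumber INT NOT NULL,
        Answer         VARCHAR(255),
        PRIMARY KEY (RespondentID, QuestionNumber)
    );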
Subclassing Entities
A more interesting use of one-to-one relationships is for entity subclassing, a
concept borrowed from object-oriented programming. To see some of the
benefits of subclassing entities, let’s first look at a more traditional
implementation. In the Microsoft Access Northwind sample database, each
product is assigned to a product category, as shown in Figure 3-9.
Having a Categories relation allows the products to be grouped for reporting
purposes and might be all that is required by your problem space. But with this
design, you can treat a product only as a product, not as an instance of its
specific category. Any attributes defined for Products are stored for all products,
whatever their type. This isn’t a very close match to the problem domain—
Beverages intrinsically have different attributes than Condiments.
You might be tempted to model the Northwind product list as shown in Figure 3-10. This model allows us to store all the specific information for each product
type, one type per relation, but makes it difficult to treat a product as a product.
Imagine, for example, the process of checking that a product code entered by a
user is correct: “if the code exists in the x relation, or the y relation, or…”. This
is as ugly as the repeating group query in Chapter 2. Also, you might run into
integrity problems with this structure if you have certain attributes that apply
only to one product category (UnitsPerPackage, for example, which might
pertain to Beverages but not DairyProducts) and the category of a particular
product changes. What do you do in these circumstances? Throw away the old
values? But what if the change was accidental, and the user immediately changes
it back?
Subclassing the product entity provides the best of both worlds. You can capture
information specific to certain product categories without losing the ability to
treat the products as the generic type when that’s appropriate, and you can defer
the deletion of the no-longer-applicable information until you’re certain it really
is no longer applicable. Figure 3-11 shows a model developed using entity
subclasses.
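One common way to implement the subclassed model is to give each subclass relation the same key as the generic Products relation, so that each subclass row extends exactly one product. The attribute names below are illustrative only.

    CREATE TABLE Products (
        ProductID   INT PRIMARY KEY,
        ProductName VARCHAR(40),
        UnitPrice   MONEY
    );

    -- Each subclass shares the supertype's key, giving a one-to-one link.
    CREATE TABLE Beverages (
        ProductID       INT PRIMARY KEY REFERENCES Products (ProductID),
        UnitsPerPackage INT,
        AlcoholContent  DECIMAL(4, 2)
    );

    CREATE TABLE Condiments (
        ProductID  INT PRIMARY KEY REFERENCES Products (ProductID),
        SpiceLevel VARCHAR(20)
    );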
It must be said that while entity subclassing is an elegant solution to certain
types of data modeling problems, it can be an awkward beast to implement. To
take just one example, a report containing Product details would need to include
conditional processing to display only the fields appropriate to the current
subclass. This isn’t an insurmountable task by any means, but it is a
consideration. Under most circumstances, I wouldn’t recommend that you
compromise the data model to make life easier for the programmers. But there’s
certainly no point in adding the complexity of subclassing to the model if all you
need is the ability to group or categorize entities for reporting purposes; in this
situation, the structure shown in Figure 3-9 is perfectly adequate and far more
sensible.
Identifying the primary and foreign relations in a one-to-one relationship can
sometimes be tricky, as you must base the decision on the semantics of the data
model. If you’ve chosen this structure in order to subclass the entity, the generic
entity becomes the primary relation and each of the subclasses becomes a
foreign relation.
NOTE
In this situation, the foreign key that the subclasses acquire is often also
the candidate key of the subclasses. There is rarely a reason for
subclasses to have their own identifiers.
If, on the other hand, you’re using one-to-one relationships to avoid field
limitations, or the entities have a genuine one-to-one relationship in the problem
space, the choice must be somewhat arbitrary. You must choose the primary
relation based on your understanding of the problem space.
One thing that can help in this situation is the optionality of the relationship. If
the relationship is optional on one side only (and I’ve never seen a model where
it was optional on both sides), the relation on the optional side is the foreign
relation. In other words, if only one of the entities is weak and the other regular,
the regular entity is the primary relation and the weak entity is the foreign
relation.
One-to-Many Relationships
The most common type of relationship between entities is one-to-many, wherein
a single instance of one entity can be associated with zero, one, or many
instances of another entity. The majority of the normalization techniques
discussed in Chapter 2 result in relations with one-to-many relationships
between them.
One-to-many relationships present few problems once they’ve been identified.
However, it’s important to be careful in specifying the optionality on each side of
the relationship. It’s commonly thought that only the many side of the
relationship can be optional, but this isn’t the case. Take, for example, the
relationship shown in Figure 3-12.
The relationship between Client and CustomerServiceRep is optional in both
directions. In English, this would be expressed as “A CustomerServiceRep can
have zero or more clients. A client’s CustomerServiceRep, if one has been
assigned, must be present in the CustomerServiceRep relation.” Specifying
optionality on the one side of a one-to-many relationship has important
implications for both the implementation and usability of the system. We’ll
discuss these issues in detail in Chapters 4 and 14, but understand here that
relational theory does not require that the one side of a one-to-many relationship
be mandatory.
Identifying the primary and foreign relations in a one-to-many relationship is
easy. The entity on the one side of the relationship is always the primary relation;
its candidate key is copied to the relation on the many side, which becomes the
foreign relation. The candidate key of the primary relation often forms part of the candidate key for the relation on the many side, but it can never uniquely identify the tuples of the foreign relation by itself. It must be combined with one or more other attributes to form a candidate key.
Many-to-Many Relationships
Many-to-many relationships exist aplenty in the real world. Students take many
courses; any given course is attended by many students. Customers shop in
many stores; a store has many customers, or so one hopes! But many-to-many
relationships can’t be implemented in a relational database. Instead, they are
modeled using an intermediary relation that has a one-to-many relationship with
each of the original participants, as shown in Figure 3-13. Such an intermediary
relation is usually called a junction table, even when working at the data model
level, where of course we’re talking about relations, not tables.
Since a many-to-many relationship is modeled as two one-to-many relationships,
determining the primary and foreign relations is straightforward. As we’ve seen,
the relation on the one side of a one-to-many relationship is always the primary
relation. This means that each of the original entities will become a primary
relation, and the junction table will be the foreign relation, receiving the
candidate keys of the relations on each side of it.
Junction tables most often contain only the candidate keys of the two original
participants, but they are really just a special case of the abstract relationship
entities discussed earlier. As such, they can contain whatever additional
attributes are appropriate.
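A sketch of the “students take courses” example; the names are hypothetical, and the EnrollmentDate column shows how a junction table can carry attributes of the relationship itself.

    CREATE TABLE Students (
        StudentID INT PRIMARY KEY,
        Name      VARCHAR(100)
    );

    CREATE TABLE Courses (
        CourseID INT PRIMARY KEY,
        Title    VARCHAR(100)
    );

    -- The junction table receives the candidate keys of both participants.
    CREATE TABLE StudentCourses (
        StudentID      INT NOT NULL REFERENCES Students (StudentID),
        CourseID       INT NOT NULL REFERENCES Courses (CourseID),
        EnrollmentDate DATETIME,
        PRIMARY KEY (StudentID, CourseID)
    );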
Unary Relationships
All the relationships discussed so far have been binary relationships having two
participants. Unary relationships have only one participant—the relation is
associated with itself. The classic example of a unary relationship is Employee
to Manager. One’s manager is, in most cases, also an employee with a manager
of his or her own.
Unary relationships are modeled in the same way as binary relationships—the
candidate key of the primary relation is added to the foreign relation. The only
difference is that the primary and foreign relations are the same. Thus, if the
candidate key of the Employee relation is EmployeeID, declared on the
EmployeeID domain, then you’d add an attribute to the relation called perhaps
ManagerID, also declared on the EmployeeID domain, as shown in Figure 3-14.
Unary relationships can be of any cardinality. One-to-many unary relationships
are used to implement hierarchies, such as the organizational hierarchy implicit
in the Employee-Manager relationship. Many-to-many unary relationships, like
their binary counterparts, must be modeled with a junction table. Unary
relationships can also be optional on the one side, as shown in Figure 3-14. The
CEOs of most organizations do not have a manager. (The stockholders don’t
count unless they’re an independent part of the data model.)
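The Employee-Manager hierarchy might be declared as follows; ManagerID is defined on the same domain as EmployeeID and is nullable so that the CEO need not have a manager. The Name column is an assumption added for the sake of the query.

    CREATE TABLE Employees (
        EmployeeID INT PRIMARY KEY,
        Name       VARCHAR(100),
        ManagerID  INT NULL REFERENCES Employees (EmployeeID) -- NULL for the CEO
    );

    -- A self-join pairs each employee with his or her manager.
    SELECT E.Name AS Employee, M.Name AS Manager
    FROM Employees AS E
    LEFT JOIN Employees AS M ON E.ManagerID = M.EmployeeID;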
Ternary Relationships
Ternary relationships are usually in the form X does Y to Z, and like many-to-many relationships they can’t be directly modeled in a relational database. Unlike
many-to-many relationships, however, there isn’t a single recipe for modeling
them.
In Figure 3-15, we can see that the Mozzarella di Giovanni purchased by Vins et
alcools Chevalier is supplied by both Formaggi Fortini s.r.l. and Forêts d’érables,
but there is no way to determine which of them supplied the specific cheese that
was shipped to Vins et alcools Chevalier. A ternary relationship has been lost in
the data model. Suppliers don’t merely supply products, they supply products
that are purchased by specific customers.
To understand the problem, it’s useful to first examine the relationships in a more
typical problem space, as shown in Figure 3-16.
In this diagram, each product is provided by only a single supplier and the
ternary relationship is maintained—if you know the product, you know who
supplied it. In Figure 3-17, however, each product is provided by multiple
suppliers and the ternary relationship has been lost.
The key to resolving the problem is to examine the direction of the one-to-many
relationships. Given any entity on the many side, you can determine the
corresponding entity on the one side. Thus, given a specific OrderDetails item in
Figure 3-16 you can determine the Orders item to which it belongs, and knowing
the Orders item you can determine the Customers. The process works in the
other direction as well, of course: knowing the OrderDetails item, you can
determine the product and then the supplier.
But the reverse is not true. Having identified an entity on the one side of the
relationship, you can’t select a single entity on the many side. This is the problem
with Figure 3-17. Knowing an OrderDetails item, you can determine the product,
but knowing the product, you can’t determine to which SupplierProducts entity
it’s linked.
An easy way to think of this is that you can’t change directions from one-to-many to many-to-one more than once in a relationship chain. The chain in Figure
3-16 changes directions only once, at OrderDetails. The chain in Figure 3-17
changes directions twice, at OrderDetails and again at SupplierProducts.
The solution is to eliminate the Products entity from the chain, as shown in
Figure 3-18.
The chain now changes direction only once, at OrderDetails, and the relationship
has been maintained. Notice, however, that the Products entity hasn’t been
eliminated. Chances are good that orders will still be placed for Products rather
than for SupplierProducts, and maintaining the Products entity allows the user
interface to accommodate this.
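A sketch of the corrected chain: OrderDetails references SupplierProducts rather than Products directly, so the supplier of each purchased item is preserved. The column lists are abbreviated and assume that Suppliers, Products, and Orders tables already exist; this version of OrderDetails supersedes the simpler one sketched earlier.

    CREATE TABLE SupplierProducts (
        SupplierID INT NOT NULL REFERENCES Suppliers (SupplierID),
        ProductID  INT NOT NULL REFERENCES Products (ProductID),
        PRIMARY KEY (SupplierID, ProductID)
    );

    CREATE TABLE OrderDetails (
        OrderID    INT NOT NULL REFERENCES Orders (OrderID),
        SupplierID INT NOT NULL,
        ProductID  INT NOT NULL,
        Quantity   INT,
        PRIMARY KEY (OrderID, SupplierID, ProductID),
        FOREIGN KEY (SupplierID, ProductID)
            REFERENCES SupplierProducts (SupplierID, ProductID)
    );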
Of course, you might not care about the ternary relationships in your problem
space, or there may be some other way of tracing the relationship when
necessary, such as a lot number on the packaging. In this case, there is absolutely
no need to model it. It’s important to understand that the model in Figure 3-18
isn’t intrinsically better or more correct than that in Figure 3-17. You must
choose the model that best reflects the semantics of your problem space.
Relationships of Known Cardinality
Occasionally the minimum, absolute, or maximum number of tuples on the
many side of a one-to-many relationship will be known in advance. There are
five periods in the school day, 200 bones in the adult skeleton, and a golfer is
allowed to carry only 14 clubs during tournament play.
It’s always tempting to model this situation by including the candidate key of
each tuple as an attribute in the relation on the one side, as shown in Figure 3-19,
but there are two major problems with this approach. First, it’s a repeating group,
and second, it’s unreliable.
The repeating nature of the attributes is disguised in Figure 3-19 by the attribute
names, but they’re all defined on the same domain, ClassPeriod. Whenever you
have multiple attributes defined on the same domain, chances are good that you
have category or type values masquerading as attribute names.
In addition to this theoretical problem, structures like this are unreliable. It might
very well be company policy, for example, that managers have a maximum of
five employees reporting directly to them, but policy isn’t reality. By embedding
the policy in the data model, you’re implementing it as a non-negotiable system
constraint. I’d be willing to bet that you’ll discover during initial data entry that at
least one manager has six direct reports. What happens then? Will somebody
suddenly have a new boss? Will the manager get entered (and maybe paid)
twice? Or will the programmer who got a support call at 3 A.M. say some very
uncharitable things about you?
Limitations on cardinality such as these must be implemented as system
constraints; they shouldn’t be embedded in the structure of the relations
themselves. Furthermore, as we’ll discuss in Chapter 16, you should think long
and hard about the impact on usability before implementing limitations on
cardinality at all.
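If the five-report policy really must be enforced, one way to express it as a system constraint rather than as table structure is a trigger. This is a SQL Server-flavored sketch only (the Employees and ManagerID names are assumptions, and Microsoft Jet would need the check in the front end instead):

CREATE TRIGGER trg_MaxDirectReports ON Employees
AFTER INSERT, UPDATE
AS
IF EXISTS (
    SELECT ManagerID
    FROM Employees
    WHERE ManagerID IS NOT NULL
    GROUP BY ManagerID
    HAVING COUNT(*) > 5
)
BEGIN
    RAISERROR ('A manager may have at most five direct reports.', 16, 1);
    ROLLBACK TRANSACTION;   -- reject the change instead of distorting the structure
END;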
Understanding ER notation
Entity Relationship Diagrams
The Entity Relationship model, which describes data in terms of entities,
attributes, and relations, was introduced by Peter Pin Shan Chen in 1976.
At the
same time, he proposed a method of diagramming called Entity Relationship
(E/R) diagrams, which has become widely accepted. E/R diagrams use
rectangles to describe entities, ellipses for attributes, and diamonds to represent
relationships, as shown in Figure 1-6.
The nature of the relationship between entities (one-to-one, one-to-many, or
many-to-many) is represented in various ways. A number of people use 1 and M
or 1 and ∞ (representing infinity) to represent one and many. I use the “crow’s
foot” technique shown in Figure 1-6, which I find more expressive.
The great advantage of E/R diagrams is that they’re easy to draw and understand.
In practice, though, I usually diagram the attributes separately, since they exist at
a different level of detail.
Generally, one is either thinking about the entities in the model and the
relationships between them or thinking about the attributes of a given entity, but
rarely thinking about both at the same time.
———-|-  : One
Understanding normalization concept
The principles of normalization discussed in the rest of this chapter are tools for
controlling the structure of data in the same way that a paper clip controls sheets
of paper. The normal forms (we’ll discuss six) specify increasingly stringent
rules for the structure of relations. Each form extends the previous one in such a
way as to prevent certain kinds of update anomalies.
Bear in mind that the normal forms are not a prescription for creating a “correct”
data model. A data model could be perfectly normalized and still fail to answer
the questions asked of it; or, it might provide the answers, but so slowly and
awkwardly that the database system built around it is unusable. But if your data
model is normalized—that is, if it conforms to the rules of relational structure—
the chances are high that the result will be an efficient, effective data model.
Before we turn to normalization, however, you should be familiar with a couple
of underlying principles.
Lossless Decomposition
The relational model allows relations to be joined in various ways by linking
attributes. The process of obtaining a fully normalized data model involves
removing redundancy by dividing relations in such a way that the resultant
relations can be recombined without losing any of the information. This is the
principle of lossless decomposition.
Using two relations eliminates the redundant addresses but still allows you to
find a customer’s address simply by looking up the CustomerNumber, which is
stored in both the Customers and Invoices recordsets.
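In SQL terms, the decomposition is lossless because a simple join on the shared CustomerNumber attribute reconstructs the original information (the other column names here are assumptions):

SELECT i.InvoiceNumber, i.InvoiceDate, c.CustomerNumber, c.Address
FROM Invoices AS i
INNER JOIN Customers AS c
    ON c.CustomerNumber = i.CustomerNumber;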
Candidate Keys and Primary Keys
In Chapter 1, I defined a relation body as an unordered set of zero or more tuples
and pointed out that by definition each member of a set is unique. This being the
case, for any relation there must be some combination of attributes that uniquely
identifies each tuple. This set of one or more attributes is called a candidate key.
There might be more than one candidate key for any given relation, but it must
always be the case that each candidate key uniquely identifies each tuple, not
just for any specific set of tuples but for all possible tuples for all time. The
inverse of this principle must also be true, by the way. Given any two tuples with
the same candidate key, both tuples must represent the same entity. The
implication of this statement is that you cannot determine a candidate key by
inspection. Just because some field or combination of fields is unique for a given
set of tuples, you cannot guarantee that it will be unique for all tuples, which it
must be to qualify as a candidate key. Once again, you must understand the
semantics of the data model.
Consider the Invoices relation shown in Figure 2-9. The CustomerID is unique
in the example, but it’s extremely unlikely that it will remain that way—and
almost certainly not the intention of the company! Despite appearances, the
semantics of the model tell us that this field is not a candidate key.
By definition, all relations must have at least one candidate key: the set of all
attributes comprising the tuple. Candidate keys can be composed of a single
attribute (a simple key) or of multiple attributes (a composite key). However, an
additional requirement of a candidate key is that it must be irreducible, so the set
of all attributes is not necessarily a candidate key. In the relation shown in Figure
2-10, the attribute CategoryID is a candidate key, but the set {CategoryID,
CategoryName}, although it is unique, is not a candidate key, since the
CategoryName attribute is unnecessary.
It is sometimes the case—although it doesn’t happen often—that there are
multiple possible candidate keys for a relation. In this case, it is customary to
designate one candidate key as a primary key and consider other candidate keys
alternate keys. This is an arbitrary decision and isn’t very useful at the logical
level. (Remember that the data model is purely abstract.) To help maintain the
distinction between the model and its physical implementation, I prefer to use
the term “candidate key” at the data model level and reserve “primary key” for
the implementation.
NOTE
When the only possible candidate key is unwieldy—it requires too
many fields or is too large, for example—you can use a data type that
both the Microsoft Jet database engine and Microsoft SQL Server
provide for creating artificial keys with values that will be generated by
the system. Called AutoNumber fields in Microsoft Jet and Identity
fields in SQL Server, fields based on this data type are terrifically useful
tools, provided you don’t try to make them mean anything. They’re just
tags. They aren’t guaranteed to be sequential, you have very little
control over how they’re generated, and if you try to use them to mean
anything you’ll cause more problems than you solve.
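For what it’s worth, a system-generated key might be declared like this in SQL Server (the Jet equivalent uses the COUNTER/AUTOINCREMENT data type); the table and column names are illustrative:

CREATE TABLE Customers (
    CustomerID   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- just a tag; it means nothing
    CustomerName VARCHAR(50)       NOT NULL
);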
Although choosing candidate keys is a semantic process, don’t assume that the
attributes you use to identify an entity in the real world will make an appropriate
candidate key. Individuals, for example, are usually referred to by their names,
but a quick look at any phone book will establish that names are hardly unique.
Of course, the name must provide a candidate key when combined with some
other set of attributes, but this can be awkward to determine. I once worked in an
office with about 20 people, of whom two were named Larry Simon and one was
named Lary Simon. They were “Short Lary,” “German Larry,” and “Blond
Larry”; that’s height, nationality, and hair color combined with name, hardly a
viable candidate key. In situations like this, it’s probably best to use a system-generated ID number, such as an AutoNumber or Identity field, but remember,
don’t try to make it mean anything!
Functional Dependency
The concept of functional dependency is an extremely useful tool for thinking
about data structures. Given any tuple T, with two sets of attributes {X1…Xn}
and {Y1…Yn} (the sets need not be mutually exclusive), then set Y is
functionally dependent on set X if, for any legal value of X, there is only one
legal value for Y.
For example, in the relation shown in Figure 2-10 every tuple that has the same
values for {CategoryID} will have the same value for {CategoryName,
Description}. We can therefore say that the attribute CategoryID functionally
determines {CategoryName, Description}. Note that functional dependency
doesn’t necessarily work the other way: knowing a value for {CategoryName,
Description} won’t allow us to determine the corresponding value for
{CategoryID}.
You can indicate the functional dependency between sets of attributes as shown
in Figure 2-11. In text, you can express functional dependencies as X → Y,
which reads “X functionally determines Y.”
Functional dependency is interesting to academics because it provides a
mechanism for developing something that starts to resemble a mathematics of
data modeling. You can, for example, discuss the reflexivity and transitivity of a
functional dependency if you are so inclined.
In practical application, functional dependency is a convenient way of
expressing what is a fairly self-evident concept: given any relation, there will be
some set of attributes that is unique to each tuple, and knowing those it is
possible to determine those attributes that are not unique.
If {X} is a candidate key, then all attributes {Y} must necessarily be functional
dependent on {X}; this follows from the definition of candidate key. If {X} is
not a candidate key and the functional dependency is not trivial (that is, {Y} is
not a subset of {X}), then the relation will necessarily involve some redundancy,
and further normalization will be required. In a sense, data normalization is the
process of ensuring that, as shown in Figure 2-11, all the arrows come out of
candidate keys.
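A suspected functional dependency can also be checked against data you already have. The query below (against a hypothetical denormalized Products relation that carries both CategoryID and CategoryName) returns a row for any CategoryID associated with more than one CategoryName; an empty result is consistent with {CategoryID} → {CategoryName}:

SELECT CategoryID
FROM Products
GROUP BY CategoryID
HAVING COUNT(DISTINCT CategoryName) > 1;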
First Normal Form
A relation is in first normal form if the domains on which its attributes are
defined are scalar. This is at once both the simplest and most difficult concept in
data modeling. The principle is straightforward: each attribute of a tuple must
contain a single value. But what constitutes a single value? In the relation shown
in Figure 2-12, the Items attribute obviously contains multiple values and is
therefore not in first normal form. But the issue is not always so clear cut.
We saw some of the problems involved in determining whether an attribute is
scalar when we looked at the modeling of names and addresses in Chapter 1.
Dates are another tricky domain. They consist of three distinct components: the
day, the month, and the year. Ought they be stored as three attributes or as a
composite? As always, the answer can be determined only by looking to the
semantics of the problem space you’re modeling.
If your system uses the date exclusively, or even primarily, as a single value, it is
scalar. But if your system must frequently manipulate the individual components
of the date, you might be better off storing them as separate attributes. You might
not care about the day, for example, but only the month and year. Or you might
care only about the month and day, but not the year. This isn’t often the case, but
it happens.
Date arithmetic is tedious to perform, so you’ll most often use an attribute
defined on the DateTime data type to offload the majority of the work to the
development environment. However, this can get you into trouble if you’re doing
comparisons on a single component. This is particularly true if you’re ignoring
the time component of a field. For example, if you set the value of a DateCreated
field to the result of the VBA function Now, which returns both date and time,
and then later attempt to compare it to the value returned by Date(), which
returns the date only, you might get incorrect results. Even if you never display
the time to users, it is being stored, and obviously “1/1/1999 12:30:19 AM” isn’t
the same as “1/1/1999”.
Another place people frequently have problems with nonscalar values is with
codes and flags. Many companies assign case numbers or reference numbers that
are calculated values, usually something along the lines of REF0010398, which
might indicate that this is the first case opened in March 1998. While it’s
unlikely that you’ll be able to alter company policy, it’s not a good idea to
attempt to manipulate the individual components of the reference number in
your data model.
It’s far easier in the long run to store the values separately: {Ref#, Case#, Month,
Year}. This way, determining the next case number or the number of cases
opened in a given year becomes a simple query against an attribute and doesn’t
require additional manipulation. This has important performance implications,
particularly in client/server environments, where extracting a value from the
middle of an attribute might require that each individual record be examined
locally rather than offloaded to the database server.
Another type of nonscalar attribute that causes problems for people is the bit
flag. In conventional programming environments, it’s common practice to store
sets of Boolean values as individual bits in a word, and then to use bitwise
operations to check and test them. Windows API programming relies heavily on
this technique, for example. In conventional programming environments, this is
a perfectly sensible thing to do. In relational data models, it is not. Not only does
the practice violate first normal form, but you can’t actually do it, since neither
the Jet database engine nor SQL Server versions of the SQL language provide
bitwise operators. You can accomplish this using custom functionality in the
database application, but only in Microsoft Access (Microsoft Visual Basic
doesn’t support custom functions in queries), and it forces the query to be
processed locally.
Unfortunately, this is the kind of constraint that frequently gets imposed on you
for historical reasons, but if you’ve any choice in the matter, don’t encode more
than one piece of information in a single attribute. If you’re using legacy
information, you can always unpack the data and store both versions in the
recordset.
There’s another kind of nonscalar value to be wary of when checking a relation
for first normal form: the repeating group. Figure 2-13 shows an Invoice
relation. Someone, at some point, decided that customers are not allowed to buy
more than 5 items. I wonder if they checked that decision with the sales manager
first? Seriously, this is almost certainly an artificial constraint imposed by the
system, not the business. Artificial system constraints are evil, and in this case,
just plain wrong as well.
Another example of a repeating group is shown in Figure 2-14. This isn’t as
obvious an error, and many successful systems have been implemented using a
model similar to this. But this is really just a variation of the structure shown in
Figure 2-13 and has the same problems. Imagine the query to determine which
products exceeded target by more than 10 percent any time in the first quarter.
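The usual cure for a repeating group is a child relation, with one tuple per repeated value. A sketch, loosely following Figure 2-13 (names and columns are assumptions):

CREATE TABLE Invoices (
    InvoiceNumber INT      NOT NULL PRIMARY KEY,
    CustomerID    INT      NOT NULL,
    InvoiceDate   DATETIME NOT NULL
);

CREATE TABLE InvoiceItems (
    InvoiceNumber INT NOT NULL REFERENCES Invoices (InvoiceNumber),
    LineNumber    INT NOT NULL,   -- any number of items, not an arbitrary five
    ProductID     INT NOT NULL,
    Quantity      INT NOT NULL,
    PRIMARY KEY (InvoiceNumber, LineNumber)
);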
Second Normal Form
A relation is in second normal form if it is in first normal form and in addition all
its attributes are dependent on the entire candidate key. The key in Figure 2-15,
for example, is {ProductName, SupplierName}, but the SupplierPhoneNumber
field is dependent only on the SupplierName, not on the full composite key.
We’ve already seen that this causes redundancy, and that the redundancy can, in
turn, result in unpleasant maintenance problems. A better model would be that
shown in Figure 2-16.
Logically, this is an issue of not trying to represent two distinct entities, Products
and Suppliers, in a single relation. By separating the representation, you’re not
only eliminating the redundancy, you’re also providing a mechanism for storing
information that you couldn’t otherwise capture. In the example in Figure 2-16, it
becomes possible to capture information about Suppliers before obtaining any
information regarding their products. That could not be done in the first relation,
since neither component of a primary key can be empty.
The other way that people get into trouble with second normal form is in
confusing constraints that happen to be true at any given moment with those that
are true for all time. The relation shown in Figure 2-17, for example, assumes
that a supplier has only one address, which might be true at the moment but will
not necessarily remain true in the future.
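A sketch of the Figure 2-16 shape in SQL (the column lists are assumptions): supplier facts live with Suppliers, and each product carries only a foreign key, so a supplier can be recorded before any of its products exist.

CREATE TABLE Suppliers (
    SupplierID          INT         NOT NULL PRIMARY KEY,
    SupplierName        VARCHAR(50) NOT NULL,
    SupplierPhoneNumber VARCHAR(25) NULL
);

CREATE TABLE Products (
    ProductID   INT         NOT NULL PRIMARY KEY,
    ProductName VARCHAR(50) NOT NULL,
    SupplierID  INT         NOT NULL REFERENCES Suppliers (SupplierID)
);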
Third Normal Form
A relation is in third normal form if it is in second normal form and in addition
all nonkey attributes are mutually independent. Let’s take the example of a
company that has a single salesperson in each state. Given the relation shown in
Figure 2-18, there is a dependency between Region and Salesperson, but neither
of these attributes is reasonably a candidate key for the relation.
It’s possible to get really pedantic about third normal form. In most places, for
example, you can determine a PostalCode value based on the City and Region
values, so the relation shown in Figure 2-19 is not strictly in third normal form.
The two relations shown in Figure 2-20 are technically more correct, but in
reality the only benefit you’re gaining is the ability to automatically look up the
PostalCode when you’re entering new records, saving users a few keystrokes.
This isn’t a trivial benefit, but there are probably better ways to implement this
functionality, ones that don’t incur the overhead of a relation join every time the
address is referenced.
As with every other decision in the data modeling process, when and how to
implement third normal form can only be determined by considering the
semantics of the model. You’ll create a separate relation only when the entity is
important to the model, or the data changes frequently, or you’re certain there are
technical implementation advantages. Postal codes do change, but not often; and
they aren’t intrinsically important in most systems.
Further Normalization
The first three normal forms were included in Codd’s original formulation of
relational theory, and in the vast majority of cases they’re all you’ll need to worry
about. Just remember the jingle I learned in grad school: “The key, the whole
key, and nothing but the key, so help me Codd.”
The further normal forms—Boyce/Codd, fourth, and fifth—have been developed
to handle special cases, most of which are rare.
Boyce/Codd Normal Form
Boyce/Codd normal form, which is considered a variation of third normal form,
handles the special case of relations with multiple candidate keys. In fact, for
Boyce/Codd normal form to apply, the following conditions must hold true:
- The relation must have two or more candidate keys.
- At least two of the candidate keys must be composite.
- The candidate keys must have overlapping attributes.
The easiest way to understand Boyce/Codd normal form is to use functional
dependencies. Boyce/Codd normal form states, essentially, that there must be no
functional dependencies between candidate keys. Take, for example, the relation
shown in Figure 2-21. The relation is in third normal form (assuming supplier
names are unique), but it still contains significant redundancy.
The two candidate keys in this case are {SupplierID, ProductID} and
{SupplierName, ProductID}, and the functional dependency diagram is shown in
Figure 2-22.
As you can see, there is a functional dependency {SupplierID} → {SupplierName},
which is in violation of Boyce/Codd normal form. A correct
model is shown in Figure 2-23.
Violations of Boyce/Codd normal form are easy to avoid if you pay attention to
what the relation is logically about. If Figure 2-21 is about Products, the supplier
information shouldn’t be there (and vice versa, of course).
Fourth Normal Form
Fourth normal form provides a theoretical foundation for a principle that is
intuitively obvious: independent repeating groups should not be combined in a
single relation. By way of example, let’s assume that the own-brand products
sold by Northwind Traders come in multiple package sizes, that they are sourced
from multiple suppliers, and that all suppliers provide all pack sizes. A
completely unnormalized version of the Products relation might look like Figure
2-24.
Now, the first step in normalizing this relation is to eliminate the nonscalar
PackSize attribute, resulting in the relation shown in Figure 2-25.
Surprisingly, Figure 2-25 is in Boyce/Codd normal form, since it is “all key.” But
there are clearly redundancy problems, and maintaining data integrity could be a
nightmare. The resolution to these problems lies in the concept of multivalued
dependency pairs and fourth normal form.
A multivalued dependency pair is two mutually independent sets of attributes. In
Figure 2-24, the multivalued dependency is {ProductName} →→ {PackSize}|{SupplierName},
which is read “Product multidetermines PackSize and
Supplier.” Fourth normal form states, informally, that multivalued dependencies
must be divided into separate relations, as shown in Figure 2-26. Formally, a
relation is in fourth normal form if it is in Boyce/Codd normal form and in
addition all the multivalued dependencies are also functional dependencies out
of the candidate keys.
The important thing to understand about fourth normal form is that it comes into
play only if there are multiple values for the attributes. If each product in the
example above had only a single pack size or a single supplier, fourth normal
form would not apply. Similarly, if the two sets of attributes are not mutually
independent, the relation is most likely in violation of second normal form.
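In SQL, the Figure 2-26 resolution amounts to two independent relations, one per multivalued fact (names are assumptions):

CREATE TABLE ProductPackSizes (
    ProductName VARCHAR(50) NOT NULL,
    PackSize    VARCHAR(20) NOT NULL,
    PRIMARY KEY (ProductName, PackSize)
);

CREATE TABLE ProductSuppliers (
    ProductName  VARCHAR(50) NOT NULL,
    SupplierName VARCHAR(50) NOT NULL,
    PRIMARY KEY (ProductName, SupplierName)
);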
Fifth Normal Form
Fifth normal form addresses the extremely rare case of join dependencies. A join
dependency expresses the cyclical constraint “if Entity1 is linked to Entity2, and
Entity2 is linked to Entity3, and Entity3 is linked back to Entity1, then all three
entities must necessarily coexist in the same tuple.”
To translate this into something resembling English, it would mean that if
{Supplier} supplies {Product}, and {Customer} ordered {Product}, and
{Supplier} supplied something to {Customer}, then {Supplier} supplied
{Product} to {Customer}. Now, in the real world this is not a valid deduction.
{Supplier} could have supplied anything to {Customer}, not necessarily
{Product}. A join dependency exists only if there is an additional constraint that
states that the deduction is valid.
It is not sufficient, in this situation, to use a single relation with the attributes
{Supplier, Product, Customer} because of the resulting update problems. Given
the relationship shown in Figure 2-27, for example, inserting the tuple {“Ma
Maison”, “Aniseed Syrup”, “Berglunds snabbköp”} requires the insertion of a
second tuple, {“Exotic Liquids”, “Aniseed Syrup”, “Berglunds snabbköp”}, since
a new link, “Aniseed Syrup” to “Berglunds snabbköp” has been added to the
model.
Decomposing the relation into three distinct relations (SupplierProduct,
ProductCustomer, and SupplierCustomer) eliminates this problem but causes
problems of its own; in re-creating the original relation, all three relations must
be joined. Interim joins of only two relations will result in invalid information.
From a system designer’s point of view, this is a terrifying situation, since there’s
no intrinsic method for enforcing the three-table join except through security
restrictions. Further, if a user should create an interim result set, the results will
seem perfectly reasonable, and it’s unlikely that the user will be able to detect the
error by inspection.
Fortunately, this cyclical join dependency is so rare that the complications can be
safely ignored in most situations. Where they can’t, your only resource is to build
the database system in such a way as to ensure the integrity of the multirelation
join.
Data Integrity
Data Integrity
Creating a model of the entities in the problem space and the relationships
between them is only part of the data modeling process. You must also capture
the rules that the database system will use to ensure that the actual physical data
stored in it is, if not correct, at least plausible. In other words, you must model
the data integrity.
It’s important to understand that the chances of being able to guarantee the
literal correctness of the data are diminishingly small. Take, for example, an
order record showing that Mary Smith purchased 17 hacksaws on July 15, 1999.
The database system can ensure that Mary Smith is a customer known to the
system, that the company does indeed sell hacksaws, and that it was taking
orders on July 15, 1999. It can even check that Mary Smith has sufficient credit
to pay for the 17
hacksaws. What it can’t do is verify that Ms. Smith actually ordered 17
hacksaws and not 7 or 1, or 17 screwdrivers instead. The best the system might
do is notice that 17 is rather a lot of hacksaws for an individual to purchase and
notify the user to that effect. Even having the system do this much is likely to be
expensive to implement, probably more expensive than its value warrants.
My point is that the system can never verify that Mary Smith did place the order
as it’s recorded; it can verify only that she could have done so. Of
course, that’s all any record-keeping system can do, and a well-designed
database system can certainly do a better job than the average manual system—if
for no other reason than its consistency in applying the rules. But no database
system, and no database system designer, can guarantee that the data in the
database is true, only that it could be true. It does this by ensuring that the data
complies with the integrity constraints that have been defined for it.
Integrity Constraints
Some people refer to integrity constraints as business rules. However, business
rules is a much broader concept; it includes all of the constraints on the system
rather than just the constraints concerning the integrity of the data. In particular,
system security—that is, the definition of which users can do what and under
what circumstances they can do it—is part of system administration, not data
integrity. But certainly security is a business requirement and will constitute one
or more business rules. We’ll look at database security issues in Chapter 8 when
we discuss system administration.
Data integrity is implemented at several levels of granularity. Domain, transition,
and entity constraints define the rules for maintaining the integrity of the
individual relations. Referential integrity constraints ensure that necessary
relationships between relations are maintained. Database integrity constraints
govern the database as a whole, and transaction integrity constraints control the
way data is manipulated either within a single database or between multiple
databases.
Domain Integrity
As we discussed in Chapter 1, a domain is the set of all possible values for a
given attribute. A domain integrity constraint—usually just called a domain
constraint—is a rule that defines these legal values. It might, of course, be
necessary to define more than one domain constraint to describe a domain
completely.
A domain isn’t the same thing as a data type, and defining domains in terms of
physical data types can backfire. The danger is that you will unnecessarily
constrain the values—for example, by choosing an integer because you think it
will be big enough rather than because 255 is the largest permitted value for the
domain.
That being said, however, data types can be a convenient shorthand in the data
model, and for this reason choosing a logical data type is often the first step in
determining the domain constraints in a system. By logical data type, I mean
“date,” “string,” or “image,” nothing more specific than that. Dates are probably
the best example of the benefits of this approach. I recommend against defining
the domain TransactionDate as DateTime, which is a physical representation.
However, defining it as “a date” allows you to concentrate on it being “a date
between the commencement of business and the present date, inclusive” and
ignore all those rather tedious rules about leap years.
Having chosen a logical data type, it might be appropriate to define the scale and
precision of a numeric type, or the maximum length of string values. This is very
close to specifying a physical data type, but you should still be working at the
logical level. Obviously, you will not be hit by lightning if your particular
shorthand for “a string value of no more than 30 characters” is char(30). But the
more abstract you keep the description in the data model, the more room you’ll
have to maneuver later and the less likely you’ll be to impose accidental
constraints on the system.
The next aspect of domain integrity to consider is whether a domain is permitted
to contain unknown or nonexistent values. The handling of these values is
contentious, and we’ll be discussing them repeatedly as we examine various
aspects of database system design. For now, it’s necessary to understand only
that there is a difference between an unknown value and a nonexistent value, and
that it is often (although not always) possible to specify whether either or both of
these is permitted for the domain.
The first point here, that “unknown” and “nonexistent” are different, doesn’t
present too many problems at the logical level. (And please remember, always,
that a data model is a logical construct.) My father does not have a middle name;
I do not know my next-door neighbor’s. These are quite different issues. Some
implementation issues need not yet concern us, but the logical distinction is quite
straightforward.
The second point is that, having determined whether a domain is allowed to
include unknown or nonexistent values, you’ll need to decide whether either of
these values can be accepted by the system. To return to our TransactionDate
example, it’s certainly possible for the date of a transaction to be unknown, but if
it occurred at all it occurred at some fixed point in time and therefore cannot be
nonexistent. In other words, there must be a transaction date; we just might not
know it.
Now, obviously, we can be ignorant of anything, so any value can be unknown.
That’s not a useful distinction. What we’re actually defining here is not so much
whether a value can be unknown as whether an unknown value should be stored.
It might be that it’s not worth storing data unless the value is known, or it might
be that we can’t identify an entity without knowing its value. In either case, you
would prevent a record containing an unknown value in the specified field from
being added to the database.
This decision can’t always be made at the domain level, but it’s always worth
considering since doing so can make the job a little easier down the line. To
some extent, your decision depends on how generic your domains are. As an
example, say that you have defined a Name domain and declared the attributes
GivenName, MiddleName, Surname, and CompanyName against it. You might
just as well have defined these attributes as separate domains, but there are some
advantages to using the more general domain definition since doing so allows
you to capture the overlapping rules (and in this case, there are probably a lot of
them) in a single place. However, in this case you won’t be able to determine
whether empty or unknown values are acceptable at the domain level; you will
have to define these properties at the entity level.
The final aspect of domain integrity is that you’ll want to define the set of values
represented by a domain as specifically as possible. Our TransactionDate
domain, for example, isn’t just the set of all dates; it’s the set of dates from the
day the company began trading until the current date. It might be further
restricted to eliminate Sundays, public holidays, and any other days on which the
company does not trade.
Sometimes you’ll be able to simply list the domain values. The domain of
Weekends is completely described by the set {“Saturday”, “Sunday”}.
Sometimes it will be easier to list one or more rules for determining
membership, as we did for TransactionDate. Both techniques are perfectly
acceptable, although a specific design methodology might dictate a particular
method of documenting constraints. The important thing is that the constraints
be captured as carefully and completely as possible.
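When the time comes to implement, a domain constraint often ends up as a CHECK constraint. A SQL Server-flavored sketch for TransactionDate, where '19950101' merely stands in for the commencement of business:

CREATE TABLE Transactions (
    TransactionID   INT      NOT NULL PRIMARY KEY,
    TransactionDate DATETIME NOT NULL,
    CONSTRAINT CK_TransactionDate
        CHECK (TransactionDate >= '19950101'
           AND TransactionDate <= GETDATE())
);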
Transition Integrity
Transition integrity constraints define the states through which a tuple can
validly pass. The State-Transition diagram in Figure 4-1, for example, shows the
states through which an order can pass.
You would use transitional integrity constraints, for instance, to ensure that the
status of a given order never changed from “Entered” to “Completed” without
passing through the interim states, or to prevent a canceled order from changing
status at all.
The status of an entity is usually controlled by a single attribute. In this case,
transition integrity can be considered a special type of domain integrity.
Sometimes, however, the valid transitions are controlled by multiple attributes or
even multiple relations. Because transition constraints can exist at any level of
granularity, it’s useful to consider them a separate type of constraint during
preparation of the data model.
For example, the status of a customer might only be permitted to change from
“Normal” to “Preferred” if the customer’s credit limit is above a specified value
and he or she has been doing business with the company for at least a year. The
credit limit requirement would most likely be controlled by an attribute of the
Customers relation, but the length of time the customer has been doing business
with the company might not be explicitly stored anywhere. It might be necessary
to calculate the value based on the oldest record for the customer in the Orders
relation.
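At implementation time, a simple transition constraint governed by a single status attribute might look like this trigger (a SQL Server-flavored sketch; the Orders table, OrderStatus column, and status labels are assumptions based on the rules just described):

CREATE TRIGGER trg_OrderStatusTransition ON Orders
AFTER UPDATE
AS
IF EXISTS (
    SELECT *
    FROM inserted AS i
    INNER JOIN deleted AS d ON d.OrderID = i.OrderID
    WHERE (d.OrderStatus = 'Canceled' AND i.OrderStatus <> 'Canceled')  -- canceled orders are frozen
       OR (d.OrderStatus = 'Entered'  AND i.OrderStatus = 'Completed')  -- no skipping interim states
)
BEGIN
    RAISERROR ('Invalid order status transition.', 16, 1);
    ROLLBACK TRANSACTION;
END;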
Entity Integrity
Entity constraints ensure the integrity of the entities being modeled by the
system. At the simplest level, the existence of a primary key is an entity
constraint that enforces the rule “every entity must be uniquely identifiable.”
In a sense, this is the entity integrity constraint; all others are technically entity-level integrity constraints. The constraints defined at the entity level can govern
a single attribute, multiple attributes, or the relation as a whole.
The integrity of an individual attribute is modeled first and foremost by defining
the attribute against a specific domain. An attribute within a relation inherits the
integrity constraints defined for its domain. At the entity level, these inherited
constraints can properly be made more rigorous but not relaxed. Another way of
thinking about this is that the entity constraint can specify a subset of the domain
constraints but not a superset. For example, an OrderDate attribute defined
against the TransactionDate domain might specify that the date must be in the
current year, whereas the TransactionDate domain allows any date between the
date business commenced and the current date. An entity constraint should not,
however, allow OrderDate to contain dates in the future, since the attribute’s
domain prohibits these.
Similarly, a CompanyName attribute defined against the Name domain might
prohibit empty values, even though the Name domain permits them. Again, this
is a narrower, more rigorous definition of permissible values than that specified
in the domain.
NOTE
Designers often specify the validity of empty and unknown values at
the entity level rather than the domain level. In fact, some designers
would argue that these constraints apply only at the entity level. There
is some justification for this position, but I recommend making the
domain definition as complete as possible. Certainly, considering empty
and unknown values at the domain level does no harm and can make
the process of specification (and implementation) simpler.
In addition to narrowing the range of values for a single attribute, entity
constraints can also affect multiple attributes. A requirement that ShippingDate
be on or after OrderDate is an example of such a constraint. Entity constraints
can’t reference other relations, however. It wouldn’t be appropriate, for example,
to define an entity constraint that sets a customer DiscountRate (an attribute of
the Customer relation) based on the customer’s TotalSales (which is based on
multiple records in the OrderItems relation). Constraints that depend on multiple
relations are database-level constraints; we’ll discuss them later in this chapter.
Be careful of multiple-attribute constraints; they might indicate that your data
model isn’t fully normalized. If you are restricting or calculating the value of one
attribute based on another, you’re probably OK. An entity constraint that says
“Status is not allowed to be ‘Preferred’ unless the Customer record is at least one
year old” would be fine. But if the value of one attribute determines the value of
another—for example, “If the customer record is older than one year, then Status
= ‘Preferred’“—then you have a functional dependency and you’re in violation of
third normal form.
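The ShippingDate example can be captured declaratively as a table-level constraint (a sketch; the Orders columns are assumptions):

ALTER TABLE Orders
    ADD CONSTRAINT CK_Orders_ShipAfterOrder
        CHECK (ShippingDate >= OrderDate);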
Referential Integrity
In Chapter 3, we looked at the decomposition of relations to minimize
redundancy and at foreign keys to implement the links between relations. If
these links are ever broken, the system will be unreliable at best and unusable at
worst. Referential integrity constraints maintain and protect these links.
There is really only one referential integrity constraint: foreign keys cannot
become orphans. In other words, no record in the foreign table can contain a
foreign key that doesn’t match a record in the primary table. Tuples that contain
foreign keys that don’t have a corresponding candidate key in the primary
relation are called orphan entities. There are three ways orphan entities can be
created:
- A tuple is added to the foreign table with a key that does not match a
candidate key in the primary table.
- The candidate key in the primary table is changed.
- The referenced record in the primary table is deleted.
All three of these cases must be handled if the integrity of a relationship is to be
maintained. The first case, the addition of an unmatched foreign key, is usually
simply prohibited. But note that unknown and nonexistent values don’t count. If
the relationship is declared as optional, any number of unknown and nonexistent
values can be entered without compromising referential integrity.
The second cause of orphaned entities—changing the candidate key value in the
referenced table—shouldn’t occur very often. In fact, I would strongly
recommend that changes to candidate keys be prohibited altogether wherever
possible. (This would be an entity constraint, by the way: “Candidate keys are
not allowed to change.”) But if your model does allow candidate keys to be
changed, you must ensure that these changes are made in the foreign keys as
well. This is known as a cascading update. Both Microsoft Jet and Microsoft
SQL Server provide mechanisms for easily implementing cascading updates.
The final cause of orphan foreign keys is the deletion of the tuple containing the
primary entity. If one deletes a Customer record, for example, what becomes of
that customer’s orders? As with candidate key changes, you can simply prohibit
the deletion of tuples in the primary relation if they are referenced in a foreign
relation. This is certainly the cleanest solution if it is a reasonable restriction for
your system. For when it’s not, both the Jet database engine and SQL Server
provide a simple means of cascading the operation, known as a cascading delete.
But in this case, you also have a third option that’s a little harder to implement.
At any rate, it can’t be implemented automatically. You might want to reassign
the dependent records. This isn’t often appropriate, but it is sometimes necessary.
Say, for example, that CustomerA purchases CustomerB. It might make sense to
delete CustomerB and reassign all of CustomerB’s orders to CustomerA.
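If reassignment is the chosen policy, it must be carried out explicitly before the primary record is deleted, typically inside a single transaction. A sketch, assuming hypothetical Customers and Orders tables and customer IDs 'A' and 'B':

BEGIN TRANSACTION;
-- Re-point CustomerB's orders at CustomerA, then remove CustomerB.
UPDATE Orders SET CustomerID = 'A' WHERE CustomerID = 'B';
DELETE FROM Customers WHERE CustomerID = 'B';
COMMIT TRANSACTION;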
A special kind of referential integrity constraint is the maximum cardinality issue
discussed in Chapter 3. In the data model, rules such as “Managers are allowed
to have a maximum of five individuals reporting to them” are defined as
referential constraints.
Database Integrity

The most general form of integrity constraint is the database constraint. Database constraints reference more than one relation: "A Customer is not allowed to have a status of 'Preferred' unless he or she has made a purchase in the last 12 months." The majority of database constraints take this form.

It's always a good idea to define integrity constraints as completely as possible, and database integrity is no exception. You must be careful, however, not to confuse a database constraint with the specification of a work process. A work process is something that is done with the database, such as adding an order, whereas an integrity constraint is a rule about the contents of the database. The rules that define the tasks that are performed using the database are work process constraints, not database constraints. Work processes, as we'll see in Chapter 8, can have a major impact on the data model, but they shouldn't be made a part of it.

It isn't always clear whether a given business rule is an integrity constraint or a work process (or something else entirely), and the difference might not be desperately important. All else being equal, implement the rule where it's most convenient to do so. If it's a straightforward process to express a rule as a database constraint, do so. If that gets tricky (as it often can, even when the rule is clearly an integrity constraint), move it to the front-end processing, where it can be implemented procedurally. On the other hand, if the rule is extremely volatile and subject to frequent change, it will probably be easier to maintain if it's part of the database schema, where a single change will affect (but hopefully not break) all the systems referencing it.
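Because a rule like the Preferred-status constraint spans two relations, it cannot be expressed as a column or table CHECK constraint; in SQL Server it would typically be enforced procedurally. A sketch of one possible trigger, assuming hypothetical Customers and Orders tables with CustomerID, Status, and OrderDate columns:

CREATE TRIGGER trg_Customers_Preferred
ON Customers
FOR INSERT, UPDATE
AS
IF EXISTS (SELECT *
           FROM   inserted i
           WHERE  i.Status = 'Preferred'
             AND  NOT EXISTS (SELECT *
                              FROM   Orders o
                              WHERE  o.CustomerID = i.CustomerID
                                AND  o.OrderDate >= DATEADD(month, -12, GETDATE())))
BEGIN
    -- Reject the change: the database constraint would otherwise be violated.
    RAISERROR ('Preferred status requires a purchase in the last 12 months.', 16, 1)
    ROLLBACK TRANSACTION
END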
Transaction Integrity
The final form of database integrity is transaction integrity. Transaction integrity
constraints govern the ways in which the database can be manipulated. Unlike
other constraints, transaction constraints are procedural and thus are not part of
the data model per se.
Transactions are closely related to work processes. The concepts are, in fact,
orthogonal, inasmuch as a given work process might consist of one or more
transactions and vice versa. It isn’t quite correct, but it’s useful to think of a work
process as an abstract construct (“add an order”) and a transaction as a physical
one (“update the OrderDetail table”).
A transaction is usually defined as a “logical unit of work,” which I’ve always
found to be a particularly unhelpful bit of rhetoric. Essentially, a transaction is a
group of actions, all of which (or none of which) must be completed. The
database must comply with all of the defined integrity constraints before the
transaction commences and after it’s completed, but might be temporarily in
violation of one or more constraints during the transaction.
The classic example of a transaction is the transfer of money from one bank
account to another. If funds are debited from Account A but the system fails to
credit them to Account B, money has been lost. Clearly, if the second command
fails, the first must be undone. In database parlance, it must be “rolled back.”
Transactions can involve multiple records, multiple relations, and even multiple
databases. To be precise, all operations against a database are transactions. Even
updating a single existing record is a transaction. Fortunately, these low-level
transactions are performed transparently by the database engine, and you can
generally ignore this level of detail.
Both the Jet database engine and SQL Server provide a means of maintaining
transactional integrity by way of the BEGIN TRANSACTION, COMMIT
TRANSACTION, and ROLLBACK TRANSACTION statements. As might be
expected, SQL Server’s implementation is more robust and better able to recover
from hardware failure as well as certain kinds of software failure. However,
these are implementation issues and are outside the scope of this book. What is
important from a design point of view is to capture and specify transaction
dependencies, those infamous “logical units of work.”
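As a sketch of what such a logical unit of work looks like when implemented, here is the account-transfer example in Transact-SQL; the Accounts table is hypothetical, and the TRY/CATCH error handling assumes a reasonably recent version of SQL Server:

BEGIN TRY
    BEGIN TRANSACTION;
    -- Both updates succeed together or not at all.
    UPDATE Accounts SET Balance = Balance - 100 WHERE AccountID = 'A';
    UPDATE Accounts SET Balance = Balance + 100 WHERE AccountID = 'B';
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- If either update fails, neither change persists.
    ROLLBACK TRANSACTION;
END CATCH;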
Implementing Data Integrity
Up until now, we’ve concentrated on capturing the problem space at an abstract
level in the conceptual data model. In this section, we’ll look at a few of the
issues involved in creating the physical model of the problem space: the database
schema. Moving from one level to another is primarily a change in terminology
—relations become tables and attributes become fields—except for issues of
data integrity. These never map quite as cleanly as one wants.
Unknown and Nonexistent Values (Again)
Earlier in this chapter, I somewhat blithely declared that domains and attributes
should be examined to determine whether they are permitted to be empty or
unknown without considering how these constraints might be implemented. The
implementation issue (and it is an issue) can’t be avoided once we turn to the
database schema.
The soi-disant “missing information problem” has been acknowledged since the
relational model was first proposed. How does one indicate that any given bit of
information is either missing (the customer does have a surname, we just don’t
know what it is) or nonexistent (the customer doesn’t have a middle name)?
Most relational databases, including Microsoft Jet databases and SQL Server
databases, have incorporated the null as a way of handling missing and
nonexistent values.
To call the null a solution to the issue is probably excessive, as it has numerous
problems. Some database experts reject nulls entirely. C. J. Date declares that
they “wreck the model,” and I’ve lost track of how many times I’ve heard them
declared “evil.” Any remarks about the complexity of handling nulls or rueful
admissions to having been caught out by them will result in remarks along the
lines of “Good. You shouldn’t be using them. They should hurt.”
As an alternative, the “nulls are evil” school recommends the use of specific
values of the appropriate domain to indicate unknown or nonexistent values, or
both. I think of this as the conventional value approach. The conventional value
approach has several problems. First, in many instances the chosen value is only
conventional. A date of 9/9/1900 doesn’t really mean the date is unknown; we
just agree that’s what we’ll interpret it to mean. I fail to see that this approach is
an improvement over the null. A null is a conventional value as well, of course,
but it can’t be confused with anything else, and it has the advantage of being
supported by the relational model and most relational database engines.
The second, and to my mind disqualifying, problem with the conventional value
approach is its impact on referential integrity. Take, for example, an optional
relationship between a Customer and a Customer Service Representative (CSR),
such that the CSR, if one is assigned, must be listed in the CSR table. The
conventional value approach requires that a record be added to the CSR table to
match the conventional value chosen to indicate that no CSR is assigned, as
shown in Figure 4-2.
Now, how many CSRs does the company employ? One less than the number of
CSRs listed in the table, since one of them is a dummy record. Oops. What’s the
average number of customers per CSR? The number of records in the Customer
table minus the number of records that match the “UNASSIGNED” CSR,
divided by one less than the number of records in the CSR table. Double oops.
Conventional values are useful, however, when you’re producing reports.
For example, you might want to substitute “Unknown” for Null values and “Not
Applicable” for empty values. This is, of course, a very different proposition
from storing these conventional values in the database, where they interfere with
data manipulation, as we’ve seen.
Evil the null might be, and ugly it most assuredly is, but it’s the best tool we have
for handling unknown and nonexistent values. Just think the issue through, find
alternatives where that’s reasonable, and allow for the difficulties of using nulls
where alternatives are not reasonable.
One of the problems with nulls is that, with the exception of domains declared to
be string or text data types, they might be forced to do double duty. A field
declared as a DateTime data type can accept only dates or nulls. If the
corresponding attribute is defined as allowing both unknown and nonexistent
values, and both are represented by null, there is no way to determine whether a
null in any specific record represents “unknown” or “nonexistent.” This problem
doesn’t arise for string or text data types, since you can use an empty, zero-length
string for the empty value, leaving null to represent the unknown value.
In practice, this problem doesn’t occur as often as one might expect. Few nontext
domains permit the nonexistent value, so in these domains a null can always be
interpreted as meaning unknown. For those domains that do accept a nonexistent
value, a sensible alternative can often be chosen to represent it. Note that I’m
recommending an actual value here, not a conventional one. For example, even
though a Product relation has a Weight attribute, a record representing a service
call, which obviously doesn’t have a weight, can use the value zero. (Zero is a good
choice to represent empty for many, but not all, numeric fields.)
The second and far more serious problem with nulls is that they complicate data
manipulation. Logical comparisons become more complex, and posing certain
kinds of questions can get a bit hairy. We’ll look at this in detail in Chapter 5.
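As a small taste of what Chapter 5 covers, the following comparisons behave differently under three-valued logic; the Customers table and MiddleName field are illustrative:

-- Returns no rows at all: MiddleName = NULL evaluates to Unknown, never True.
SELECT * FROM Customers WHERE MiddleName = NULL;

-- The IS NULL predicate must be used instead.
SELECT * FROM Customers WHERE MiddleName IS NULL;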
I don’t take nulls lightly, and when there’s a reasonable alternative, I’d
recommend taking it. But as I’ve said elsewhere, and it bears repeating, don’t
dent the data model just to make life easier for the programmers. Think it
through, but if the system requires nulls, use them.
Declarative and Procedural Integrity
Relational database engines provide integrity support in two ways: declarative
and procedural. Declarative integrity support is explicitly defined (“declared”) as
part of the database schema. Both the Jet database engine and SQL Server
provide some declarative integrity support. Declarative integrity is the preferred
method for implementing data integrity. You should use it wherever possible.
SQL Server implements procedural integrity support by way of trigger
procedures that are executed (“triggered”) when a record is either inserted,
updated, or deleted. The Jet database engine does not provide triggers or any
other form of procedural integrity. When an integrity constraint cannot be
implemented using declarative integrity it must be implemented in the front end.
We’ll be looking at the specifics of mapping the integrity constraints defined in
the data model to the physical database schema in the rest of this chapter.
Domain Integrity
SQL Server provides a limited kind of support for domains in the form of user-defined data types (UDDTs). Fields defined against a UDDT will inherit the data
type declaration as well as domain constraints defined for the UDDT.
Equally importantly, SQL Server will prohibit comparison between fields
declared against different UDDTs, even when the UDDTs in question are based
on the same system data type. For example, even though the CityName domain
and the CompanyName domain are both defined as being char(30), SQL Server
would reject the expression CityName =
CompanyName. This can be explicitly overridden by using the convert function
CityName = CONVERT(char(30), CompanyName), but it’s good that you have
to think about it before comparing fields declared against different domains since
these comparisons don’t often make sense.
UDDTs can be created either through the SQL Server Enterprise Manager or
through the system stored procedure sp_addtype. Either way, UDDTs are
initially declared with a name, a data type, and whether they are allowed to
accept nulls. Once a UDDT has been created, default values and validation
rules can be defined for it. A SQL
Server rule is a logical expression that defines the acceptable values for the
UDDT (or for a field, if it is bound to a field rather than a UDDT). A default is
simply that, a default value to be inserted by the system into a field that would
otherwise be null because the user did not provide a value.
Binding a rule or default to a UDDT is a two-step procedure. First you must
create the rule or default, and then bind it to the UDDT (or field). The “it’s not a
bug, it’s a feature” justification for this two-step procedure is that, once defined,
the rule or default can be reused elsewhere. I find this tedious since in my
experience these objects are reused only rarely. When defining a table, SQL
Server provides the ability to declare defaults and CHECK constraints directly,
as part of the table definition. (CHECK constraints are similar to rules, but more
powerful.) Unfortunately this one-step declaration is not available when
declaring UDDTs, which must use the older “create-then-bind” methodology. It
is heartily to be wished that Microsoft add support for default and CHECK
constraint declarations to UDDTs in a future release of SQL Server.
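A sketch of the create-then-bind sequence, using the legacy system procedures described above (the type, rule, and default names are illustrative; each CREATE statement must run in its own batch):

-- Create the user-defined data type.
EXEC sp_addtype CityName, 'char(30)', 'NOT NULL';

-- Create a rule and bind it to the type.
CREATE RULE CityName_Rule AS @value NOT LIKE '%[0-9]%';
EXEC sp_bindrule 'CityName_Rule', 'CityName';

-- Create a default and bind it to the type.
CREATE DEFAULT CityName_Default AS 'Unknown';
EXEC sp_binddefault 'CityName_Default', 'CityName';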
A second way of implementing a kind of deferred domain integrity is to use
lookup tables. This technique can be used in both Microsoft Jet and SQL Server.
As an example, take the domain of USStates. Now theoretically you can create a
rule listing all 50 states. In reality, this would be a painful process, particularly
with the Jet database engine, where the rule would have to be retyped for every
field declared against the domain. It’s much, much easier to create a USStates
lookup table and use referential integrity to ensure that the field values are
restricted to the values stored in the table.
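A sketch of the lookup-table approach (the tables and columns are illustrative):

-- The lookup table enumerates the domain.
CREATE TABLE USStates (
    StateCode char(2) NOT NULL PRIMARY KEY
);

-- Referential integrity restricts Customers.StateCode to the listed values.
CREATE TABLE Customers (
    CustomerID int     NOT NULL PRIMARY KEY,
    StateCode  char(2) NULL REFERENCES USStates (StateCode)
);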
Entity Integrity
In the database schema, entity constraints can govern individual fields, multiple
fields, or the table as a whole. Both the Jet database engine and SQL Server
provide mechanisms for ensuring integrity at the entity level. Not surprisingly,
SQL Server provides a richer set of capabilities, but the gap is not as great as one
might expect.
At the level of individual fields, the most fundamental integrity constraint is of
course the data type. Both the Jet database engine and SQL Server provide a rich
set of data types, as shown in the table below.
As we saw in the previous section, SQL Server also allows fields to be declared
against UDDTs. A UDDT field inherits the nullability, defaults, and rules that
were defined for the type, but these can be overridden by the field definition.
Logically the field definition should only narrow UDDT constraints, but in fact
SQL Server simply replaces the UDDT definition in the field description. It is
thus possible to allow a field to accept nulls even though the UDDT against
which it is declared does not.
Both SQL Server and the Jet database engine provide control over whether a
field is allowed to contain nulls. When defining a column in SQL Server, one
simply specifies NULL or NOT NULL or clicks the appropriate box in the
Enterprise Manager.
The Jet database engine equivalent of the null flag is the Required field. In
addition, the Jet database engine provides the AllowZeroLength flag which
determines whether empty strings (“”) are permitted in Text and Memo fields.
This constraint can be implemented in SQL Server using a CHECK constraint.
In the Jet database engine, default values are set simply by setting the appropriate
property when defining the field. In SQL Server, you can set the Default
property when creating the field or you can bind a system default to the field as
described for UDDTs. Declaring the default as part of the table definition is
certainly cleaner and the option I would generally recommend if you do not (or
cannot) declare the default at the domain level.
Finally, both the Jet database engine and SQL Server allow specific entity
constraints to be established. The Jet database engine provides two field
properties, ValidationRule and ValidationText. SQL Server allows CHECK
constraints to be declared when the field is defined or system rules to be bound
to the field afterwards. CHECK
constraints are the preferred method.
At first glance, the Jet database engine validation rules and SQL Server CHECK
constraints appear to be identical, but there are some important differences. Both
take the form of a logical expression, and neither is allowed to reference other
tables or columns. However, a Jet database engine validation rule must evaluate
to True for the value to be accepted. A SQL Server CHECK constraint must not
evaluate to False. This is a subtle point: both True and Null are acceptable values
for a CHECK
constraint; only a True value is acceptable for a validation rule.
In addition, multiple CHECK constraints can be defined for one SQL Server
field.
In fact, one rule and any number of CHECK constraints can be applied to a
single SQL
Server field, whereas a Jet database engine field has a single ValidationRule
property. The Jet database engine ValidationText property setting, by the way, is
returned to the front end as an error message. Microsoft Access displays the text
in a message box; it is available to Microsoft Visual Basic and other
programming environments as the text of the Errors collection.
Entity constraints that reference multiple fields in a single table are implemented
as table validation rules in the Jet database engine and table CHECK
constraints in SQL Server. Other than being declared in a different place, these
table-level constraints function in precisely the same way as their corresponding
field-level constraints.
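For example, a table-level CHECK constraint in SQL Server can compare two fields of the same record; the Orders table here is purely illustrative:

CREATE TABLE Orders (
    OrderID     int      NOT NULL PRIMARY KEY,
    OrderDate   datetime NOT NULL,
    ShippedDate datetime NULL,
    -- A multiple-field entity constraint: an order cannot ship before it is placed.
    CONSTRAINT CK_Orders_ShipAfterOrder
        CHECK (ShippedDate IS NULL OR ShippedDate >= OrderDate)
);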
The most fundamental entity integrity constraint is the requirement that each
instance of an entity be uniquely identifiable. Remember that this is the entity
integrity rule; all others are more properly referred to as entity-level integrity
constraints. The Jet database engine and SQL Server support uniqueness
constraints in pretty much the same way, but the support looks quite different.
Both engines implement the constraints using indices, but SQL Server hides this
from the user. Whether one explicitly creates an index (Jet database engine) or
declares a constraint (SQL Server) is largely a mechanical detail.
Both the Jet database engine and SQL Server support the definition of sets of
fields as being unique. Both also support the definition of a set of one or more
fields as being the primary key, which implies uniqueness. There can be only one
primary key for a table, although it can consist of multiple fields. There can be
any number of unique constraints.
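Declaring both kinds of uniqueness in SQL Server might look like the following sketch (the Employees table is hypothetical):

CREATE TABLE Employees (
    EmployeeID   int      NOT NULL PRIMARY KEY,  -- the single primary key
    SSN          char(11) NOT NULL UNIQUE,       -- a single-field unique constraint
    DepartmentID int      NOT NULL,
    BadgeNumber  int      NOT NULL,
    -- a multiple-field unique constraint
    CONSTRAINT UQ_Employees_Badge UNIQUE (DepartmentID, BadgeNumber)
);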
The other important difference between unique constraints and primary keys is
that unique indices can contain nulls; primary keys cannot. There are some
differences in the way the two engines treat nulls in unique indices. The Jet
database engine provides a property, IgnoreNulls, which prevents records
containing Null values in the indexed columns from being added to the index.
The records are added to the table but not included in the index. This capability
is not available in SQL Server.
In addition, SQL Server allows only a single record containing NULL in the
index.
This is logically insupportable since it treats records with NULL values as being
equal, which of course they are not. A null is not equal to anything, including
another null.
Interestingly, neither the Jet database engine nor SQL Server requires that a
primary key be defined for a table or even that it have a unique constraint. In
other words, it is possible to create tables that are not relations since tuples in
relations must be uniquely identifiable but records in tables need not be. Why
one would want to do this escapes me, but I suppose it’s nice to know that the
possibility is there if you should ever need it.
SQL Server also provides a procedural mechanism for providing entity-level
integrity that the Jet database engine does not provide. Triggers are little bits of
code (specifically, Transact-SQL code) that are automatically executed when a
specific event occurs. Multiple triggers can be defined for each INSERT,
UPDATE, or DELETE event, and a given trigger can be defined for multiple
events.
Referential Integrity
While their support for entity integrity is substantively the same, the Jet database
engine and SQL Server implement different paradigms for supporting referential
integrity. SQL Server allows foreign key constraints to be declared as part of the
table definition. A foreign key constraint establishes a reference to a candidate
key in another table, the primary table. Once the reference is established, SQL
Server prevents the creation of orphan records by rejecting any insertions that do
not have a matching record in the primary table. Nulls are not prohibited in
foreign key columns, although they can be prevented if the column participates
in the primary key of the table, which is often the case. SQL Server also
prohibits the deletion of records in the primary table if they have any matching
foreign key values.
The Jet database engine supports referential integrity through a Relation object
within the database. Microsoft’s terminology is unfortunate here—the Jet
database engine Relation object is a physical representation of the relationship
between two entities. Don’t confuse the Relation object with the logical relations
that are defined in the data model.
The simplest way of creating Relation objects is in the Access user interface
(using the Relationships command on the Tools menu), but they can also be
created in code. The Data Access Object (DAO) Relation object’s Table and
ForeignTable properties define the two tables participating in the relationship,
while the Fields collection defines the linked fields in each table.
The manner in which the Jet database engine will maintain referential integrity
for the relation is governed by the Attributes property of the relation, as shown
in the table below.
Note the attribute flags dbRelationUpdateCascade and
dbRelationDeleteCascade. If the update flag is set and a referenced field is
changed, the Jet database engine will automatically update the matching fields in
the foreign table. Similarly, the delete flag will cause the matching records to be
automatically deleted from the foreign table. SQL Server does not have
comparable automatic flags, but the cascading behavior can easily be
implemented using triggers.
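A sketch of a cascading delete implemented procedurally in Transact-SQL; the tables are hypothetical, and it assumes no conflicting declarative foreign key constraint rejects the delete first (current SQL Server versions can instead declare ON DELETE CASCADE on the foreign key):

CREATE TRIGGER trg_Customers_CascadeDelete
ON Customers
FOR DELETE
AS
-- Remove the dependent orders of every customer just deleted.
DELETE FROM Orders
WHERE  CustomerID IN (SELECT CustomerID FROM deleted)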
Other Kinds of Integrity
In the data model, we define three additional kinds of integrity: database,
transition, and transaction. Some transition constraints are simple enough to be
declared as validation rules. Most, however, as well as all database and
transaction constraints, must be implemented procedurally. For SQL Server
databases, this means using triggers. Since the Jet database engine does not
support triggers, these constraints must be implemented in the front end.
Philosophy of NoSQL databases
According to Wikipedia:
In computing, NoSQL (mostly interpreted as “not only SQL”) is a broad
class of database management systems identified by its non-adherence to the
widely used relational database management system model; that is, NoSQL
databases are not primarily built on tables, and as a result, generally do not
use SQL for data manipulation.
The NoSQL movement began in the early years of the 21st century when the world
started its deep focus on creating web-scale databases. By web-scale, I mean scale to
cater to hundreds of millions of users and now growing to billions of connected
devices including but not limited to mobiles, smartphones, internet TV, in-car
devices, and many more.
Although Wikipedia treats it as “not only SQL”, NoSQL originally started off as a
simple combination of two words—No and SQL—clearly and completely visible in
the new term. No acronym. What it literally means is, “I do not want to use SQL”.
To elaborate, “I want to access a database without using any SQL syntax”. Why? We
shall explore this in a while.
Whatever the root phrase, NoSQL today is the term used to refer to the class
of databases that do not follow relational database management system (RDBMS)
principles—specifically their ACID nature—and are specifically designed to
handle the speed and scale of the likes of Google, Facebook, Yahoo, Twitter, and
many more.
What NoSQL is and what it is not
Now that we have a fair idea of how this side of the world evolved, let us examine
what NoSQL is and what it is not.
NoSQL is a generic term used to refer to any data store that does not follow the
traditional RDBMS model—specifically, the data is non-relational and it does not use
SQL as the query language. It is used to refer to the databases that attempt to solve
the problems of scalability and availability at the expense of atomicity or consistency.
NoSQL is not a database. It is not even a type of database. In fact, it is a term used to
filter out (read: reject) a set of databases from the ecosystem. There are several distinct
family trees available. In Chapter 4, Advantages and Drawbacks, we explore various types
of data models (or simply, database types) available under this umbrella.
Traditional RDBMS applications have focused on ACID transactions:
• Atomicity: Either everything in a transaction succeeds or the entire transaction is rolled back.
• Consistency: A transaction cannot leave the database in an inconsistent state.
• Isolation: One transaction cannot interfere with another.
• Durability: A completed transaction persists, even after applications restart.
However indispensable these qualities may seem, they are quite incompatible
with availability and performance in web-scale applications. For example, if a
company like Amazon were to use a system like this, imagine how slow it would be.
If I proceed to buy a book and a transaction is in progress, it will lock a part of the
database, specifically the inventory, and every other person in the world will have
to wait until I complete my transaction. This just doesn’t work!
Amazon may use cached data or even unlocked records resulting in inconsistency.
In an extreme case, you and I may end up buying the last copy of a book in the store
with one of us finally receiving an apology mail. (Well, Amazon definitely has a
much better system than this).
The point I am trying to make here is, we may have to look beyond ACID to
something called BASE, coined by Eric Brewer:
• Basic availability: Each request is guaranteed a response—successful or
failed execution.
• Soft state: The state of the system may change over time, at times without
any input (for eventual consistency).
• Eventual consistency: The database may be momentarily inconsistent but
will be consistent eventually.
Eric Brewer also noted that it is impossible for a distributed computer system to
simultaneously provide all three of consistency, availability, and partition tolerance.
This is more commonly referred to as the CAP theorem.
Note, however, that in cases like stock exchanges or banking, where transactions
are critical, cached or stale data will just not work. So, NoSQL is definitely not a
solution to all database-related problems.
Why NoSQL?
Looking at what we have explored so far, does it mean that we should look at
NoSQL only when we start reaching the problems of scale? No.
NoSQL databases have a lot more to offer than just solving the problems of scale,
as described in the following list:
• Schemaless data representation: Almost all NoSQL implementations offer
schemaless data representation. This means that you don’t have to think too
far ahead to define a structure and you can continue to evolve over time—
including adding new fields or even nesting the data, for example, in case of
JSON representation.
• Development time: I have heard stories about reduced development
time because one doesn’t have to deal with complex SQL queries. Do you
remember the JOIN query that you wrote to collate the data across multiple
tables to create your final view?
• Speed: Even with the small amount of data that you have, if you can deliver
in milliseconds rather than hundreds of milliseconds—especially over
mobile and other intermittently connected devices—you have much higher
probability of winning users over.
• Plan ahead for scalability: You read it right. Why fall into the ditch and
then try to get out of it? Why not just plan ahead so that you never fall into
one? In other words, your application can be quite elastic—it can handle
sudden spikes of load. Of course, you win users over straightaway.
SQL vs NoSQL
RDBMS approach
The traditional approach—using RDBMS—takes the following route:
• Identify actors: The first step in the traditional approach is to identify
various actors in the application. The actors can be internal or external to
the application.
• Define models: Once the actors are identified, the next step is to create
models. Typically, there is a many-to-one mapping between actors and
models; that is, one model may represent multiple actors.
• Define entities: Once the models and the object-relationships—by way of
inheritance and encapsulation—are defined, the next step is to define the
database entities. This requires defining the tables, columns, and column
types. Special care has to be taken here: databases allow null values for any
column type, whereas programming languages may not; database size constraints
may differ from what is really required or what the language allows; and so on.
• Define relationships: One of the more important steps is to properly
define the relationships between the entities. The only way to define
relationships across tables is by using foreign keys. The entity relationships
correspond to inheritance, one-to-one, one-to-many, many-to-many, and
other object relationships.
• Program database and application: Once these are ready, engineers program
the database in PL/SQL (for most databases) or PL/pgSQL (for PostgreSQL)
while software engineers develop the application.
• Iterate: Engineers may provide feedback to the architects and designers
about the existing limitations and required enhancements in the models,
entities, and relationships.
Challenges
The aforementioned approach sounds great; however, it has a set of challenges. Let us
explore some of the possibilities that ACME Foods has or may encounter in future:
• The technical team faces churn, and key people maintaining the
database—its schema, programmability, business continuity process (a.k.a.
availability), and other aspects—leave. The company has a new engineering
team that, irrespective of its expertise, has to quickly ramp up on the existing
entities, relationships, and code it must maintain.
• The company wishes to expand their web presence and enable online orders.
This requires either creating new user-related entities or enhancing the
current entities.
• The company acquires another company and now needs to integrate the two
database systems. This means refining models and entities. Critically, the
database table relationships have to be carefully redefined.
• The company grows big and has to handle hundreds of millions of queries a
day across the country. More so, it receives a few million orders. To scale, it
has tied up with thousands of suppliers across locations and must provide
a way to integrate the systems.
• The company ties up with a few or several customer-facing companies and
intends to supply services to them to increase their sales. For this, it must
integrate with multiple systems and also ensure that its application is able
to scale up to the combined needs of these companies, especially when
multiple simultaneous orders are received against a depleting inventory.
• The company plans to provide API integration for aggregators to retrieve
and process their data. More importantly, it must ensure that the API is
forward compatible, meaning that if it later changes its internal database
schema for whatever reason, the change should have minimal—if any—
impact on the externally facing API and the schema for data exchange.
• The company plans to leverage social networking sites, such as Facebook,
Twitter, and FourSquare. For this, it seeks not only to use the simple widgets
provided but also to gather, monitor, and analyze the resulting statistics.
The preceding functional requirements can be translated into the following technical
requirements as far as the database is concerned:
• Schema flexibility: This will be needed during future enhancements and
integration with external applications—outbound or inbound. RDBMS are
quite inflexible in their design.
More often than not, adding a column is an absolute no-no, especially if the
table already has data. The reason lies in the requirement of a default value
for the new column: every existing row must be given that default value, so
the engine has to scan through the records and update them, even if the
process can be automated. It may not always be complex, but it is frowned
upon, especially when the number of rows—or the number of columns to
add—is large. You often end up creating new tables instead and increasing
complexity by introducing relationships across the tables. (A short SQL
sketch of this situation appears after this list.)
• Complex queries: Traditionally, the tables are designed normalized,
which means that the developers end up writing complex, so-called JOIN
queries that are not only difficult to implement and maintain but also take
substantial database resources to execute.
• Data update: Updating data across tables is probably one of the more
complex scenarios, especially if the updates are to be part of a single
transaction. Note that keeping a transaction open for a long duration
hampers performance.
One also has to plan for propagating the updates to multiple nodes across
the system. And if the system does not support multiple masters or writing
to multiple nodes simultaneously, there is a risk of node-failure and the
entire application moving to read-only mode.
• Scalability: More often than not, the only scalability that may be required is
for read operations. However, several factors impact this speed as operations
grow. Some of the key questions to ask are:
° What is the time taken to synchronize the data across physical
database instances?
° What is the time taken to synchronize the data across datacenters?
° What is the bandwidth requirement to synchronize data? Is the data
exchanged optimized?
° What is the latency when any update is synchronized across servers?
Typically, the records will be locked during an update.
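As promised under the schema flexibility point above, here is a minimal SQL sketch of the column-addition scenario; the table and column names are hypothetical, and the cost described in the comment varies by engine and version:

ALTER TABLE Orders
    ADD DeliveryNotes varchar(200) NOT NULL DEFAULT 'N/A';
-- Depending on the engine and version, back-filling the default for millions of
-- existing rows can rewrite the table and lock it for the duration.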
NoSQL approach
NoSQL-based solutions provide answers to most of the challenges that we put
forth. Note that if ACME Grocery is very confident that it will not evolve in the ways
we discussed earlier, we do not need NoSQL. If ACME Grocery does not intend to grow,
integrate, or provide integration with other applications, surely, the RDBMS will
suffice. But that is not how anyone would like the business to work in the long term.
So, at some point in time, sooner or later, these questions will arise.
Let us see what NoSQL has to offer against each technical question that we have:
• Schema flexibility: Column-oriented databases (http://en.wikipedia.
org/wiki/Column-oriented_DBMS) store data as columns as opposed to
rows in RDBMS. This allows flexibility of adding one or more columns as
required, on the fly. Similarly, document stores that allow storing semi-structured data are also good options.
• Complex queries: NoSQL databases do not have support for relationships
or foreign keys. There are no complex queries. There are no JOIN statements.
Is that a drawback? How does one query across tables?
It is a functional drawback, definitely. To query across tables, multiple
queries must be executed. A database is a shared resource, used across
application servers, and must be released from use as quickly as possible.
The options involve a combination of simplifying the queries to be executed,
caching data, and performing complex operations in the application tier.
A lot of databases provide in-built entity-level caching. This means that as
and when a record is accessed, it may be automatically cached transparently
by the database. The cache may be in-memory distributed cache for
performance and scale.
• Data update: Data update and synchronization across physical instances are
difficult engineering problems to solve.
Synchronization across nodes within a datacenter has a different set of
requirements as compared to synchronizing across multiple datacenters. One
would want the latency to be within a couple of milliseconds, or tens of milliseconds
at best. NoSQL solutions offer great synchronization options.
MongoDB (http://www.mongodb.org/display/DOCS/Sharding+Introduction),
for example, allows concurrent updates across nodes
(http://www.mongodb.org/display/DOCS/How+does+concurrency+work),
synchronization with conflict resolution, and eventual consistency across the
datacenters within an acceptable time that typically runs to a few milliseconds.
As such, MongoDB has no concept of isolation.
Note that because the complexity of managing the transaction may be moved
out of the database, the application will have to do some hard work. An
example of this is performing a two-phase commit while implementing
transactions (http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/).
Do not worry or get scared. A plethora of databases offer multiversion
concurrency control (MCC) to achieve transactional consistency
(http://en.wikipedia.org/wiki/Multiversion_concurrency_control).
Surprisingly, eBay does not use transactions at all (http://www.infoq.com/
interviews/dan-pritchett-ebay-architecture). Well, as Dan Pritchett
(http://www.addsimplicity.com/), Technical Fellow at eBay
puts it, eBay.com does not use transactions. Note that PayPal does use
transactions.
• Scalability: NoSQL solutions provide greater scalability for obvious
reasons. A lot of the complexity that is required for a transaction-oriented
RDBMS does not exist in non-ACID-compliant NoSQL databases.
Interestingly, since NoSQL databases do not provide cross-table references,
no JOIN queries are possible, and one cannot write a single query to
collate data across multiple tables, one simple and logical solution is to—at
times—duplicate the data across tables. In some scenarios, embedding the
information within the primary entity—especially in one-to-one mapping
cases—may be a great idea.
Having understood the modeling options available in NoSQL along with the RDBMS
knowledge that we already had, we are now in a position to understand the pros and
cons of using NoSQL.
In this chapter, we will take three different representative application requirements
and discuss advantages and disadvantages of using NoSQL, and conclude whether
or not to use NoSQL in that specific scenario.
We will analyze by understanding the database requirements, identifying the
advantages and drawbacks, and finally coming to a conclusion on NoSQL use
as follows:
• Entity schema requirements: What is the density of relationships across
entities, whether or not the schema changes over time and, if so, at what
frequency
• Data access requirements: Should the data always be consistent or can it be
eventually consistent (say, after 10 ms or 10 days or whatever); would the
data access be more horizontal (row-wise) or vertical (column-wise)
• What NoSQL can do: What does NoSQL have to offer in the given scenario
• What NoSQL cannot do: Where NoSQL fails or leaves a lot of work to be
done at the application tier in the given scenario.
• Whether or not to use NoSQL: If NoSQL is a good choice, which data model
fits best with it.
At a broad level, I have classified the applications into the following categories:
• Relational-data driven, transaction-heavy applications
• Data-driven, computation-heavy applications
• Web-scale, data-driven applications where minor latencies are acceptable
Transactional application
This type of application has data that is highly relational in nature. Important
application characteristics are:
• The application relies on the consistency and integrity of the data
• The concurrency (usage) is relatively lower
° Lower is a relative adjective—all that we are looking for here is that
the application can be served by a single database instance without
any replication or load-balance requirements
° It may choose to have mirrors or replication for automatic failover
or otherwise but the application talks to a maximum of one
instance—may switch over in case of failover
An example of this is the application at the point-of-sales at a grocery shop or an
enterprise resource management application. The key here is data consistency.
Entity schema requirements
For a transactional application, general requirements for the entity schema include:
• Highly structured definition—properties, types, and constraints, if applicable
• Ability to define relationships—parent versus child keys
• Schema does not evolve or vary too much over time.
Data access requirements
From a data access perspective, we would have the following requirements:
• Consistency—any read should return latest updated data
• More often than not, the entire record (row) will be retrieved
• Cross-entity (table) data access may be frequent
What NoSQL can do
For a transactional application, NoSQL can help as follows:
• Column-oriented database helps define structure. If need be, it can be
changed over time.
• Document-oriented database can help implement JOIN or create views.
What NoSQL cannot do
Using NoSQL will result in a few limitations for a transactional application:
• Inability to define relationships
• Absence of transactions (Neo4j is an exception)
• Unavailability of ACID properties in operations (CouchDB and Neo4j are
the exceptions)
• Using Cassandra or CouchDB can be overkill if we compare them to, for
example, MySQL or PostgreSQL—they do not unleash their true power
in a single-machine installation
• Absence of support for JOIN and cross-entity queries; document-oriented
stores can emulate them by way of MapReduce, but the effort may be
substantial as the queries get complex
Decision
Choose RDBMS rather than NoSQL. The disadvantages outweigh the advantages.
Computational application
This type of application does a lot of computation in the application. Key
characteristics of this application type are:
• Most of the operations are along a given set of properties across the records
• The data stored may still be relational but the computation-centric data
definitely has sparse relations, if any
An example of this type of application is a survey application or a sales record
management application. The key here is that the computation is across the records
but on a subset of the properties or columns available (the subset in an extreme
case can be 100 percent of the available columns). Also, a survey application would
require ability to define custom fields and operations.
Entity schema requirements
For a computational application, general requirements on the entity schema include:
• Structured schema definition—properties, types, and constraints, if
applicable.
• Schema does not evolve or vary too much over time.
Data access requirements
From a data access perspective, we would have the following requirements:
• Partial record access.
• Vertical, that is, column-wise processing.
For example, for an entity holding the data of daily revenues and expenses at
a given location of operation, one would compute across the revenue column
and/or expense column more often than working on the entire row.
• Cross-entity data access is infrequent.
For example, to create the final balance sheet for a location, I may use the
location ID once to get the details about the location.
• Consistency—data should be consistent. However, in some cases, minor
latencies are allowed.
For example, since the reports may be generated daily rather than in real time, the user is happy working with day-old data.
What NoSQL can do
For a computational application, NoSQL can help as follows:
• Column-oriented databases would help define rigorous schema. Document
or key-value databases can still be used.
For example, JSON formats can be used to define a formal schema; note,
however, that these stores cannot enforce the schema.
• They (column-oriented, key-value as well as document stores) can provide
speed and scale while fetching partial data.
• Document stores coupled with MapReduce processing can help perform
computation right next to the data tier, thereby substantially increasing
the speed of execution. You do not want unnecessary data floating
across the network.
• Document stores can help implement JOIN or create views.
What NoSQL cannot do
Using NoSQL will result in a few limitations for a computational application:
• Defining relationships can be tricky. Except for graph databases, these
relationships must be maintained in application.
• Because relationships do not exist in the database, data can be inconsistent—
even if momentarily.
Decision
The tasks that are possible in NoSQL are also possible with an RDBMS. The only
brownie point that NoSQL gets here, which can turn out to be a big differentiator,
is the speed and scale at which data can be partitioned horizontally and fetched
property-wise.
RDBMS systems do allow the filtering of queries such as the one given in the
following:
SELECT revenue, expense FROM location_finance WHERE location_id=1234
However, internally, RDBMS systems are tuned to fetch the entire row at a time
from the underlying physical store and then apply the filter—including the column
projection specified in the SELECT statement.
On the other hand, NoSQL databases—especially column-oriented
databases—are highly optimized to retrieve partial records and can result in a
dramatic performance difference against RDBMS while dealing with hundreds of
millions or billions of records.
To conclude, it is a tough choice between RDBMS and NoSQL databases in this case.
If the application is, say, an enterprise application wherein the number of records
will be limited to around, for example, a hundred million or lower, RDBMS can
just serve it right, though NoSQL can also be used. For any size less than a million
records, NoSQL can have overheads in terms of setup and tuning; while for over a
few hundred million records, RDBMS will taper down in performance.
Web-scale application
This last application type is probably more relevant today in consumer applications,
whether they are completely web-based or mobile-native apps or a combination
of both.
Some of the key characteristics of this type of application are:
• The application should be able to scale because of the enormous volume
of content that it operates on, the sheer number of users, and the vast
geography from which the users access it, which makes a single datacenter
infeasible.
• The users of this application may be fine working with non-real-time,
relatively stale data. The staleness may range from a few tens of milliseconds
to a few days, but the latest data will definitely not be available within a
fraction of a millisecond.
• The schema may evolve over time as the application allows integration with
other applications.
• Since the data can never be completely normalized or denormalized, the
relationships will exist.
An example of this application is a web analytics application or a social
microblogging platform. The key here is the scale of operation and staleness of data.
Another set of examples includes SaaS-based enterprise-grade applications such as
CRM or ERP. One brilliant example of this is SalesForce—a SaaS application that
allows you to integrate data using your own schema.
Entity schema requirements
For a web-scale application, general requirements on the entity schema include:
• Structured schema definition
• The ability to change schema over time without affecting existing records in
any manner—in extreme case, latent schema
• Relationships may be optional at the database layer and can be pushed to
the application layer, mainly because of the low density
Data access requirements
From a data access perspective, we would have the following requirements:
• Partial record access
• Speed of operation execution—CRUD
• Inconsistent data—for a short moment—is tolerable
What NoSQL can do
For a web-scale application, NoSQL can help as follows:
• Everything already covered in the previous scenario.
Document stores would be a fantastic choice for latent schema.
• It can provide scale of operations because it does not implement ACID
operations but mostly provides BASE properties.
What NoSQL cannot do
I do not see any requirement that NoSQL cannot fulfill in this case. Note that we do
not have ACID constraints—one of the main reasons why NoSQL was invented.
Decision
Use NoSQL. The choice of the type of store (data model) can vary depending upon
the actual underlying requirement:
• In case of SaaS applications where the schema has to be flexible to
incorporate user-specific attributes, document stores are the optimal choice.
Examples of this subtype include applications such as CRM, ERP—mainly
enterprise-targeted applications where each end consumer (organization)
may have their own specific schema.
• In case of applications like e-learning or other social applications, whose
underlying schema changes and evolves at a fast pace and would need the
ability to change schema over time but still be under the tight control of the
application developer, column-oriented database is a great choice.
In the case of social applications that need the ability to integrate with other
applications, a mix of column-oriented and document-oriented stores may be
used to mitigate the risk of a complete schema overhaul when a new application
with an unforeseen data format becomes a rage and must be integrated with.
• To store relationships, graph databases may be an addendum to the
actual data store. For example, Twitter uses graph database, FlockDB
(https://github.com/twitter/flockdb), to store relationships while
it uses Cassandra for real-time analytics (http://www.slideshare.net/
kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011),
and most likely HBase for persistent store.
If there is no real reason to move away from an SQL database, do not do it. This is the main rule.
The main reasons to move to a NoSQL database are the need for huge—very huge—data storage (in other words, Big Data), scalability, and performance.
But, you can ask yourself, what is the difference between them? Why do I have to use one instead of another? The table below contains a brief explanation of the differences; more details follow in the article.

Relational databases were developed in the 1970s. They were a very convenient way to store data and satisfied the needs of that day. But today, relational databases are not able to meet the current need to store gigantic amounts of data. The NoSQL technique for storing data is a substitute that addresses today’s needs.
MongoDB is a NoSQL database with a large number of supported features. In comparison with relational databases, MongoDB is faster and easier to scale.
There are a couple of features that are not supported: JOINs and global transactions. The main reason why MongoDB doesn’t support them is the way it was designed.
Data Storage Model
Relational databases can’t live without JOINs and transactions. Normalization in SQL databases is expected to spread data across multiple tables, and JOINs help to combine them during requests. Also, transactions are the only way to be sure that data stays consistent across different tables.
MongoDB was designed to store huge data (Big Data) and to be easily scalable, highly available, and fast. JOINs and global transactions were a stumbling block for designing such a system, and the only way to get rid of them was to use another data storage model—that is, to avoid splitting data between tables during the normalization process. Key-value pairs combined into a document were chosen as the data storage model. The data is stored mainly in one collection (a set of documents) without the need to JOIN or to watch over data consistency across tables. MongoDB is called a document-oriented database.
Agile practices
Relational databases were not designed to support Agile software development or to be easily scalable; NoSQL databases, on the other hand, support both.
The modern world uses Agile practices to develop products; agile sprints with quick iterations are among the main goals. A dynamic schema is the key to Agile support in MongoDB.
Database Schema

Relational (SQL) databases require an established data schema before data can be added. This schema requirement doesn’t fit well with agile development approaches: it’s hard to predict an exact database schema at the beginning of feature development. If the schema is changed in a relational database, you must think through the migration of the data to the new schema. If the data is large, this is a very slow process that involves significant downtime, and if the data changes regularly, the downtime may also be frequent. A NoSQL database, in contrast to relational databases, doesn’t require a predefined schema—indeed, it may not require a schema at all. We can call it a schemaless database. This feature fits squarely into Agile approaches: the same collection can contain data with significantly differing structures.
Scalability
A relational database can be scaled only vertically, because the entire database has to be hosted on a single server; this is necessary in order to ensure reliability and continuous availability of the data. Vertical scaling is expensive, places limits on scale, and offers weak fault tolerance.
NoSQL databases were designed to scale horizontally. Instead of increasing the power of a single server, you just add more server instances to get the expected capacity.
It is usually more expensive and more limiting to buy more RAM or a more powerful server than to simply add one more small workstation to the cluster.
Sharding is the process of storing data records across multiple machines. It can be configured for both SQL and NoSQL databases, but sharding for SQL databases is not natively supported: it can be configured only through complex arrangements that make multiple pieces of hardware act as a single server. NoSQL databases, on the other hand, usually support auto-sharding at the native layer, meaning that they automatically spread data across shards (servers) and support balancing, query load distribution, and so on.
Query Language
Relational databases use SQL as the query language to retrieve data. It’s a very powerful query language that was designed for managing data in a relational database management system. The scope of SQL includes data insert, query, update, and delete, schema creation and modification, and data access control.
MongoDB uses a JSON-style declarative language for specifying what data to retrieve from the database. This query language covers most SQL features, and more; a mapping chart between SQL and MongoDB can be found in the MongoDB documentation.
Types of NoSQL databases
The NoSQL databases are categorized on the basis of how the data is stored. Because
of the need to provide curated information from large volumes, generally in near
real-time, NoSQL mostly follows a horizontal structure. They are optimized for
insert and retrieve operations on a large scale with built-in capabilities for replication
and clustering. Some of the functionalities of SQL databases like functions, stored
procedures, and PL may not be present in most of the databases.
Storage types
There are various storage types available in which the content can be modeled for NoSQL databases. In subsequent sections, we will explore the following storage types:
• Column-oriented
• Document store
• Key-value store
• Graph
Column-oriented databases
The column-oriented databases store data as columns, as opposed to the rows that
are prominent in an RDBMS.
A relational database shows the data as two-dimensional tables comprising rows
and columns but stores, retrieves, and processes it one row at a time, whereas a
column-oriented database stores data as columns.
Online transaction processing (OLTP) focused relational databases are row
oriented. Online analytical processing (OLAP) systems that require processing
of data need column-oriented access. Having said that, OLTP operations may also
require column-oriented access when working on a subset of columns and operating
on them.
Data access to these databases is typically done by using either a proprietary protocol
in case of commercial solutions or open standard binary (for example, Remote
Method Invocation). The transport protocol is generally binary.
Some of the databases that fall under this category include:
• Oracle RDBMS Columnar Expression
• Microsoft SQL Server 2012 Enterprise Edition
• Apache Cassandra
• HBase
• Google BigTable (available as part of Google App Engine branded Datastore)
Advantages
Most of the solutions, such as Apache Cassandra, HBase, and Google Datastore, allow columns to be added over time without having to worry about filling in default values for the existing rows. This gives flexibility in model and entity design, allowing one to account for new columns in the future to cover unforeseen scenarios and new requirements.
There are advantages when working with only a subset of the available columns. For example, computing maxima, minima, averages, and sums, particularly over large datasets, is where column-oriented data stores shine. Similarly, when new values are applied either to all rows at once or with same-column filters, these databases allow partial data access without touching unrelated columns and are much faster in execution.
Since the values in a column are of uniform type and mostly of the same length (variable-length strings being the exception), they can be stored very efficiently in terms of size. For a column whose value repeats across many rows (for example, the department of a user profile, whether a profile is public or private, or even a user's age), the identical or similar adjacent values can be compressed efficiently.
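A conceptual sketch, in plain Python and tied to no particular database, of why a per-column aggregate is cheaper in a columnar layout:

# Row-oriented: the data is held one full record at a time.
rows = [
    {"name": "Alice", "dept": "sales", "age": 30},
    {"name": "Bob",   "dept": "sales", "age": 41},
]

# Column-oriented: the data is held one column at a time;
# uniform, repetitive columns (like "dept") compress well.
columns = {
    "name": ["Alice", "Bob"],
    "dept": ["sales", "sales"],
    "age":  [30, 41],
}

# Average age: the columnar layout touches only the "age" column,
# while the row layout has to walk every full record.
avg_from_columns = sum(columns["age"]) / len(columns["age"])
avg_from_rows = sum(r["age"] for r in rows) / len(rows)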
Document store
Also referred to as a document-oriented database, a document store allows the insertion, retrieval, and manipulation of semi-structured data. Most of the databases available in this category use XML, JSON, BSON, or YAML, with data access typically over HTTP using a RESTful API, or over the Apache Thrift protocol for cross-language interoperability.
Compared to an RDBMS, the documents themselves act as records (or rows); however, the data is semi-structured rather than rigidly defined. For example, two records may have completely different sets of fields or columns, and the records may or may not adhere to a specific schema (like the table definitions in an RDBMS). For that matter, the database may not support schemas or the validation of a document against a schema at all.
Document-oriented databases provide this flexibility—dynamic or changeable
schema or even schemaless documents. Because of the limitless flexibility provided
in this model, this is one of the more popular models implemented and used.
Some of the popular databases that provide document-oriented storage include:
• MongoDB
• CouchDB
• Jackrabbit
• Lotus Notes
• Apache Cassandra
• Terrastore
• Redis
• BaseX
Advantages
The most prominent advantage, as evident in the preceding examples, is that
content is schemaless, or at best loosely defined. This is very useful in web-based
applications where there is a need for storing different types of content that may
evolve over time. For example, for a grocery store, information about the users,
inventory and orders can be stored as simple JSON or XML documents. Note that
“document store” is not the same as “blob store” where the data cannot be indexed.
Depending on the implementation, it may or may not be possible to retrieve or update a record partially; when it is possible, it is a great advantage. Note that stores based on XML, BSON, JSON, and YAML typically support this. The XML-based BaseX can be really powerful when integrating multiple systems that work with XML, given that it supports XQuery 3.0 and XSLT 2.0.
Searching across multiple entity types is far simpler than in a traditional RDBMS, or even in a column-oriented database. Because there is no concept of tables (which are essentially nothing more than schema definitions), one can query across records irrespective of their underlying content or schema; in other words, the query runs directly against the entire database. Note that these databases still allow the creation of indexes (over common fields or otherwise), and those indexes can evolve over time, as sketched below.
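For instance, a minimal sketch of adding such indexes with pymongo; the collection and field names are illustrative, not prescribed by the source:

from pymongo import ASCENDING, MongoClient

records = MongoClient()["demo_db"]["records"]

# Secondary indexes over fields that many (but not necessarily all)
# documents share; more indexes can be added as the data evolves.
records.create_index([("email", ASCENDING)])
records.create_index([("created_at", ASCENDING)])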
JSON-based stores make it easy to define what I call projections: each top-level key of the JSON object may hold the entity's projection for another part of the system, thereby allowing the schema to evolve over time while remaining backward compatible.
Key-value store
A key-value store is very closely related to a document store: it allows the storage of a value against a key. As with a document store, there is no need for a schema to be enforced on the value. However, there are a few constraints that a key-value store does enforce (http://ayende.com/blog/4459/that-no-sql-thing-documentdatabases):
• Unlike a document store, which can create a key when a new document is inserted, a key-value store requires the key to be specified
• Unlike a document store, where the value can be indexed and queried, in a key-value store the value is opaque, and as such the key must be known in order to retrieve the value
If you are familiar with the concept of maps or associative arrays (http://en.wikipedia.org/wiki/Associative_array) or have worked with hash tables (http://en.wikipedia.org/wiki/Hash_table), then you have already worked with an in-memory key-value store.
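A tiny sketch of that idea using an ordinary Python dict as the in-memory store; the key name and encoded value are invented:

store = {}

store["user:42:profile"] = b'{"name": "mongo", "age": 5}'   # put
value = store.get("user:42:profile")                        # get, by exact key
store.pop("user:42:profile", None)                          # delete

# The value is opaque to the store: there is no way to ask the dict
# "which entries have age == 5" without decoding every value yourself.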
The most prominent use of a key-value store is as an in-memory cache, distributed or otherwise. However, implementations do exist that provide persistent storage.
A few of the popular key-value stores are:
• Redis (in-memory, with dump or command-log persistence)
• Memcached (in-memory)
• MemcacheDB (built on Memcached)
• Berkeley DB
• Voldemort (open source implementation of Amazon Dynamo)
Advantages
Key-value stores are optimized for querying against keys. As such, they serve as great in-memory caches. Memcached and Redis support expiry for keys, sliding or absolute, after which the entry is evicted from the store.
At times, one can generate keys smartly (say, bucketed UUIDs) and then query against ranges of keys. For example, Redis allows retrieving a list of all the keys matching a glob-style pattern.
Though key-value stores cannot query on the values, some can still understand the type of the value. Stores like Redis support different value types: strings, hashes, lists, sets, and sorted sets. Based on the value type, advanced functionality can be provided, including atomic increments, setting or updating multiple fields of a hash (the equivalent of partially updating a document), and intersection, union, and difference operations on sets.
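A short, hedged illustration of those capabilities with the redis-py client, assuming a local Redis server; the key names are made up:

import redis

r = redis.Redis(host="localhost", port=6379)

r.set("session:abc", "user-42", ex=1800)       # absolute expiry: 30 minutes
r.incr("page:home:hits")                       # atomic increment
r.hset("user:42", mapping={"name": "Alice",    # update fields of a hash,
                           "dept": "sales"})   # a partial "document" update
r.sadd("likes:alice", "post:1", "post:2")
r.sadd("likes:bob", "post:2", "post:3")
common = r.sinter("likes:alice", "likes:bob")  # set intersection
sessions = r.keys("session:*")                 # glob-style key matching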
Graph store
Graph databases represent a special category of NoSQL databases where
relationships are represented as graphs. There can be multiple links between two
nodes in a graph—representing the multiple relationships that the two nodes share.
The relationships represented may include social relationships between people,
transport links between places, or network topologies between connected systems.
Graph databases are fairly new in the market with only a few proven solutions
out there:
• Neo4j
• FlockDB (from Twitter)
Advantages
An article found at http://www.infoq.com/articles/graph-nosql-neo4j quotes
the advantages as follows:
Graph theory has seen a great usefulness and relevance in many problems across
various domains. The most applied graph theoretic algorithms include various
types of shortest path calculations, geodesic paths, centrality measures like
PageRank, eigenvector centrality, closeness, betweenness, HITS, and many others.
Graph databases can be considered special-purpose NoSQL databases optimized for relation-heavy data. If there are no relationships among the entities, there is no use case for a graph database.
The one advantage that graph databases have is easy representation, retrieval
and manipulation of relationships between the entities in the system.
It is not uncommon to store data in a document store and relationships in a
graph database.
Multi-storage type databases
Apart from the databases that we discussed earlier, the following databases support multiple storage types, giving you flexibility:
• OrientDB: Supports document store, key-value as well as graph. The official
website is http://www.orientdb.org.
• ArangoDB: Universal database with support for document store, key-value
and graph models. Official website is http://www.arangodb.org.
• Aerospike: A very interesting database that is a hybrid between RDBMS
and NoSQL store. It supports document store, key-value, graph as well as
RDBMS. Source code can be found at https://github.com/JakSprats/
Alchemy-Database.
MongoDB: Advantages and Disadvantages
A lot of people are making the switch to MongoDB, but many have neglected to (or outright refuse to) weigh the pros and cons of such a decision. As a relatively new and unfamiliar technology, it can have catastrophic effects when implemented in a production system without due care. In light of this, I've attempted to research the topic to find out exactly why one should or shouldn't (or under which circumstances one should) use MongoDB in place of a traditional relational database.
Background Information
MongoDB is a document-based NoSQL database. Documents are JSON-style data structures composed of field-and-value pairs, for example:
{ "name": "mongo", "age": 5 }
MongoDB stores documents on disk in the BSON serialization format; BSON is a binary representation of JSON documents. Additionally, MongoDB uses a JavaScript-based shell and supports JavaScript functions for features such as map-reduce.
A document database has many advantages, but before selecting one you should verify that your data fits a document model. Documents should be self-contained pieces of data, structured much like a typical document that a person could print on paper. That point bears emphasizing; I find that data which completely fits the document model is actually quite rare.
One would likely argue that data such as a forum, with its users, threads, and posts, is clearly relational. But many people suggest that articles fit a document model better. This may sometimes be the case, but what if articles are written by users who each have attributes of their own? What if articles can belong to categories which each have metadata of their own? At what point do we acknowledge that the data is actually relational? This is a pretty difficult question to answer.
Putting aside the question of "is it relational or not" for the time being, I began to research MongoDB, why some businesses adopt it, and why others choose not to. To begin, I found this video very interesting and informative in explaining how a document database works, when it makes sense to use one, and when not to use one. It also clearly explains many of the pitfalls of using MongoDB, and the fact that many people attempt to switch to MongoDB without first analyzing whether their data fits a document model.
After a bit of research to try to find reasons why MongoDB is better than a traditional relational database, and reasons why it is worse, I constructed a pretty good ‘pros and cons’ list.
Advantages
1. Sharding and Load-Balancing:
Sharding is the process of storing data records across multiple machines and is MongoDB’s approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.
-From MongoDB
When you have extremely large amounts of data, or you need to distribute your database traffic across multiple machines for load-balancing purposes, MongoDB has significant advantages over many classic relational databases such as MySQL.
2. Speed
MongoDB queries can be much faster in some cases, especially since your data is typically all in one place and can be retrieved with a single lookup. However, this advantage only exists when your data truly is a document. When your data is essentially emulating a relational model, your code ends up performing many independent queries to retrieve a single "document" and can become much slower than a classic RDBMS.
3. Flexibility
MongoDB doesn’t require a unified data structure across all objects, so when it is not possible to ensure that your data will all be structured consistently, MongoDB can be much simpler to use than an RDBMS. However, data consistency is a good thing, so when possible you should always attempt to ensure that a unified structure will be applied.
Disadvantages
1. No Joins
MongoDB offers no support for joins as in a relational database. This means that when you need this type of functionality, you must make multiple queries and join the data manually in your code (which can lead to slow, ugly code and reduced flexibility when the structure changes).
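A hedged sketch of that manual join with pymongo; the forum database, its collections, and fields are invented for the example:

from pymongo import MongoClient

db = MongoClient()["forum"]

author_id = db.users.insert_one({"name": "Alice"}).inserted_id
db.posts.insert_one({"title": "Hello", "author_id": author_id})

# Two round trips instead of one SQL JOIN: fetch the post, then its author,
# and glue the documents together in application code.
post = db.posts.find_one({"title": "Hello"})
post["author"] = db.users.find_one({"_id": post["author_id"]})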
2. Memory Usage
MongoDB naturally tends to use more memory because it has to store the key names within each document; this is a consequence of the data structure not necessarily being consistent between objects.
Additionally, because there are no joins, you are stuck with either duplicated data or slow queries caused by performing the join in your own code. To avoid duplicate data in Mongo you can store references to objects (i.e. BSON ObjectIds), but if you find yourself doing this, it indicates that the data is actually relational and perhaps a relational database suits your needs better.
3. Concurrency Issues
When you perform a write operation in MongoDB, it creates a lock on the entire database, not just the affected entries, and not just for a particular connection. This lock blocks not only other write operations, but also read operations.
MongoDB uses a readers-writer lock that allows concurrent reads access to a database but gives exclusive access to a single write operation.
When a read lock exists, many read operations may use this lock. However, when a write lock exists, a single write operation holds the lock exclusively, and no other read or write operations may share the lock.
Locks are “writer greedy,” which means writes have preference over reads. When both a read and write are waiting for a lock, MongoDB grants the lock to the write.
-From MongoDB
4. Young Software: Inexperienced User-Base; Still Under Construction; Little Documentation
MongoDB is relatively young, particularly compared to relational databases. SQL was released in 1986, so it has been quite extensively tested and documented, and there exists quite a bit of support. MongoDB, however, has only been around since 2009, so most users are not experts and there does not exist such a large base of knowledge available on the web by comparison.
This leads to very common misunderstandings such as the one related to the “safe mode” option.
When safe mode is set to false, writing is essentially asynchronous: the driver hands the write off without waiting for acknowledgment from the server, so the data may only reach the database at some later time, or never, and you have no assurance about data integrity.
Often developers don’t understand that without safe=True, the data may never get written (e.g. in case of error), or may get written at some later time. We had many problems (such as intermittently failing tests) where developers expected to read back data they had written with safe=False.
-From ScrapingHub
Note that safe=false was the default setting before November, 2012.
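For context, a minimal sketch of requesting acknowledged writes in current pymongo, where write concerns replaced the old safe flag; the collection and document are illustrative:

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient()
events = client["demo_db"].get_collection(
    "events", write_concern=WriteConcern(w=1, j=True))

# insert_one does not return until the server acknowledges (and journals)
# the write, so reading back your own writes behaves as developers expect.
events.insert_one({"type": "signup", "user": "alice"})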
The code is still very much under development, so that makes MongoDB an interesting piece of software to play with, but when you’re trying to develop a professional project it might not be good to be a beta tester.
For example, before version 2.2, MongoDB did not yet have its aggregation framework, so in order to aggregate data you were stuck using group(). The problem with group() is that it can only handle up to 10,000 unique keys, and additionally it returns a single document, which means you cannot perform any sort of "order by".
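A small sketch of the aggregation framework that replaced group(), using pymongo; the orders collection and its fields are invented:

from pymongo import MongoClient

orders = MongoClient()["shop"]["orders"]

# Total amount per customer, sorted: the "order by" that group() could not do.
pipeline = [
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    {"$sort": {"total": -1}},
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["total"])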
Users also often make the mistake of believing that, instead of using a database along with some form of caching system, they can just use MongoDB as both their database and their cache. That is not what MongoDB was designed for: when you want to put an object into a cache, use something like Memcached; that is what it is for.
5. Transactions
MongoDB doesn't automatically treat a group of operations as a transaction. In order to ensure data integrity across creates and updates, you have to implement transaction-like behavior yourself: group the operations, verify them, and then manually commit or roll back.
Operations on a single document are always atomic with MongoDB databases; however, operations that involve multiple documents, which are often referred to as “transactions,” are not atomic. Since documents can be fairly complex and contain multiple “nested” documents, single-document atomicity provides necessary support for many practical use cases.
Thus, without precautions, success or failure of the database operation cannot be “all or nothing,” and without support for multi-document transactions it’s possible for an operation to succeed for some operations and fail with others.
-From MongoDB
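A hedged sketch of what single-document atomicity buys you in practice (pymongo; the account document is invented):

from pymongo import MongoClient

accounts = MongoClient()["bank"]["accounts"]
accounts.insert_one({"_id": "acc-1", "balance": 100, "history": []})

# The balance check, decrement and history entry all target one document,
# so they are applied atomically without any explicit transaction.
accounts.update_one(
    {"_id": "acc-1", "balance": {"$gte": 30}},
    {"$inc": {"balance": -30},
     "$push": {"history": {"op": "withdraw", "amount": 30}}})

# Moving money between two accounts touches two documents and is NOT atomic
# without additional application-level precautions.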
Conclusion
It's clear that the cons list is somewhat longer here, but that isn't to say that MongoDB should never be used or that it isn't the optimal solution in some cases. Nor does it mean that MongoDB doesn't have a lot of potential; it is still a very young database that is early in its development.
Developers need to be aware that MongoDB is not a replacement for relational databases; it is designed for a specific scenario in which it excels. And the point should also be made that one should definitely not try to force a square peg into a round hole. Don't force your data into a particular model just so you can use a particular database; you will have issues down the road if you do. Don't do things just because they're cool; do them because they make sense.
Link:
https://halls-of-valhalla.org/beta/articles/the-pros-and-cons-of-mongodb,45/
Philosophy of key-value stores
What is a Key-Value Store?
A key-value store, or key-value database, is a simple database that uses an associative array (think of a map or dictionary) as the fundamental data model, where each key is associated with one and only one value in a collection. This relationship is referred to as a key-value pair.
In each key-value pair the key is represented by an arbitrary string such as a filename, URI or hash. The value can be any kind of data like an image, user preference file or document. The value is stored as a blob requiring no upfront data modeling or schema definition.
The storage of the value as a blob removes the need to index the data to improve performance. However, you cannot filter or control what’s returned from a request based on the value because the value is opaque.
In general, key-value stores have no query language. They provide a way to store, retrieve and update data using simple get, put and delete commands; the path to retrieve data is a direct request to the object in memory or on disk. The simplicity of this model makes a key-value store fast, easy to use, scalable, portable and flexible.
Scalability and Reliability
Key-value stores scale out by implementing partitioning (storing data on more than one node), replication, and automatic recovery. They can also scale up by keeping the database in RAM and minimizing the cost of ACID guarantees (the guarantee that committed transactions persist somewhere) by avoiding locks and latches and using low-overhead server calls.
Use Cases and Implementations
Key-value stores handle size well and are good at processing a constant stream of read/write operations with low latency making them perfect for:
- Session management at high scale
- User preference and profile stores
- Product recommendations; latest items viewed on a retailer website drive future customer product recommendations
- Ad serving; customer shopping habits result in customized ads, coupons, etc. for each customer in real time
- Can effectively work as a cache for heavily accessed but rarely updated data
Key-value stores differ in their implementations: some, like Berkeley DB, FoundationDB, and MemcacheDB, support ordering of keys; some, like Redis, maintain data in memory (RAM); some, like Aerospike, are built natively to support both RAM and solid-state drives (SSDs); others, like Couchbase Server, store data in RAM but also support rotating disks. Some popular key-value stores are:
- Aerospike
- Apache Cassandra
- Berkeley DB
- Couchbase Server
- Redis
- Riak
Key-Value Store Vs Cache
Sometimes a cache is likened to a key-value store because of its ability to return a value given a specific key: a cache transparently stores a pool of read data so that future requests for that data can be served quickly, improving performance.
Data stored in a cache can be precomputed values or a copy of data stored on disk. When an application receives a request for data and the data resides in the cache (called a hit), the request can be served by reading the cache, which is fast. If, on the other hand, the requested information does not reside in the cache (called a miss), the data must be recomputed or retrieved from its original source, which results in a delay.
Caches and key-value stores do have differences. Key-value stores can be used as a database to persist data, whereas caches are used alongside a database when there is a need to increase read performance. Caches are not used to enhance write or update performance, while key-value stores are very effective there as well. And where key-value stores can be resilient to server failure, caches live in RAM and cannot provide transactional guarantees if the server crashes.
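A minimal sketch of that cache-aside pattern with redis-py, where load_from_database stands in for a hypothetical primary data store:

import json

import redis

cache = redis.Redis()

def load_from_database(user_id):
    # Placeholder for the slow, authoritative source (an RDBMS, a service, ...).
    return {"id": user_id, "name": "Alice"}

def get_user(user_id):
    key = "user:%s" % user_id
    cached = cache.get(key)
    if cached is not None:                    # hit: served straight from RAM
        return json.loads(cached)
    user = load_from_database(user_id)        # miss: recompute or re-read
    cache.set(key, json.dumps(user), ex=300)  # keep it for five minutes
    return user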
Summary
For most key-value stores, the secret to their speed lies in their simplicity. The path to retrieve data is a direct request to the object in memory or on disk; relationships between data items do not have to be calculated by a query language, and no query optimization is performed. Key-value stores can run on distributed systems without worrying about where to store indexes, how much data exists on each node, or the speed of the network within the distributed system: they just work.
Some key-value stores, like Aerospike, take advantage of other techniques to extend performance, such as using SSDs or flash storage and implementing secondary indexes, continuing to push the limits of today's technology to places we have not yet conceived.
Link:
https://www.aerospike.com/what-is-a-key-value-store/