CC6 - Chapter 2 Flashcards
is the process of discovering, analyzing, and scoping data requirements
, and then representing and communicating these data requirements in a precise form called the data model. Data modeling is a critical component of data management.
Data modeling
is answering the question of “how”
* How the data will be gathered
* How the data will analysed
* How the data requirements will be grouped depending on their subset
* After that processes makakabuo na ng data model by communicating the data requirements.
Data modeling
are critical to effective management of data. They:
Provide a common vocabulary around data
Capture and document explicit knowledge about an organization’s data and systems
Serve as a primary communications tool during projects
Provide the starting point for customization, integration, or even replacement of an application
Data models
Goals and Principles
Confirming and documenting understanding of different perspectives facilitates:
Formalization: A data model documents a concise definition of data structures and relationships
. It enables assessment of how data is affected by implemented business rules, for current as-is states or desired target states.
Scope definition: A data model can help explain the boundaries for data context and implementation
of purchased application packages, projects, initiatives, or existing systems.
Knowledge retention/documentation: A data model can preserve corporate memory
regarding a system or project by capturing knowledge in an explicit form. It serves as documentation for future projects to use as the as-is version.
is most frequently performed in the context of systems development and maintenance efforts, known as the system development lifecycle (SDLC)
.
Data modeling
is a representation of something that exists or a pattern
for something to be made. A model can contain one or more diagrams.
model
describes an organization’s data as the organization understands it
, or as the organization wants it to be. A data model contains a set of symbols with text labels that attempts visually to represent data requirements as communicated to the data modeler, for a specific set of data that can range in size from small, for a project, to large, for an organization.
Data model
: Data used to classify and assign types
to things. For example, customers classified by market categories or business sectors; products classified by color, model, size, etc.; orders classified by whether they are open or closed.
Category information
: Basic profiles of resources needed conduct operational processes such as Product, Customer, Supplier, Facility, Organization, and Account.
Resource information
: Data created while operational processes
are in progress. Examples include Customer Orders, Supplier Invoices, Cash Withdrawal, and Business Meetings.
Business event information
: is often produced through point-of-sale systems
(either in stores or online).
Detail transaction information
- is a thing about which an
organization collects information
. - sometimes referred to as the
nouns of an organization
. - can be thought of as the answer to a fundamental question –
who, what, when, where, why, or how
– or to a combination of these questions.
entity
are the occurrences or values of a particular entity
Entity instances
Entity -/ type, instance
Entity – Jane, Employee
Entity type – Employee
Entity instance – Jane
Entity – Raine, Lecturer
Entity type – Lecturer
Entity instance – Raine
In _________ the term relationship
is often used, _________________ the term navigation path
is often used, and in _____________ terms such as **edge
or link
**are used, for example._______ can also vary based on level of detail. A relationship at the conceptual and logical levels is called a relationship, but a relationship at the physical level may be called by other names, such as constraint or reference, depending on the database technology.
relational schemes
dimensional schemes
NoSQL schemes
Relationship aliases
Relationships between two entities
Cardinality is represented by the symbols that appear on both ends of a relationship line.
Data rules are specified and enforced through cardinality.
* Without cardinality, the most one can say about a relationship is that two entities are connected in some way.
The number of entities
in a relationship is the __________________ of the relationship. The most common are unary, binary, and ternary relationships
‘arity’
relationship involves only one entity
. A one-to-many
recursive relationship describes a hierarchy, whereas a many-to-many
relationship describes a network or graph. In a hierarchy, an entity instance has at most one parent (or higher-level entity). In relational modeling, child entities are on the many side of the relationship, with parent entities on the one side of the relationship. Ina network, an entity instance can have more than one parent.
unary (also known as a recursive or self-referencing)
An arity of two is also known as _____________. A binary relationship, the most common on a traditional data model diagram, involves two entities.
binary
An arity of three, known as ________, is a relationship that includes three entities. An example in fact-based modeling (object-role notation) appears in Figure 35. Here Student can register for a particular Course in a given Semester.
ternary
- is used in
physical and sometimes logical relational data
modelling schemes to represent a relationship. - may be
created implicitly when a relationship is defined between two entities
, depending on the database technology or data modeling tool, and whether the two entities involved have mutual dependencies.
foreign key
(also called a key) is a set of one or more attributes that uniquely defines an instance of an entity
. This section defines types of keys by construction
(simple, compound, composite, surrogate)
and function
(candidate, primary, alternate).
identifier
is one attribute that uniquely identifies an entity instance.
* Ex. Universal Product Codes (UPCs) and Vehicle Identification Numbers(VINs).
simple key
- is also an example of a
simple key
. - is a
unique identifier for a table
. Often a counter and always system-generated without intelligence, a surrogate key is an integer whose meaning is unrelated to its face value.
surrogate key
is a set of two or more
attributes that together uniquely identify an entity instance. Ex. Phone number (area code + exchange + local number).
compound key
contains one compound key
and at least one other simple or compound key or non-key attribute.
composite key
A is any set of attributes
that uniquely identify an entity instance.
super key
A is a minimal set of one or more attributes
(i.e., a simple or compound key) that identifies the entity instance to which it belongs.
candidate key
is one or more attributes that a business professional
would use to retrieve a single entity instance.
business key
is the candidate key that is chosen to be the unique identifier for an entity
.
primary key
can still be used to find specific entity instances
. Often the primary key is a surrogate key and the ____________________ are business keys.
alternate key
is one where the primary key contains only attentityributes that belong to that entity.
independent entity
is one where the primary key contains at least one attribute from another entity
.
dependent entity
: Domains that specify the standard types of data
one can have in an attribute assigned to that domain. For example, Integer, Character(30), and Date are all data type domains.
Data Type
: Domains that use patterns
including templates and masks, such as are found in postal codes and phone numbers, and character limitations (alphanumeric only, alphanumeric with certain special characters allowed, etc.) to define valid values.
Data Format
: Domains that contain a finite set of values
. These are familiar to many people from functionality like dropdown lists.
* For example, the list domain for OrderStatusCode can restrict values to only {Open, Shipped, Closed, Returned}.
List
: Domains that allow all values of the same data type
that are between one or more minimum and/or maximum values. Some ranges can be open-ended
.
* For example, OrderDeliveryDate must be between OrderDate and three months in the future.
Range
: Domains defined by the rules
that values must comply with in order to be valid. These include rules comparing values to calculated values or other attribute values in a relation or set.
* For example, ItemPrice must be greater than ItemCost.
Rule-based
The use of schemes depends in part on the database being built
, as some are suited to particular technologies
Data Model Schemes
CDM, LDM, PDM
- In aCDM, you can define data items and entity attributes. In aLDM, you can only define entity attributes.
- In theCDM, the foreign attribute migration does not occur until you generate aLDMorPDM.
- In theLDM, the foreign attribute migrates immediately.
- Conceptual, logical, physical data models(PDM)
First articulated by Dr. Edward Codd in 1970, _______________ provides a systematic way to organize data
so that they reflected their meaning(Codd, 1970). This approach had the additional effect of reducing redundancy in data storage
relational theory
The concept of _________ started from a joint research project conducted by General Mills and Dartmouth College in the 1960’s. 33 In dimensional models, data is structured to optimize the query and analysis of large amounts of data
. In contrast, operational systems that support transaction processing are optimized for fast processing of individual transactions.
dimensional modeling
The three main types of change are sometimes known by ORC.
-
Overwrite (Type 1): The
new value overwrites the old value
in place. -
New Row (Type 2): The n
ew values are written in a new row
, and the old row is marked as not current. -
New Column (Type 3): Multiple instances of a
value are listed in columns
on the same row, and a new value means writing the values in the series one spot down to make space at the front for the new value. The last value is discarded.
is the term given to normalizing the flat, single-table, dimensional structure
in a star schema into the respective component hierarchical or network structures.
Snowflaking
stands for the meaning or description of a single row of data
in a fact table; this is the most detail any row will have.
grain
are built with the entire organization in minD
instead of just a particular project; this allows these dimensions to be shared across dimensional models, due to containing consistent terminology and values.
Conformed dimensions
use standardized definitions of termS
across individual marts. Different business users may use the same term in different ways
.
‘Customer additions’ may be different from ‘gross additions’ or ‘adjusted additions.’
Conformed facts
- is a
graphical language
for modeling software. - has a
variety of notations
of which one (the class model) concerns databases. - class model specifies
classes
(entity types) and theirrelationship
types (Blaha, 2013).
Unified Modeling Language (UML)
has Operations or Methods (also called its “behavior”). Class behavior is only loosely connected to business logic because it still needs to be sequenced and timed. In ER terms, the table has stored procedures/triggers. Class Operations can be:
class
, a family of conceptual modeling languages
, originated in the late 1970s. Fact-based languages view the world in terms of objects, the facts that relate or characterize those objects, and each role that each object plays in each fact.
Fact-Based Modeling
do not use attributes
, reducing the need for intuitive or expert judgment by expressing the exact relationships between objects (both entities and values).
Fact-based models
is a model-driven engineering approach
that starts with typical examples of required information or queries presented in any external formulation familiar to users, and then verbalizes these examples at the conceptual level, in terms of simple facts expressed in a controlled natural language.
Object-Role Modeling (ORM)
is similar in notation and approach to ORM
. The numbers in Figure 43 are references to verbalizations of facts.
Fully Communication Oriented Modeling (FCO-IM)
are used when data values must be associated in chronological order and with specific time values.
Time-based patterns
is a detail-oriented, time-based, and uniquely linked set of normalized tables
that support one or more functional areas of business. Itis a hybrid approach, encompassing the best of breed between third normal form and star schema. Data Vaults are designed specifically to meet the needs of enterprise data warehouses.
Data Vault
is a technique suited for information that changes overtime in both structure and content
. It provides graphical notation used for conceptual modeling similar to traditional data modeling, with extensions for working with temporal data.
Anchor Modeling
is a name for the category of databases
built on non-relational technology
.
NoSQL
Instead of taking a business subject and breaking it up into multiple relational structures, document databases frequently store the business subject in one structure called a ___________.
document
databases allow an application to store its data in only two columns
(‘key’ and ‘value’), with the feature of storing both simple (e.g., dates, numbers, codes) and complex information (unformatted text, video, music, documents, photos) stored within the ‘value’ column.
Key-value
Out of the four types of NoSQL databases, _________________ is closest to the RDBMS
. Both have a similar way of looking at data as rows and values.
column-oriented
A____________ database is designed for data whose relations are well represented
as a set of nodes with an undetermined number of connections between these nodes.
graph
: This embodies the ‘real world’ view
of the enterprise being modeled in the database. It represents the current ‘best model’ or ‘way of doing business’
for the enterprise.
Conceptual
: The various users of the database management system operate on subsets
of the total enterprise model that are relevant to their particular needs. These subsets are represented as ‘external schemas’.
External
: The ‘machine view’ of the data
is described by the internal schema. This schema describes the stored representation of the enterprise’s information
Internal
This section provides an overview of conceptual, logical, and physical data modeling.
Data Model
A ____________ captures the high-level data requirements
as a collection of related concepts. It contains only the basic and critical business entities within a given realm and function, with a description of each entity and the relationships between entities.
conceptual data model
A _____________________ is a detailed representation of data requirements,
usually in support of a specific usage context, such as application requirements. Logical data models are still independent of any technology or specific implementation constraints.
* often begins as an extension of a conceptual data model
logical data model
A is in many cases a fully-attributed perspective of the dimensional conceptual data model, as illustrated in Figure 49.
dimensional logical data model
A ___________ represents a detailed technical solution
, often using the logical data model as a starting point and then adapted to work within a set of hardware, software, and network tools. Physical data models are built for a particular technology.
physical data model (PDM)
A variant of a physical scheme is a , used for data in motion between systems
.
This model describes the structure of data
being passed between systems as packets or messages. When sending data through web services, an Enterprise Service Bus (ESB), or through Enterprise Application Integration (EAI), the canonical model describes what data structure the sending service and any receiving services should use.
Canonical Model
is a virtual table
.
provide a means to look at data from one or many tables that contain or reference the actual attributes. A standard view runs SQL to retrieve data at the point when an attribute in the view is requested. An instantiated (often called ‘materialized’) view runs at a predetermined time. Views are used to simplify queries, control data access, and rename columns, without the redundancy and loss of referential integrity due to denormalization.
view
refers to the process of splitting a table
. It is performed to facilitate archiving and to improve retrieval performance
Partitioning
Vertically vs. Horizontally split
-
Vertically split: To reduce query sets,
create subset tables that contain subsets of columns
.
For example, split a customer table in two based on whether the fields are mostly static or mostly volatile (to improve load / index performance), or based on whether the fields are commonly or uncommonly included in queries (to improve table scan performance). -
Horizontally split: To reduce query sets,
create subset tables using the value of a column as the differentiator
.
For example, create regional customer tables that contain only customers in a specific region.
- is the deliberate
transformation of normalized logical data model entities into physical tables
with redundant or duplicate data structures. There are several reasons to denormalize data. - can also be used to
enforce user security
by segregating data into multiple views or copies of tables according to access needs. This process does introduce a risk of data errors due to duplication.
Denormalization
In dimensional data modeling, is called collapsing or combining. If each dimension is collapsed into a single structure
, the resulting data model is called a Star Schema (see Figure 51). If the dimensions are not collapsed
, the resulting data model is called a Snowflake (See Figure 49).
denormalization
is the process of applying rules in order to organize business complexity into stable data structures
. The basic goal of normalization is to keep each attribute in only one place to eliminate redundancy and the inconsistencies that can result from redundancy.
Normalization
: Ensures each entity has a valid primary key
, and every attribute depends on the primary key
; removes repeating groups, and ensures each attribute is atomic(not multi-valued). 1NF includes the resolution of many-to-many relationships with an additional entity often called an associative entity.
First normal form (1NF)
: Ensures each entity has the minimal primary key
and that every attribute depends on the complete primary key.
Second normal form (2NF)
: Ensures each entity has no hidden primary key
s and that each attribute depends on no attributes outside the key
(“the key, the whole key and nothing but the key”).
Third normal form (3NF)
: Resolves overlapping composite candidate keys
.
* A candidate key is either a primary or an alternate key
.
* ‘Composite’ means more than one
(i.e., two or more attributes in an entity’s primary or alternate keys), and
* ‘overlapping’ means there are hidden business rules between the keys
.
Boyce / Codd normal form (BCNF)
: Resolves all many-to-many-to-many relationships
(and beyond) in pairs until they cannot be broken down into any smaller pieces.
Fourth normal form (4NF)
: Resolves inter-entity dependencies
into basic pairs, and all join dependencies use parts of primary keys.
Fifth normal form (5NF)
is the removal of details
in such a way as to broaden applicability to a wide class of situations while preserving the important properties and essential nature from concepts or subjects.
includes generalization and specialization
.
* Generalization groups the common attributes and relationships
of entities into super type entities, while specialization separates distinguishing attributes within an entity into subtype entities.
* This specialization
is usually based on attribute values
within an entity instance.
Abstraction
is the concept of exposing only the required essential characteristics and behavior with respect to a context.
Abstraction