Business Intelligence & Data Management Flashcards
Data
Unprocessed raw facts, items that are the most elementary descriptions of things, events, activities, and transactions.
Information
Organized data that has meaning and value
Knowledge
Processed data or information that is applicable to a business decision problem
Structured data
Data that has structure, what you typically find in a database/sheet
Unstructured data
Data that has no structure, that is complex. Data that you find ‘in the wild’. E.g. images, sound, video, written text.
Data management
discipline that focuses on the proper generation, storage and retrieval of data
Database
A shared and integrated computer structure that stores a collection of end user data and metadata.
Database management System (DBMS)
Collection of programs that manages the DB structure and controls access to the database.
Single user database
Database that supports only one user at a time. E.g. desktop database
Multiuser database
Database that supports multiple users at a time. E.g. Workgroup database or enterprise database.
Centralized database
Database for which the data is stored at a single site.
Distributed database
Database for for the data is stores across several sites
Operational database
Also known as transactional or production database. A database that support an organization’s day to day operations
Data warehouse (description)
A system with the primary focus of storing data to be used to generate information to make tactical or strategic decisions.
Semi-structured data
Data that you have already prepared to some extent.
Extensible Markup Language (XML)
A special language used to represent and manipulate data elements in a textual format.
XML Database
Database that supports the storage and management of semi-structured XML data
Database design
Refers to the activities that focus on the design of a database. If done properly, it requires the designer to identify precisely the database’s expected use.
Data model
A simple representation of a more complex real world data structure.
Entity
Any physical or conceptual object about which data is to be collected or stored.
Attribute
Characteristics of an entity (e.g. weight, class, size, value, name)
Relationship
AKA Cardinality. The association between entities. Can be:
One to many (1, M)
Many to many (M, N)
One to one (1, 1)
Constraint
Restriction placed on an attribute in a database. E.g. Employee age must be between 16 and 99.
Table
Contains rows and columns . Is a persistent representation of logical relations.
Relational table
Table for which each row can be uniquely identified
Primary key
Attribute or combination of attributes that uniquely identify an entity (or row in database). Is also a super key and a primary key. Primary key is always underlined.
Key
One or more attributes that determine other attributes. E.g. an order ID determines what products where bought by who.
Composite key
A uniquely identifying key that can only uniquely identify an entity because it is comprised of two or more non-uniquely identifying keys. For example, a postal code + house number to identify a house.
Functional dependence
Attribute Y is functionally dependent if attribute X determines Y.
Full functional dependence
If Attribute X determines attribute Y, but attribute Y is not determined by attribute Z (or any other attribute).
Super key
Key that uniquely identifies each rows, that determines all of the entity’s attributes
Candidate key
A minimal (irreducible) superkey. A superkey that does not contain a subset of attributes that is itself a superkey.
Entity integrity
Each primary key must be unique as to ensure that a each entity is uniquely identified by only one primary key.
Controlled redundancy
How relational databases work. By placing foreign keys in tables in database, you are able to refer to an entity from another table. The foreign key is present both in the referring and referred table, thus storing redundant information.
Referential integrity
Whether the foreign key contains a value, that value refers to an existing valid tuple in another relation
Secondary key
A key that is used strictly for data retrieval purposes only. This key doesn’t necessarily result in a unique outcome (multiple entities can have a shared secondary key).
One-to-many (1:M) relationship
For example a mother (1) has two or more children (M).
In this case, the kids have (1) mother but she can have more (M) children.
One-to-one (1:1) relationship
For example a car can (at one time) only be driven by one person, and that person can only drive one car (at one time).
Many-to-many (M:N) relationships
For example a company can have multiple employees and each employee can have multiple employers (jobs).
In order to make this relationship in a database you need a compose entity. This is a table that links each unique combined relationship.
Compose entity
AKA bridge entity or associative entity. A (linking) table that is used to link two other tables that have a M:N relationship. The compose entity has the primary keys of the other to tables, as to indicate the unique relationship.
Normalization
The process of evaluating and correcting table structure to minimize data redundancies. There are 4+1 levels of data redundancy. It is not always required to go to the highest normalization level. More normalization means more join operations.
Denormalization
The process of lowering the normal form. For example, 3NF can be converted to 2NF through denormalization. It decreases the number of required operations to manage the database (increasing performance), increases the data redundancy.
Data redundancy anomalies
There are three types:
- Update anomalies
- Insertion anomalies
- Deletion anomalies
If the same data is stored in multiple tables, you can accidentally forget to update all the versions of the data, you might accidentally delete data you still need, etc.
Internal data
Data that comes from within the company
External data
Data that comes from elsewhere but is useful for a company
Business Intelligence
Is the combination of data warehousing and descriptive analytics. It is an umbrella term that combines processes, technology and tools.
Business Analytics
Is the combination of predictive and prescriptive analytics. However in this course both refer to the same thing.
Database system
Consists of: Data, Software, Hardware, Users
Relational database
Allows data to be grouped into table and sets relationships between tables
Structured Query Language (SQL)
Most popular querying language for relational databases. SQL is used at the backend and frontend for many databases management systems.
Join (SQL)
Joins two tabel together in multiple different ways:
(Inner) Join - returns the entries that are shared between two tables
Left/Right outer join - Returns the entire left/right table plus shared entries between both
Full outer join - Return both tables in full
Minus - Returns one table except entries that occur in common.
Production stage
Consists of:
Different platforms and databases. Internal, external, (un)structured data, inconsistent data, limited history
Extraction Transformation Load (ETL) stage
Stage that extracts data from operational / production stage. Transforms data to make it fit the warehouse. Then loads into warehouse.
Data warehouse (makeup)
Relational DBMS
Is of high quality, subject-oriented, integrated, time-variant and nonvolatile.
Data mart
Subset (or small warehouse) of a data warehouse to support a specific Business Unit
Metadata
Data about data. For example the location, meaning or origin of other data.
Business intelligence (front-end) application
Used for:
Querying and reporting
Data mining
Data visualization
Subject-oriented data
Focusses on the analysis of data for the decision makers
Provides a simple view around a particular subject by excluding irrelevant / not useful data
Integrated data
Data is combined (integrated) with data from multiple heterogeneous data sources in a clean and consistent way
Time-variant data
The ability to store data as it existed in multiple points in time.
Non-volatile data
This means that data is not updated once it’s in the datawarehouse. Only new data is being added (for example a new point in time).
Six data warehouse architectures
- Direct BI on source systems
- Canned DW
- Independent data marts with bus architecture
- Enterprise DW
- Hub & Spoke (enterprise + data marts) - Most popular
- Federated DW
DW Development approaches
Bottom-up - Start with data marts for each BU, then integrate those
Top-down - Start with one large DW, then make smaller marts for each BU
Data Lake
Database that holds raw data in its’ native format until it’s needed.
Features of DW (Data, structure, processing, security, Users)
Data Warehouse Data - Cleaned, aggregated Data structure - Structured, processed Processing -Focus on write Security - Mature Users - Business professionals and data scientists
Features of Data Late (Data, structure, processing, security, Users)
Data Lake Data - Raw Data structure - Unstructured Processing - Focus on read Security - Maturing Users - Data scientists
Data converting
Converting data types into other types. For example converting integer 0 & 1 to boolean true and false, also currency conversions.
Data clearing
Validation and filtering of data
Integral load
In this case the ETL stage loads all the data (both changed and unchanged) into the DW at one time
Delta / Incremental load
Here the ETL stage only loads the changes since the previous load