Business Intelligence & Data Management Flashcards
Data
Unprocessed raw facts, items that are the most elementary descriptions of things, events, activities, and transactions.
Information
Organized data that has meaning and value
Knowledge
Processed data or information that is applicable to a business decision problem
Structured data
Data that has structure, what you typically find in a database/sheet
Unstructured data
Data that has no structure, that is complex. Data that you find ‘in the wild’. E.g. images, sound, video, written text.
Data management
discipline that focuses on the proper generation, storage and retrieval of data
Database
A shared and integrated computer structure that stores a collection of end user data and metadata.
Database management System (DBMS)
Collection of programs that manages the DB structure and controls access to the database.
Single user database
Database that supports only one user at a time. E.g. desktop database
Multiuser database
Database that supports multiple users at a time. E.g. Workgroup database or enterprise database.
Centralized database
Database for which the data is stored at a single site.
Distributed database
Database for which the data is stored across several sites.
Operational database
Also known as a transactional or production database. A database that supports an organization's day-to-day operations.
Data warehouse (description)
A system with the primary focus of storing data to be used to generate information to make tactical or strategic decisions.
Semi-structured data
Data that already has some structure (e.g. tags or markers) but does not conform to a rigid schema; data that you have already prepared to some extent.
Extensible Markup Language (XML)
A special language used to represent and manipulate data elements in a textual format.
XML Database
Database that supports the storage and management of semi-structured XML data
Database design
Refers to the activities that focus on the design of a database. If done properly, it requires the designer to identify precisely the database’s expected use.
Data model
A simple representation of a more complex real world data structure.
Entity
Any physical or conceptual object about which data is to be collected or stored.
Attribute
Characteristics of an entity (e.g. weight, class, size, value, name)
Relationship
The association between entities, characterized by its cardinality. Can be:
One-to-many (1:M)
Many-to-many (M:N)
One-to-one (1:1)
Constraint
Restriction placed on an attribute in a database. E.g. Employee age must be between 16 and 99.
Table
Contains rows and columns. Is a persistent representation of logical relations.
Relational table
Table for which each row can be uniquely identified
Primary key
Attribute or combination of attributes that uniquely identifies an entity (or row in a database). A primary key is also a candidate key and a superkey. The primary key is always underlined.
Key
One or more attributes that determine other attributes. E.g. an order ID determines which products were bought by whom.
Composite key
A uniquely identifying key that can only uniquely identify an entity because it is comprised of two or more non-uniquely identifying keys. For example, a postal code + house number to identify a house.
Functional dependence
Attribute Y is functionally dependent on attribute X if X determines Y.
Full functional dependence
Attribute Y is fully functionally dependent on (composite) attribute X if X determines Y, but Y is not determined by any proper subset of X.
Super key
Key that uniquely identifies each row and determines all of the entity's attributes.
Candidate key
A minimal (irreducible) superkey. A superkey that does not contain a subset of attributes that is itself a superkey.
Entity integrity
Each primary key value must be unique (and not null) to ensure that each entity is uniquely identified by its primary key.
Controlled redundancy
How relational databases work: by placing foreign keys in tables, you can refer to an entity in another table. The foreign key value is present both in the referring and the referred table, thus storing redundant information in a controlled way.
Referential integrity
If the foreign key contains a value, that value must refer to an existing, valid tuple (row) in another relation.
Secondary key
A key used strictly for data retrieval purposes. It does not necessarily identify a unique row (multiple entities can share the same secondary key value).
One-to-many (1:M) relationship
For example, a mother (1) can have multiple children (M).
In this case, each child has one (1) mother, but she can have multiple (M) children.
One-to-one (1:1) relationship
For example a car can (at one time) only be driven by one person, and that person can only drive one car (at one time).
Many-to-many (M:N) relationships
For example a company can have multiple employees and each employee can have multiple employers (jobs).
In order to model this relationship in a database you need a composite entity: a table that records each unique combination.
Composite entity
AKA bridge entity or associative entity. A (linking) table used to link two other tables that have an M:N relationship. The composite entity contains the primary keys of the other two tables, indicating each unique relationship.
Normalization
The process of evaluating and correcting table structures to minimize data redundancies. There are 4+1 levels of normalization (normal forms). It is not always required to go to the highest level; more normalization means more join operations.
Denormalization
The process of lowering the normal form. For example, 3NF can be converted to 2NF through denormalization. It decreases the number of operations required to manage the database (increasing performance), but increases data redundancy.
Data redundancy anomalies
There are three types:
- Update anomalies
- Insertion anomalies
- Deletion anomalies
If the same data is stored in multiple tables, you can accidentally forget to update all copies of the data, accidentally delete data you still need, etc.
Internal data
Data that comes from within the company
External data
Data that comes from elsewhere but is useful for a company
Business Intelligence
Is the combination of data warehousing and descriptive analytics. It is an umbrella term that combines processes, technology and tools.
Business Analytics
Is the combination of predictive and prescriptive analytics. However, in this course both terms refer to the same thing.
Database system
Consists of: Data, Software, Hardware, Users
Relational database
Allows data to be grouped into tables and defines relationships between tables.
Structured Query Language (SQL)
Most popular querying language for relational databases. SQL is used at the back end and front end of many database management systems.
Join (SQL)
Joins two tables together in several different ways:
(Inner) Join - returns the entries that are shared between the two tables
Left/Right outer join - returns the entire left/right table plus the shared entries
Full outer join - returns both tables in full
Minus - returns the rows of one table except those it has in common with the other
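A minimal sketch of these join types using pandas (the tool choice and table contents are assumptions for illustration; the course likely uses plain SQL):

    import pandas as pd

    # Two hypothetical tables sharing the key column "customer_id"
    customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cas"]})
    orders = pd.DataFrame({"customer_id": [2, 3, 4], "amount": [10.0, 25.0, 7.5]})

    inner = customers.merge(orders, on="customer_id", how="inner")  # only shared keys (2 and 3)
    left = customers.merge(orders, on="customer_id", how="left")    # whole left table plus matches
    full = customers.merge(orders, on="customer_id", how="outer")   # both tables in full

    # "Minus": rows of the left table that have no match in the right table
    minus = customers[~customers["customer_id"].isin(orders["customer_id"])]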
Production stage
Consists of:
Different platforms and databases. Internal, external, (un)structured data, inconsistent data, limited history
Extraction Transformation Load (ETL) stage
Stage that extracts data from operational / production stage. Transforms data to make it fit the warehouse. Then loads into warehouse.
Data warehouse (makeup)
Relational DBMS
Is of high quality, subject-oriented, integrated, time-variant and nonvolatile.
Data mart
Subset (or small warehouse) of a data warehouse to support a specific Business Unit
Metadata
Data about data. For example the location, meaning or origin of other data.
Business intelligence (front-end) application
Used for:
Querying and reporting
Data mining
Data visualization
Subject-oriented data
Focuses on the analysis of data for decision makers
Provides a simple view around a particular subject by excluding irrelevant / not useful data
Integrated data
Data is combined (integrated) with data from multiple heterogeneous data sources in a clean and consistent way
Time-variant data
The ability to store data as it existed at multiple points in time.
Non-volatile data
This means that data is not updated once it is in the data warehouse. Only new data is added (for example, a new point in time).
Six data warehouse architectures
- Direct BI on source systems
- Canned DW
- Independent data marts with bus architecture
- Enterprise DW
- Hub & Spoke (enterprise + data marts) - Most popular
- Federated DW
DW Development approaches
Bottom-up - Start with data marts for each BU, then integrate those
Top-down - Start with one large DW, then make smaller marts for each BU
Data Lake
Database that holds raw data in its native format until it is needed.
Features of DW (Data, structure, processing, security, Users)
Data Warehouse:
Data - Cleaned, aggregated
Structure - Structured, processed
Processing - Focus on write
Security - Mature
Users - Business professionals and data scientists
Features of Data Lake (Data, structure, processing, security, Users)
Data Lake:
Data - Raw
Structure - Unstructured
Processing - Focus on read
Security - Maturing
Users - Data scientists
Data converting
Converting data types into other types. For example, converting the integers 0 and 1 to the booleans true and false, or converting currencies.
Data cleansing
Validation and filtering of data
Integral load
In this case the ETL stage loads all the data (both changed and unchanged) into the DW at one time
Delta / Incremental load
Here the ETL stage only loads the changes since the previous load
Online Analytical Processing (OLAP)
Enables users to interactively analyze multi-dimensional data from multiple perspectives. It is an intuitive way to organize large amounts of data; the user acts as a sort of explorer. Data is structured as a cube.
Roll up (OLAP)
Aggregating measures to a higher dimension level (from quarter to year)
Drill down (OLAP)
Reverse of roll-up (e.g. from months to days)
Slice & dice (OLAP)
Selecting subsets of cells that satisfy a certain condition
Pivots (OLAP)
Rotates the data axes
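A rough pandas illustration of roll-up, drill-down, slice and pivot (an assumption: real OLAP engines operate on pre-built cubes, and all column names here are invented):

    import pandas as pd

    # Hypothetical fact data: one row per sale
    sales = pd.DataFrame({
        "year": [2023, 2023, 2024, 2024],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "region": ["EU", "US", "EU", "US"],
        "revenue": [100, 150, 120, 130],
    })

    roll_up = sales.groupby("year")["revenue"].sum()                  # quarter -> year
    drill_down = sales.groupby(["year", "quarter"])["revenue"].sum()  # back down to quarters
    slice_eu = sales[sales["region"] == "EU"]                         # slice: fix one dimension
    pivot = sales.pivot_table(values="revenue", index="region",
                              columns="year", aggfunc="sum")          # rotate the axes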
Lattice Cubes
The set of all cuboids (one for every combination of dimension aggregation levels), ordered from the most detailed base cuboid to the fully aggregated apex cuboid, forming a lattice.
OLAP Software basics
OLAP Software is based on:
Measures - the numerical data (the things you want to compute or aggregate)
Dimensions - the different perspectives on those measures, typically categorical data
Dimension hierarchies - the structure behind the dimensions
Star scheme
A table layout with a fact table (containing numeric data) in the middle, with dimension tables connected to it (containing categorical data).
This scheme limits the number of required SQL joins, increasing performance.
Snowflake scheme
Variant / refinement of the star scheme. Some dimensional data is normalized into smaller dimension tables.
Central fact table
The central table in a star or snowflake scheme. Multi-valued since all numeric data is stored within the fact table.
Dimension table
The surrounding tables in a star or snowflake scheme. They are single-valued and use meaningless surrogate keys.
Surrogate key
Primary keys that replace the original (operational) primary keys, in an effort to save storage space. The dimension table identified by a surrogate key has many columns and always a 1:M relation with the fact table.
Dimensions
The different perspectives on the measures, typically categorical data (e.g. time, location, product).
Conforming dimensions
Dimension tables that are shared between multiple fact tables.
Data mining
The process of discovering new valuable knowledge in databases. Can be both hypothesis (test a claim) and discovery (explore for knowledge) driven.
Draws on machine learning, statistics and databases.
Big data’s five V’s
Volume - Long and wide (many rows and columns)
Velocity - High speed
Variety - lots of different data: numbers, text, images, etc.
Veracity - Quality issues
Value - Should be valuable to the organization
Deduction
Example:
All cows give milk.
Betsie is a cow.
Therefore, Betsie gives milk.
Abduction
Example:
All cows give milk.
Betsie gives milk.
Therefore, Betsie is a cow.
This reasoning runs the wrong way around, of course: the conclusion is only a plausible hypothesis.
Induction
Example:
Betsie is a cow and gives milk.
Clarabelle is a cow and gives milk.
Therefore, all cows give milk.
Data mining with machine learning works in a similar way: it generalizes from examples.
Regression analysis
The analysis of statistical relationship between variables:
Y = alpha + beta * X + residual
R-square
A number between 0 and 1 that indicates the goodness of fit: 1 is perfect, 0 is the worst.
Ordinary least Squares (OLS)
Also referred to as linear regression; a method of estimation used in linear regression.
Minimizes the sum of squared errors between the predicted and actual values of Y.
Squares are used because otherwise positive and negative deviations would cancel each other out.
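A small NumPy sketch of OLS with one predictor (the data points are made up; alpha, beta and the R-square follow the definitions above):

    import numpy as np

    # Hypothetical data roughly following Y = alpha + beta * X + residual
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

    # Closed-form OLS estimates of slope and intercept
    beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    alpha = y.mean() - beta * x.mean()

    # R-square: 1 - residual sum of squares / total sum of squares
    residuals = y - (alpha + beta * x)
    r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)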
Regression model selection
Forward selection
Create a model with one independent variable and keep adding variables until additional ones only add noise.
Backward selection
Create a model with all the independent variables and remove them until you are left with only those that influence the dependent variable.
K-nearest neighbors
A classification method: based on a training set you build a model that can determine the class of new data. You also split a validation set from the training set to determine the value of K. After determining the best K, use the testing set as a final check.
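A minimal k-NN sketch with scikit-learn (the library, dataset and split sizes are assumptions, not part of the card):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # Split off a test set, then split the remainder into training and validation sets
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    # Pick K using accuracy on the validation set
    best_k = max(range(1, 16),
                 key=lambda k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val))

    # Final check on the held-out testing set
    final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    print(best_k, final_model.score(X_test, y_test))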
Euclidian distance
To calculate the distance between point A and B:
Square root of: (Xb-Xa)^2 + (Yb-Ya)^2
It is based on the Pythagorean theorem.
Manhattan distance
To calculate the distance between A and B:
|Xb - Xa| + |Yb - Ya|
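Both distance measures as tiny Python functions for 2-D points (purely illustrative):

    import math

    def euclidean(a, b):
        # sqrt((xb - xa)^2 + (yb - ya)^2)
        return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

    def manhattan(a, b):
        # |xb - xa| + |yb - ya|
        return abs(b[0] - a[0]) + abs(b[1] - a[1])

    print(euclidean((1, 1), (4, 5)))  # 5.0
    print(manhattan((1, 1), (4, 5)))  # 7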
Choosing value of K
If K is too small, the model will be sensitive to noise and not representative of the class.
If K is too large, the model may include points from other classes.
To find the best K, test a range of values on the validation set and pick the one that performs best.
Curse of dimensionality
With a high number of dimensions (many attributes), everything is 'far' apart. As a result there are no nearby points unless an extremely large amount of data is available (exponential in the number of dimensions). Splitting each of 10 dimensions in two already gives 2^10 = 1024 subcubes; you need more than 1024 data points to have, on average, one per subcube.
Naive Bayes formula
P(Y|X) = (P(X|Y) * P(Y)) / P(X)
OR
P(Y|X1, X2, ..., Xn) = P(X1, X2, ..., Xn|Y) * P(Y) / P(X1, X2, ..., Xn)
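A hand-rolled sketch of the computation for one binary class with two attributes (all probabilities are invented training-set estimates):

    # Hypothetical estimates from a training set
    p_yes, p_no = 9 / 14, 5 / 14                # priors P(Y)
    p_sunny_yes, p_sunny_no = 2 / 9, 3 / 5      # P(outlook = sunny | Y)
    p_high_yes, p_high_no = 3 / 9, 4 / 5        # P(humidity = high | Y)

    # Unnormalized posteriors: P(Y | X1, X2) is proportional to P(X1|Y) * P(X2|Y) * P(Y)
    score_yes = p_sunny_yes * p_high_yes * p_yes
    score_no = p_sunny_no * p_high_no * p_no

    # Normalizing the two scores replaces dividing by P(X1, X2)
    p_yes_given_x = score_yes / (score_yes + score_no)
    print(p_yes_given_x)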
Precision
The fraction of predictions for a class that are correct.
A precision of 80% means that 80% of the predictions of that class were correct.
Formula:
True positives / (true positives + false positives)
Recall
The fraction of items of a class that were correctly predicted as that class.
A recall of 80% means that 80% of the items of that class were found.
Formula:
True positives / (true positives + false negatives)
F1-Measure
Combination of precision and recall in one score
Formula: (2 * Precision * Recall) / (Precision + recall)
ROC Curve
The ROC curve is a plot of the recall (true positive rate) against the false positive rate (1 - specificity).
Decision Tree
A data-driven classification method that is easy to explain and can easily be transformed into a ruleset
Information Gain
A method to determine the best way to split data in a decision tree. Compare the impurity before the split (M0) with the weighted impurity after each candidate split (e.g. M12 or M34), computed with e.g. the Gini index; the split with the largest reduction in impurity is the best.
Gini Index
1 - (the probability of "yes")^2 - (the probability of "no")^2 (in general: 1 minus the sum of the squared class probabilities).
The lower the index, the better (purer) the split.
Entropy Measure
Similar to the Gini index, but each class proportion p(k) is multiplied by its base-2 logarithm: entropy = -sum over k of p(k) * log2(p(k)).
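A short sketch of both impurity measures for one candidate split (class counts are invented):

    import math

    def gini(counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    # Impurity before the split vs. weighted impurity after the split
    parent = [5, 5]                  # 5 "yes", 5 "no"
    left, right = [4, 1], [1, 4]     # the two child nodes of a candidate split
    weighted = (5 / 10) * gini(left) + (5 / 10) * gini(right)
    gain = gini(parent) - weighted   # information gain of this split (higher is better)
    print(gini(parent), weighted, gain)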
Recall Formula
tp/(tp+fn)
Precision Formula
tp/(tp+fp)
Accuracy Formula
(tp+tn)/(tp+tn+fp+fn) or (tp+tn) / n
F1-Measure Formula
2*(Precision * Recall) / (Precision + Recall)
Error Rate Formula
(fp + fn) / N
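The metric formulas above in one short sketch, starting from hypothetical confusion-matrix counts:

    tp, tn, fp, fn = 40, 30, 10, 20                       # made-up counts
    n = tp + tn + fp + fn

    recall = tp / (tp + fn)                               # 0.67
    precision = tp / (tp + fp)                            # 0.80
    accuracy = (tp + tn) / n                              # 0.70
    error_rate = (fp + fn) / n                            # 0.30
    f1 = 2 * (precision * recall) / (precision + recall)  # 0.73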
Univariate
Statistical analysis of data consisting of observations on only a single characteristic or attribute. For example, the salaries of workers in an industry.
Multivariate
Statistical analysis of data consisting of observations on two or more characteristics or attributes. For example, the salaries and ages of workers in an industry.
Binary Split
Trees with only binary (two-way) splits are called binary trees
Multiway split
Trees with more than two-way splits are called multiway trees
Overfitting
A model that fits every point of the training data perfectly but fails to capture the underlying structure, resulting in poor predictive or classification power on testing data.
Underfitting
A model that does not fit the training data well enough, so that it is unable to capture the underlying (actual) structure.
Parametric
An analytical method that assumes the underlying data follow a normal distribution and have homogeneous sample variances.
Non-parametric
An analytical method that does NOT assume the underlying data follow a normal distribution and homogeneous sample variances.
Supervised training
A training method that requires the actual class of the training examples to be known, so the model can be trained to reproduce those classes on testing data.
Unsupervised training
A training method that does not require you to know the actual class for the training set.
Antecedent
Something that existed before, or logically precedes something else
Consequent
Something that exists after, or logically follows something else (following as a result / effect)
Item
A single attribute value pair. For example; Outlook = sunny
Itemset
All items occurring in a transaction or record. For example: Outlook = sunny, Humidity = high, PlayTennis = No
Frequent itemset
An itemset whose support is at least K (where K is predefined by the user)
Association rule
A rule that follows the IF (antecedent) - THEN (consequent) format.
For example: IF humidity = high AND play = no THEN windy = false AND outlook = sunny
Apriori Property
The property that every subset of a frequent itemset must itself be frequent. It underlies the Apriori algorithm, a seminal algorithm for mining frequent itemsets for Boolean association rules.
Example:
If (milk bread jam) is a frequent itemset, then (milk bread) is also a frequent itemset
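A tiny sketch of the property: if an itemset is frequent, every subset of it has at least the same support (the transactions are made up):

    from itertools import combinations

    transactions = [{"milk", "bread", "jam"}, {"milk", "bread"}, {"milk", "jam"}, {"bread"}]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    frequent = {"milk", "bread"}  # support 0.5 in these transactions
    # Apriori property: every non-empty subset is at least as frequent
    assert all(support(set(sub)) >= support(frequent)
               for r in range(1, len(frequent))
               for sub in combinations(frequent, r))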
Interesting association rules
Association rules can be interesting for two reasons:
Objective measures & Subjective measures
Objective association rule measure
An association rule is objectively interesting if it has sufficient:
Support
Confidence
Lift
Subjective association rule measure
An association rule is subjectively interesting if it:
Is unexpected
Is actionable
(Association rule) Support Formula
frequency (X, Y) / N
(Association rule) Confidence Formula
frequency (X, Y) / frequency (X)
(Association rule) Lift Formula
Support(X, Y) / (Support(X) * Support(Y)) OR Confidence / Prob(RHS) OR Prob(LHS and RHS) / (Prob(LHS) * Prob(RHS))
Lift value meaning
If Lift = 0, LHS and RHS never occur together.
If Lift = 1, LHS and RHS are independent.
If Lift > 1, there is a positive association between LHS and RHS (LHS is a strong indicator for RHS).
If Lift < 1, there is a negative association (the presence of LHS makes RHS less likely).
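The three measures computed for one hypothetical rule (milk -> bread) over made-up transactions:

    transactions = [{"milk", "bread"}, {"milk", "bread"}, {"milk"}, {"bread"}, {"jam"}]
    n = len(transactions)

    freq_xy = sum({"milk", "bread"} <= t for t in transactions)  # 2
    freq_x = sum("milk" in t for t in transactions)              # 3
    freq_y = sum("bread" in t for t in transactions)             # 3

    support = freq_xy / n                                        # 0.4
    confidence = freq_xy / freq_x                                # 0.67
    lift = support / ((freq_x / n) * (freq_y / n))               # 0.4 / 0.36 = 1.11 (> 1)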
Confusion Matrix order
True Negative, False Positive
False Negative, True Positive
Decision tree node meaning
Red node = Non acceptor
Blue node = Acceptor
Left = Non Acceptor
Right = Acceptor
Red node shows:
21
[20,1]
This means 21 records were classified as non-acceptor: 20 were truly non-acceptors (true negatives) and 1 was actually an acceptor (false negative).
Clustering
Unsupervised classification method where data is divided into natural groups such that objects in the same cluster are similar and objects in different clusters are dissimilar.
Clustering methods
Partitional algorithms:
Construct various partitions and then evaluate them by some criterion
Hierarchical algorithms:
Create a hierarchical decomposition of the objects using some criterion
Partitional Clustering Method
Pre-specify a desired number of clusters, i.e. 'K'
Assign a cluster to each object (K clusters)
Minimise sum of distances within clusters
Maximise sum of distances between clusters
In this course only K-means is considered
Centroid
A virtual object that is used to represent all the physical objects in a cluster
K-Means clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point of the cluster)
Each point is assigned to the cluster with the closest centroid
Basic algorithm is very simple
K-Means clustering algorithm
- Select K points as the initial centroids (at random)
REPEAT - Form K clusters by assigning all points to the closest centroid
- Recompute the centroid of each cluster
UNTIL the centroids don’t change
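A minimal NumPy version of the algorithm above for 2-D points (the data, K and the choice of the first K points as initial centroids are arbitrary):

    import numpy as np

    def k_means(points, k, n_iter=100):
        centroids = points[:k].copy()  # initial centroids: here simply the first K points
        for _ in range(n_iter):
            # Assign every point to the cluster with the closest centroid
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = np.argmin(dists, axis=1)
            # Recompute each centroid as the mean of its cluster
            new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centroids, centroids):  # stop when the centroids no longer change
                break
            centroids = new_centroids
        return labels, centroids

    points = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.2, 4.8]])
    print(k_means(points, k=2))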
Hierarchical Clustering Method
Hierarchical decomposition of the objects using some criterion.
Begin with a matrix with the distances between all the object pairs. Then continue with bottom-up approach (agglomerative)
Bottom-Up (agglomerative) Approach
1. Put each item in its own cluster
REPEAT
2. Find the best (closest) pair of clusters and merge them into a new cluster
UNTIL all clusters are fused together
Single linkage distance
Determined by the distance of the two closest records (nearest neighbors) in the different clusters.
Tends to produce long chains / elongated clusters.
Complete linkage distance
Determined by the maximum distance between any two records in the different clusters.
Tends to produce spherical clusters with consistent diameter.
Centroid linkage distance
Calculated as the distance between the centroids of the different clusters.
Ward’s method
Starts with n clusters, each containing a single object. These n clusters are combined to make one cluster containing all objects. At each step, the process makes a new cluster that minimizes variance, measured by an index called E (Error sum of squares)
Dendrogram (Binary tree)
Is a treelike diagram that summarizes the process of clustering. The records are drawn on the X-axis (horizontal); the height of the vertical lines reflects the distance between records or clusters. Use a cutoff value to limit the number of clusters.
Average linkage distance
Calculated as the average distance between all pairs of records between the different clusters.
Is less affected by outliers.
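A SciPy sketch showing the linkage methods named above plus a distance cutoff (library choice and data are assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 9.0]])

    # method can be "single", "complete", "average", "centroid" or "ward"
    Z = linkage(points, method="average")

    # Cut the dendrogram at a chosen distance to obtain flat clusters
    labels = fcluster(Z, t=2.0, criterion="distance")
    print(labels)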
Aspects to consider when validating clusters
Cluster interpretability
Cluster stability
Cluster separation
Number of clusters
Retrospective analytics
Descriptive analytics and diagnostic analytics
Methods: Database management, Data warehousing framework, OLAP databases & Dashboards
Prospective analytics
Predictive & Prescriptive analytics
Methods: Data Mining Process and all the python models
Static Typing
Variable type is checked at compile-time
The variable itself is defined as a certain type. All values of the variable need to be of that type.
Dynamic Typing
Variable type is checked at run-time
The variable is not defined as a certain type. The value of the variable is assigned a type
Interpreter
Program that directly executes instructions written in a programming language
Compiler
Program that transforms source code into a lower-level (machine) language
Inmon approach
Top-down, Complex, Subject- or data-driven
Enterprise-wide (atomic) DW, Feeds departmental DB’s
Kimball approach
Bottom-up, Simple, Process-oriented, Dimensional modelling. 'DW' is a collection of Data Marts.
Data marts model single business processes and share conformed dimensions.
Transient data source
Name for the supporting databases that store intermediate transformation results (ETL-stage)
Transform stage
Simple data conversions (time notation, etc)
Complex type conversions (categorical data, address formats)
Currency conversions
Language conversions
Dimension Hierarchy
The hierarchy within a dimension. For example, the time dimension: All, Year, Month, Day
Calculate number of cubes in a multi-dimensional database
(Number of levels per dimension + 1) ^ number of dimensions
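Worked example (made-up numbers): with 3 dimensions, each having 3 hierarchy levels below 'All', the cube contains (3 + 1)^3 = 64 sub-cubes.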
Star scheme: restating
Overwrite old values. Use with infrequent changes (e.g. a customer's address) or to correct errors. Kills history (previous values are overwritten).
Nominal Data
AKA Categorical data. Values themselves only serve as label. E.g. Sunny, Overcast, Rain
Ordinal Data
Imposes order, but no distance between values can be derived. E.g. Hot, Mild, Cold
Interval Data
Imposes order, and distance between values can be derived. However, there is no natural zero point. E.g. temperature in Fahrenheit or the current year.
Ratio Data
Imposes order, and distance between values can be derived. Also has a natural zero point. E.g. length, weight.
SEMMA
Sample, Explore, Modify, Model, Assess
CRISP-DM
Cross Industry Standard Process for Data Mining
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation
(Steps 1-3 take about 85% of project time.)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
Laplace Smoothing
Prevents nullification of the Naive Bayes formula by preventing any conditional probability from being zero.
Formula: (Count(A, C) + 1) / (Count(C) + number of possible values of the attribute)
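Worked example (invented counts): if Outlook = sunny occurs 0 times among the 5 'no' records and Outlook has 3 possible values, the smoothed estimate is (0 + 1) / (5 + 3) = 0.125 instead of 0.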
Mean Absolute Error / Deviation (MAE)
(1/n) * ∑ |e(i)|
Mean Percentage Error (MPE)
100 * (1/n) * ∑( e(i) / y(i) )
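Both error measures as a short sketch over hypothetical actual and predicted values:

    actual = [100, 120, 80, 90]
    predicted = [110, 115, 70, 95]
    errors = [a - p for a, p in zip(actual, predicted)]   # e(i) = actual - predicted

    n = len(errors)
    mae = sum(abs(e) for e in errors) / n                       # mean absolute error = 7.5
    mpe = 100 * sum(e / a for e, a in zip(errors, actual)) / n  # mean percentage error
    print(mae, mpe)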