Business Intelligence & Data Management Flashcards

1
Q

Data

A

Unprocessed raw facts, items that are the most elementary descriptions of things, events, activities, and transactions.

2
Q

Information

A

Organized data that has meaning and value

3
Q

Knowledge

A

Processed data or information that is applicable to a business decision problem

4
Q

Structured data

A

Data that has structure; what you typically find in a database or spreadsheet.

5
Q

Unstructured data

A

Data that has no structure, that is complex. Data that you find ‘in the wild’. E.g. images, sound, video, written text.

6
Q

Data management

A

A discipline that focuses on the proper generation, storage and retrieval of data.

7
Q

Database

A

A shared and integrated computer structure that stores a collection of end user data and metadata.

8
Q

Database Management System (DBMS)

A

Collection of programs that manages the DB structure and controls access to the database.

9
Q

Single user database

A

Database that supports only one user at a time. E.g. desktop database

10
Q

Multiuser database

A

Database that supports multiple users at a time. E.g. Workgroup database or enterprise database.

11
Q

Centralized database

A

Database for which the data is stored at a single site.

12
Q

Distributed database

A

Database for which the data is stored across several sites.

13
Q

Operational database

A

Also known as a transactional or production database. A database that supports an organization's day-to-day operations.

14
Q

Data warehouse (description)

A

A system with the primary focus of storing data to be used to generate information to make tactical or strategic decisions.

15
Q

Semi-structured data

A

Data that you have already prepared to some extent.

16
Q

Extensible Markup Language (XML)

A

A special language used to represent and manipulate data elements in a textual format.

17
Q

XML Database

A

Database that supports the storage and management of semi-structured XML data

18
Q

Database design

A

Refers to the activities that focus on the design of a database. If done properly, it requires the designer to identify precisely the database’s expected use.

19
Q

Data model

A

A simple representation of a more complex real world data structure.

20
Q

Entity

A

Any physical or conceptual object about which data is to be collected or stored.

21
Q

Attribute

A

Characteristics of an entity (e.g. weight, class, size, value, name)

22
Q

Relationship

A

AKA Cardinality. The association between entities. Can be:

One to many (1, M)
Many to many (M, N)
One to one (1, 1)

23
Q

Constraint

A

Restriction placed on an attribute in a database. E.g. Employee age must be between 16 and 99.

24
Q

Table

A

Contains rows and columns. A persistent representation of logical relations.

25
Relational table
Table for which each row can be uniquely identified
26
Primary key
Attribute or combination of attributes that uniquely identifies an entity (or row in a table). Is also a super key and a candidate key. The primary key is always underlined.
27
Key
One or more attributes that determine other attributes. E.g. an order ID determines which products were bought by whom.
28
Composite key
A uniquely identifying key that can only uniquely identify an entity because it is comprised of two or more non-uniquely identifying keys. For example, a postal code + house number to identify a house.
29
Functional dependence
Attribute Y is functionally dependent on attribute X if attribute X determines Y.
30
Full functional dependence
Attribute Y is fully functionally dependent on (composite) attribute X if X determines Y, but no proper subset of X determines Y.
31
Super key
Key that uniquely identifies each row; it determines all of the entity's attributes.
32
Candidate key
A minimal (irreducible) superkey. A superkey that does not contain a subset of attributes that is itself a superkey.
33
Entity integrity
Each primary key value must be unique (and non-null), to ensure that each entity is uniquely identified by its primary key.
34
Controlled redundancy
How relational databases work: by placing foreign keys in tables, you can refer to an entity in another table. The foreign key is present in both the referring and the referred table, thus storing redundant information in a controlled way.
35
Referential integrity
If the foreign key contains a value, that value must refer to an existing, valid tuple in another relation.
36
Secondary key
A key that is used strictly for data retrieval purposes only. This key doesn’t necessarily result in a unique outcome (multiple entities can have a shared secondary key).
37
One-to-many (1:M) relationship
For example, a mother (1) can have two or more children (M): each child has one (1) mother, but she can have multiple (M) children.
38
One-to-one (1:1) relationship
For example a car can (at one time) only be driven by one person, and that person can only drive one car (at one time).
39
Many-to-many (M:N) relationships
For example, a company can have multiple employees and each employee can have multiple employers (jobs). To model this relationship in a database you need a composite entity: a table that links each unique combined relationship.
40
Composite entity
AKA bridge entity or associative entity. A linking table used to connect two other tables that have a M:N relationship. The composite entity contains the primary keys of the other two tables, indicating each unique relationship.
41
Normalization
The process of evaluating and correcting table structure to minimize data redundancies. There are 4+1 normalization levels (normal forms). It is not always required to go to the highest normalization level: more normalization means more join operations.
42
Denormalization
The process of lowering the normal form; for example, 3NF can be converted to 2NF through denormalization. It decreases the number of operations required to manage the database (increasing performance), but increases data redundancy.
43
Data redundancy anomalies
There are three types:
- Update anomalies
- Insertion anomalies
- Deletion anomalies
If the same data is stored in multiple tables, you can accidentally forget to update all the copies of the data, accidentally delete data you still need, etc.
44
Internal data
Data that comes from within the company
45
External data
Data that comes from elsewhere but is useful for a company
46
Business Intelligence
Is the combination of data warehousing and descriptive analytics. It is an umbrella term that combines processes, technology and tools.
47
Business Analytics
Is the combination of predictive and prescriptive analytics. However in this course both refer to the same thing.
48
Database system
Consists of: Data, Software, Hardware, Users
49
Relational database
Allows data to be grouped into tables and defines relationships between tables.
50
Structured Query Language (SQL)
Most popular querying language for relational databases. SQL is used at the back end and front end of many database management systems.
51
Join (SQL)
Combines two tables in several different ways:
(Inner) join - returns the entries that are shared between two tables
Left/right outer join - returns the entire left/right table plus the shared entries between both
Full outer join - returns both tables in full
Minus - returns one table except the entries that occur in common
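The join types above can be tried out from Python with the built-in sqlite3 module. The tables and column names here are invented for illustration; SQLite supports EXCEPT rather than MINUS, so only the inner and left outer joins are shown.

```python
import sqlite3

# Illustrative tables (names invented): customers and their orders.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0);
""")

# Inner join: only customers that have at least one order.
inner = con.execute("""
    SELECT c.name, o.total FROM customers c
    JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

# Left outer join: every customer, with NULL (None) where no order exists.
left = con.execute("""
    SELECT c.name, o.total FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    ORDER BY c.id
""").fetchall()

print(inner)  # [('Ann', 25.0)]
print(left)   # [('Ann', 25.0), ('Bob', None)]
```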
52
Production stage
Consists of different platforms and databases: internal, external, (un)structured data, inconsistent data, limited history.
53
Extraction Transformation Load (ETL) stage
Stage that extracts data from operational / production stage. Transforms data to make it fit the warehouse. Then loads into warehouse.
54
Data warehouse (makeup)
Relational DBMS. The data is of high quality, subject-oriented, integrated, time-variant and nonvolatile.
55
Data mart
Subset (or small warehouse) of a data warehouse to support a specific Business Unit
56
Metadata
Data about data. For example the location, meaning or origin of other data.
57
Business intelligence (front-end) application
Used for: querying and reporting, data mining, and data visualization.
58
Subject-oriented data
Focuses on the analysis of data for decision makers. Provides a simple view around a particular subject by excluding irrelevant or not useful data.
59
Integrated data
Data is combined (integrated) with data from multiple heterogeneous data sources in a clean and consistent way
60
Time-variant data
The ability to store data as it existed at multiple points in time.
61
Non-volatile data
This means that data is not updated once it is in the data warehouse; only new data is added (for example a new point in time).
62
Six data warehouse architectures
- Direct BI on source systems
- Canned DW
- Independent data marts with bus architecture
- Enterprise DW
- Hub & spoke (enterprise DW + data marts) - most popular
- Federated DW
63
DW Development approaches
Bottom-up - start with data marts for each business unit, then integrate those
Top-down - start with one large DW, then make smaller marts for each business unit
64
Data Lake
Database that holds raw data in its native format until it is needed.
65
Features of DW (Data, structure, processing, security, Users)
Data - cleaned, aggregated
Data structure - structured, processed
Processing - focus on write
Security - mature
Users - business professionals and data scientists
66
Features of Data Lake (Data, structure, processing, security, Users)
Data - raw
Data structure - unstructured
Processing - focus on read
Security - maturing
Users - data scientists
67
Data converting
Converting data types into other types, e.g. converting the integers 0 and 1 to the booleans true and false, or currency conversions.
68
Data cleansing
Validation and filtering of data
69
Integral load
In this case the ETL stage loads all the data (both changed and unchanged) into the DW at one time
70
Delta / Incremental load
Here the ETL stage only loads the changes since the previous load
71
Online Analytical Processing (OLAP)
Enables users to interactively analyze multi-dimensional data from multiple perspectives. It's an intuitive way to organize large amounts of data; the user acts as a sort of explorer. Data is structured as a cube.
72
Roll up (OLAP)
Aggregating measures to a higher dimension level (from quarter to year)
73
Drill down (OLAP)
Reverse of roll up (from months to days)
74
Slice & dice (OLAP)
Selecting subsets of cells that satisfy a certain condition
75
Pivots (OLAP)
Rotates the data axes
76
Lattice Cubes
The lattice of cuboids: all possible aggregation combinations (cuboids) of a data cube, ordered from the base cuboid (no aggregation) up to the apex cuboid (fully aggregated).
77
OLAP Software basics
OLAP software is based on:
Measures - the numerical data you want to compute
Dimensions - the different perspectives on those measures; more categorical data
Dimension hierarchies - the structure behind the dimensions
78
Star scheme
A table layout with a fact table (containing numeric data) in the middle, with dimension tables connected to it (containing categorical data). This scheme limits the number of required SQL joins, increasing performance.
79
Snowflake scheme
Variant / refinement of the star scheme. Some dimensional data is normalized into smaller dimension tables.
80
Central fact table
The central table in a star or snowflake scheme. Multi-valued, since all numeric data is stored within the fact table.
81
Dimension table
The surrounding tables in a star or snowflake scheme. They are single-valued, with meaningless surrogate keys.
82
Surrogate key
Primary keys that replace the original (operational) primary key, in an effort to save storage space. A dimension table has many columns and always a 1:M relation with the fact table.
83
Dimensions
The different perspectives on the measures in a cube (e.g. time, region, product); typically categorical data.
84
Conforming dimensions
Dimension tables that are shared between multiple fact tables.
85
Data mining
The process of discovering new, valuable knowledge in databases. Can be hypothesis-driven (test a claim) or discovery-driven (explore for knowledge). Draws on machine learning, statistics and databases.
86
Big data’s five V's
Volume - Long and wide
Velocity - High speed
Variety - Lots of different data types: numbers, texts, images, etc.
Veracity - Quality issues
Value - Should be valuable to the organization
87
Deduction
Example:
All cows give milk.
Betsie is a cow.
Therefore, Betsie gives milk.
88
Abduction
Example:
All cows give milk.
Betsie gives milk.
Therefore, Betsie is a cow.
(This is the wrong way around, of course.)
89
Induction
Betsie is a cow and gives milk.
Clarabelle is a cow and gives milk.
Therefore, all cows give milk.
Data mining with machine learning works in a similar way.
90
Regression analysis
The analysis of the statistical relationship between variables: Y = alpha + beta * X + residual
91
R-square
A number between 0 and 1 that indicates the goodness of fit: 1 is perfect, 0 is the worst.
92
Ordinary least Squares (OLS)
Also referred to as linear regression; a method of estimation used in linear regression. It minimizes the (squared) errors associated with each predicted value of Y. Squares are used because otherwise positive and negative deviations would cancel each other out.
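The OLS estimates for simple linear regression (Y = alpha + beta * X + residual) have a closed form; a minimal sketch with made-up data:

```python
def ols(xs, ys):
    # Closed-form least-squares solution for simple linear regression:
    # beta = cov(X, Y) / var(X), alpha = mean(Y) - beta * mean(X).
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 1 + 2x, so residuals are zero
print(ols(xs, ys))  # (1.0, 2.0)
```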
93
Regression model selection
Forward selection - create a model with one independent variable and add more until you only generate more noise.
Backward selection - create a model with all the independent variables and delete them until you are left with the ones that influence the dependent variable.
94
K-nearest neighbors
A classification method. Based on a training set, you build a model that can determine the class of new data. You also split a validation set from the training set to determine the value of K; after determining the best K, use the testing set as a final check.
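A minimal sketch of the classification step (majority vote among the K nearest points, using Euclidean distance); the training points and labels are invented for illustration, and the validation/test splitting described above is omitted:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    # train: list of ((features...), label) pairs. Classify the query
    # point by majority vote among its k nearest neighbours.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2), k=3))  # A
```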
95
Euclidean distance
To calculate the distance between points A and B: square root of (Xb - Xa)^2 + (Yb - Ya)^2. It is based on the Pythagorean theorem.
96
Manhattan distance
To calculate the distance between A and B: |Xb - Xa| + |Yb - Ya|
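Both distance measures are one-liners in Python; the points here are illustrative:

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of the summed squared
    # coordinate differences (Pythagorean theorem), any dimension.
    return math.sqrt(sum((bi - ai) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(bi - ai) for ai, bi in zip(a, b))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```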
97
Choosing value of K
If K is too small, the model will be sensitive to noise and not representative of the class. If K is too large, the model may include points from other classes. To optimize K, test a range of values and check which performs best.
98
Curse of dimensionality
In case of a high number of dimensions (with many attributes), everything is 'far' apart. This results in no nearby points unless an extremely large amount of data is available (exponential in the number of dimensions). With 10 binary dimensions you have 2^10 = 1024 subcubes, so you need more than 1024 data points to have on average one per subcube.
99
Naive Bayes formula
P(Y|X) = P(X|Y) * P(Y) / P(X)
OR, with multiple attributes:
P(Y|X1...Xn) = P(X1...Xn|Y) * P(Y) / P(X1...Xn)
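A quick numeric check of the formula, with made-up probabilities; P(X) is obtained via the law of total probability:

```python
p_y = 0.4             # prior P(Y)
p_x_given_y = 0.5     # likelihood P(X|Y)
p_x_given_not_y = 0.25

# Total probability: P(X) = P(X|Y)P(Y) + P(X|not Y)P(not Y)
p_x = p_x_given_y * p_y + p_x_given_not_y * (1 - p_y)

# Bayes' rule: P(Y|X) = P(X|Y) * P(Y) / P(X)
p_y_given_x = p_x_given_y * p_y / p_x
print(p_y_given_x)  # 4/7, roughly 0.571
```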
100
Precision
The fraction of correct predictions among all positive predictions. Precision of 80% means that 80% of the positive predictions were correct. Formula: TP / (TP + FP)
101
Recall
The fraction of correct class predictions over the actual number of items in that class. Recall of 80% means that 80% of the items of a class were found. Formula: TP / (TP + FN)
102
F1-Measure
Combination of precision and recall in one score. Formula: (2 * Precision * Recall) / (Precision + Recall)
103
ROC Curve
The ROC Curve is a plot of the recall (true positive rate) against the false positive rate (1 - specificity).
104
Decision Tree
A data-driven classification method that is easy to explain and can easily be transformed into a ruleset
105
Information Gain
A method to determine the best way to split data in a decision tree. Calculate the gains M0 - M12 and M0 - M34 using the Gini index; whichever is larger indicates the better split.
106
Gini Index
1 - (the probability of "yes")^2 - (the probability of "no")^2. The lower, the better the split.
107
Entropy Measure
Similar to the Gini index, but the proportion of instances belonging to each class k is multiplied by its log base 2, and the results are summed and negated.
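Both impurity measures can be sketched for a two-class ("yes"/"no") split; lower values mean a purer split:

```python
import math

def gini(p_yes):
    # 1 - P(yes)^2 - P(no)^2; 0 for a pure node, 0.5 at worst (binary).
    return 1 - p_yes ** 2 - (1 - p_yes) ** 2

def entropy(p_yes):
    # -sum over classes of p_k * log2(p_k); 0 pure, 1 at worst (binary).
    return -sum(p * math.log2(p) for p in (p_yes, 1 - p_yes) if p > 0)

print(gini(0.5), gini(1.0))  # 0.5 0.0  (maximally impure vs pure)
print(entropy(0.5))          # 1.0
```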
108
Recall Formula
tp/(tp+fn)
109
Precision Formula
tp/(tp+fp)
110
Accuracy Formula
(tp+tn)/(tp+tn+fp+fn) or (tp+tn) / n
111
F1-Measure Formula
2*(Precision * Recall) / (Precision + Recall)
112
Error Rate Formula
(fp + fn) / N
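The five formulas above can be checked together on made-up confusion-matrix counts:

```python
# Invented counts for illustration: 8 true positives, 2 false positives,
# 2 false negatives, 8 true negatives.
tp, fp, fn, tn = 8, 2, 2, 8
n = tp + fp + fn + tn

precision = tp / (tp + fp)                              # 0.8
recall = tp / (tp + fn)                                 # 0.8
f1 = 2 * (precision * recall) / (precision + recall)    # 0.8
accuracy = (tp + tn) / n                                # 0.8
error_rate = (fp + fn) / n                              # 0.2

print(precision, recall, accuracy, error_rate)
```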
113
Univariate
Statistical analysis with a type of data which consists of observations on only a single characteristic or attribute. For example, the salaries of workers in an industry
114
Multivariate
Statistical analysis with a type of data which consists of observations on two or more characteristics or attributes. For example, the salaries, ages and education levels of workers in an industry.
115
Binary Split
Trees with only binary (two-way) splits are called binary trees
116
Multiway split
Trees with more than two-way splits are called Multiway trees
117
Overfitting
When you make a model that fits every point of the training data perfectly but fails to capture the underlying structure resulting in bad predictive or classification power on testing data.
118
Underfitting
When you make a model that doesn’t fit the training data enough so that it is unable to capture the underlying (actual) structure
119
Parametric
An analytical method that assumes that the underlying data follows a normal distribution with homogeneous sample variances
120
Non-parametric
An analytical method that DOES NOT assume that the underlying data follows a normal distribution with homogeneous sample variances
121
Supervised training
A training method that requires you to already know the actual class labels for the training set, allowing you to train the model to produce those classes on testing data.
122
Unsupervised training
A training method that does not require you to know the actual class for the training set.
123
Antecedent
Something that existed before, or logically precedes something else
124
Consequent
Something that exists after, or logically follows something else (following as a result / effect)
125
Item
A single attribute value pair. For example; Outlook = sunny
126
Itemset
All items occurring in a transaction or record. For example: Outlook = sunny, Humidity = high, PlayTennis = No
127
Frequent itemset
An itemset that has minimal support of K (where K is predefined by the user)
128
Association rule
A rule that follows the IF - THEN format. For example: IF humidity = high AND play = no THEN windy = false AND outlook = sunny
129
Apriori Property
The property that every subset of a frequent itemset is also frequent. It underlies the Apriori algorithm, a seminal algorithm for mining frequent itemsets for Boolean association rules. Example: if (milk, bread, jam) is a frequent itemset, then (milk, bread) is also a frequent itemset.
130
Interesting association rules
Association rules can be interesting for two reasons: objective measures & subjective measures
131
Objective association rule measure
An association rule is objectively interesting if it has: support, confidence, lift
132
Subjective association rule measure
An association rule is subjectively interesting if it is unexpected or actionable
133
(Association rule) Support Formula
frequency (X, Y) / N
134
(Association rule) Confidence Formula
frequency (X, Y) / frequency (X)
135
(Association rule) Lift Formula
Support(X, Y) / (Support(X) * Support(Y))
OR Confidence / Prob(RHS)
OR Prob(LHS and RHS) / (Prob(LHS) * Prob(RHS))
136
Lift value meaning
If lift = 0, LHS and RHS never occur together. If lift = 1, LHS and RHS are independent. If lift > 1, there is a positive relation between LHS and RHS (LHS is a strong indicator for RHS). If lift < 1, there is a negative relation: LHS and RHS occur together less often than expected under independence.
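Support, confidence and lift can be computed over a toy transaction list (items invented for illustration); this particular rule comes out with lift below 1, i.e. a slight negative association:

```python
# Rule under test: X -> Y with X = {milk}, Y = {bread}.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "jam"},
    {"milk"},
    {"bread"},
]
n = len(transactions)
X, Y = {"milk"}, {"bread"}

support_xy = sum(1 for t in transactions if X | Y <= t) / n  # freq(X,Y)/N = 2/4
support_x = sum(1 for t in transactions if X <= t) / n       # 3/4
support_y = sum(1 for t in transactions if Y <= t) / n       # 3/4

confidence = support_xy / support_x          # (2/4)/(3/4) = 2/3
lift = support_xy / (support_x * support_y)  # 8/9, slightly below 1

print(support_xy, confidence, lift)
```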
137
Confusion Matrix order
Top row: true negative, false positive
Bottom row: false negative, true positive
138
Decision tree node meaning
Red node = non-acceptor, blue node = acceptor. Left branch = non-acceptor, right branch = acceptor. A red node showing 21 [20, 1] means 21 records were classified as non-acceptor: 20 were truly non-acceptor (true negatives) and 1 was actually an acceptor (false negative).
139
Clustering
Unsupervised classification method where data is divided into natural groups such that objects in the same cluster are similar and objects in different clusters are dissimilar.
140
Clustering methods
Partitional algorithms: construct various partitions and then evaluate them by some criterion
Hierarchical algorithms: create a hierarchical decomposition of the objects using some criterion
141
Partitional Clustering Method
Pre-specify a desired number of clusters, i.e. 'K'
Assign a cluster to each object (K clusters)
Minimise the sum of distances within clusters
Maximise the sum of distances between clusters
In this course only K-means is considered
142
Centroid
A virtual object that is used to represent all the physical objects in a cluster
143
K-Means clustering
Partitional clustering approach:
Each cluster is associated with a centroid (center point of the cluster)
Each point is assigned to the cluster with the closest centroid
The basic algorithm is very simple
144
K-Means clustering algorithm
1. Select K points as the initial centroids (at random)
REPEAT
2. Form K clusters by assigning all points to the closest centroid
3. Recompute the centroid of each cluster
UNTIL the centroids don't change
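A minimal sketch of the loop above; for reproducibility this version takes the first K points as initial centroids instead of picking them at random, and runs a fixed number of iterations rather than checking for unchanged centroids:

```python
import math

def k_means(points, k, iters=20):
    centroids = list(points[:k])  # deterministic init (card says random)
    for _ in range(iters):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as its cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centroids

# Two obvious blobs; the centroids converge to the blob centers.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(sorted(k_means(points, k=2)))  # (4/3, 4/3) and (25/3, 25/3)
```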
145
Hierarchical Clustering Method
Hierarchical decomposition of the objects using some criterion. Begin with a matrix with the distances between all the object pairs. Then continue with bottom-up approach (agglomerative)
146
Bottom-Up (agglomerative) Approach
1. Put each item in its own cluster
REPEAT
2. Merge the best-matching pair into a new cluster
UNTIL all clusters are fused together
147
Single linkage distance
Determined by the distance of the two closest records (nearest neighbors) in the different clusters. Tends to produce long chains / elongated clusters.
148
Complete linkage distance
Determined by the maximum distance between any two records in the different clusters. Tends to produce spherical clusters with consistent diameter.
149
Centroid linkage distance
Calculated as the distance between the centroids of the different clusters.
150
Ward’s method
Starts with n clusters, each containing a single object. These n clusters are combined until one cluster contains all objects. At each step, the process forms the new cluster that minimizes variance, measured by an index called E (error sum of squares).
151
Dendrogram (Binary tree)
A treelike diagram that summarizes the clustering process. The records are drawn on the horizontal (X) axis; the vertical axis reflects the distance at which records or clusters merge. Use a cutoff value to limit the number of clusters.
152
Average linkage distance
Calculated as the average distance between all pairs of records between the different clusters. Is less affected by outliers.
153
Aspects to consider when validating clusters
Cluster interpretability
Cluster stability
Cluster separation
Number of clusters
154
Retrospective analytics
Descriptive analytics and diagnostic analytics. Methods: database management, data warehousing framework, OLAP databases & dashboards.
155
Prospective analytics
Predictive & prescriptive analytics. Methods: the data mining process and all the Python models.
156
Static Typing
Variable types are checked at compile time. The variable itself is defined as a certain type; all values of the variable need to be of that type.
157
Dynamic Typing
Variable types are checked at run time. The variable is not defined as a certain type; the value of the variable is assigned a type.
158
Interpreter
Program that directly executes instructions written in a programming language
159
Compiler
Program that transforms source code into a lower-level (machine) language
160
Inmon approach
Top-down, complex, subject- or data-driven. An enterprise-wide (atomic) DW feeds departmental DBs.
161
Kimball approach
Bottom-up, simple, process-oriented, dimensional modelling. The 'DW' is a collection of data marts. Data marts model single business processes and conformed dimensions.
162
Transient data source
Name for the supporting databases that store intermediate transformation results (ETL-stage)
163
Transform stage
Simple data conversions (time notation, etc) Complex type conversions (categorical data, address formats) Currency conversions Language conversions
164
Dimension Hierarchy
The hierarchy in a dimension. For example, the hierarchy in the dimension time: All, Year, Month, Day
166
Calculate number of cubes in a multi-dimensional database
(dimension levels + 1) ^ (number of dimensions)
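A quick sanity check of the formula, assuming every dimension has the same number of hierarchy levels (the +1 accounts for the fully aggregated "All" level):

```python
def n_cuboids(levels_per_dimension, dimensions):
    # Each dimension can sit at any of its levels, or be aggregated away.
    return (levels_per_dimension + 1) ** dimensions

print(n_cuboids(3, 2))   # e.g. time (Year/Month/Day) x one other dimension: 16
print(n_cuboids(1, 10))  # 2^10 = 1024, the subcube count from the
                         # curse-of-dimensionality card
```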
167
Star scheme: restating
Overwrite old values. Use with infrequent changes (e.g. a customer's address) or to correct errors. Kills history (previous values are overwritten).
168
Nominal Data
AKA Categorical data. Values themselves only serve as label. E.g. Sunny, Overcast, Rain
169
Ordinal Data
Imposes order, but no distance between values can be derived. E.g. Hot, Mild, Cold
170
Interval Data
Imposes order, and distance between values can be derived. However, there is no natural zero point. E.g. temperature in Fahrenheit or the current year.
171
Ratio Data
Imposes order, and distance between values can be derived. Also has a natural zero point. E.g. length, weight.
172
SEMMA
Sample, Explore, Modify, Model, Assess
173
CRISP-DM
Cross Industry Standard Process for Data Mining
Step 1: Business understanding
Step 2: Data understanding
Step 3: Data preparation
(Steps 1-3 take about 85% of project time.)
Step 4: Model building
Step 5: Testing and evaluation
Step 6: Deployment
174
Laplace Smoothing
Prevents nullification of the Naive Bayes formula by preventing any probability from being exactly zero. Formula: (Count(A, C) + 1) / (Count(C) + number of values of the attribute)
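A one-line sketch of the smoothing formula, showing that an attribute value never observed with a class still gets a nonzero probability (the counts are invented):

```python
def laplace(count_ac, count_c, n_values):
    # (count(A,C) + 1) / (count(C) + number of distinct attribute values):
    # unseen value-class pairs get a small nonzero probability instead of 0,
    # so one factor can never nullify the whole Naive Bayes product.
    return (count_ac + 1) / (count_c + n_values)

# Value A never seen with class C (count 0), 10 class-C examples,
# attribute has 3 distinct values:
print(laplace(0, 10, 3))  # 1/13, roughly 0.077, instead of 0
```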
175
Mean Absolute Error / Deviation (MAE)
(1/n) * ∑ |e(i)|
176
Mean Percentage Error (MPE)
100 * (1/n) * ∑( e(i) / y(i) )
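Both error measures as short functions, with invented errors and actuals; note that MPE is signed, so positive and negative errors can partly cancel, whereas MAE cannot:

```python
def mae(errors):
    # Mean absolute error: (1/n) * sum of |e_i|.
    return sum(abs(e) for e in errors) / len(errors)

def mpe(errors, actuals):
    # Mean percentage error: 100 * (1/n) * sum of e_i / y_i (signed).
    return 100 * sum(e / y for e, y in zip(errors, actuals)) / len(errors)

errors = [2, -1, 3]
actuals = [10, 10, 10]
print(mae(errors))           # 2.0
print(mpe(errors, actuals))  # 40/3, roughly 13.3
```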