Business Intelligence & Data Management Flashcards
Data
Unprocessed raw facts, items that are the most elementary descriptions of things, events, activities, and transactions.
Information
Organized data that has meaning and value
Knowledge
Processed data or information that is applicable to a business decision problem
Structured data
Data that has structure, what you typically find in a database/sheet
Unstructured data
Data that has no structure, that is complex. Data that you find ‘in the wild’. E.g. images, sound, video, written text.
Data management
discipline that focuses on the proper generation, storage and retrieval of data
Database
A shared and integrated computer structure that stores a collection of end user data and metadata.
Database management System (DBMS)
Collection of programs that manages the DB structure and controls access to the database.
Single user database
Database that supports only one user at a time. E.g. desktop database
Multiuser database
Database that supports multiple users at a time. E.g. Workgroup database or enterprise database.
Centralized database
Database for which the data is stored at a single site.
Distributed database
Database for which the data is stored across several sites.
Operational database
Also known as a transactional or production database. A database that supports an organization's day-to-day operations.
Data warehouse (description)
A system with the primary focus of storing data to be used to generate information to make tactical or strategic decisions.
Semi-structured data
Data that already has some structure (e.g. tags or markers) but does not conform to a rigid schema; data that you have already prepared to some extent.
Extensible Markup Language (XML)
A special language used to represent and manipulate data elements in a textual format.
XML Database
Database that supports the storage and management of semi-structured XML data
Database design
Refers to the activities that focus on the design of a database. If done properly, it requires the designer to identify precisely the database’s expected use.
Data model
A simple representation of a more complex real world data structure.
Entity
Any physical or conceptual object about which data is to be collected or stored.
Attribute
Characteristics of an entity (e.g. weight, class, size, value, name)
Relationship
The association between entities, characterized by its cardinality. Can be:
One-to-many (1:M)
Many-to-many (M:N)
One-to-one (1:1)
Constraint
Restriction placed on an attribute in a database. E.g. Employee age must be between 16 and 99.
Table
Contains rows and columns. Is a persistent representation of logical relations.
Relational table
Table for which each row can be uniquely identified
Primary key
Attribute or combination of attributes that uniquely identifies an entity (or row in a database). A primary key is also a candidate key and a superkey. The primary key is always underlined.
Key
One or more attributes that determine other attributes. E.g. an order ID determines which products were bought by whom.
Composite key
A uniquely identifying key that can only uniquely identify an entity because it is comprised of two or more non-uniquely identifying keys. For example, a postal code + house number to identify a house.
Functional dependence
Attribute Y is functionally dependent on attribute X if X determines Y.
Full functional dependence
Attribute Y is fully functionally dependent on (composite) attribute X if X determines Y, but Y is not determined by any proper subset of X.
Super key
Key that uniquely identifies each row and determines all of the entity's attributes.
Candidate key
A minimal (irreducible) superkey. A superkey that does not contain a subset of attributes that is itself a superkey.
Entity integrity
Each primary key value must be unique (and not null) to ensure that each entity is uniquely identified by its primary key.
Controlled redundancy
How relational databases work: by placing foreign keys in tables, you can refer to an entity in another table. The foreign key value is present both in the referring and the referred table, thus storing redundant information in a controlled way.
Referential integrity
If the foreign key contains a value, that value must refer to an existing, valid tuple (row) in another relation.
Secondary key
A key used strictly for data retrieval purposes. It does not necessarily identify a unique row (multiple entities can share the same secondary key value).
One-to-many (1:M) relationship
For example, a mother (1) can have multiple children (M).
In this case, each child has one (1) mother, but she can have multiple (M) children.
One-to-one (1:1) relationship
For example a car can (at one time) only be driven by one person, and that person can only drive one car (at one time).
Many-to-many (M:N) relationships
For example a company can have multiple employees and each employee can have multiple employers (jobs).
In order to model this relationship in a database you need a composite entity: a table that records each unique combination.
Composite entity
AKA bridge entity or associative entity. A (linking) table used to link two other tables that have an M:N relationship. The composite entity contains the primary keys of the other two tables, indicating each unique relationship.
Normalization
The process of evaluating and correcting table structures to minimize data redundancies. There are 4+1 levels of normalization (normal forms). It is not always required to go to the highest level; more normalization means more join operations.
Denormalization
The process of lowering the normal form. For example, 3NF can be converted to 2NF through denormalization. It decreases the number of operations required to manage the database (increasing performance), but increases data redundancy.
Data redundancy anomalies
There are three types:
- Update anomalies
- Insertion anomalies
- Deletion anomalies
If the same data is stored in multiple tables, you can accidentally forget to update all copies of the data, accidentally delete data you still need, etc.
Internal data
Data that comes from within the company
External data
Data that comes from elsewhere but is useful for a company
Business Intelligence
Is the combination of data warehousing and descriptive analytics. It is an umbrella term that combines processes, technology and tools.
Business Analytics
Is the combination of predictive and prescriptive analytics. However, in this course both terms refer to the same thing.
Database system
Consists of: Data, Software, Hardware, Users
Relational database
Allows data to be grouped into tables and defines relationships between tables.
Structured Query Language (SQL)
Most popular querying language for relational databases. SQL is used at the back end and front end of many database management systems.
Join (SQL)
Joins two tables together in several different ways:
(Inner) Join - returns the entries that are shared between the two tables
Left/Right outer join - returns the entire left/right table plus the shared entries
Full outer join - returns both tables in full
Minus - returns the rows of one table except those it has in common with the other
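A minimal sketch of these join types using pandas (the tool choice and table contents are assumptions for illustration; the course likely uses plain SQL):

    import pandas as pd

    # Two hypothetical tables sharing the key column "customer_id"
    customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cas"]})
    orders = pd.DataFrame({"customer_id": [2, 3, 4], "amount": [10.0, 25.0, 7.5]})

    inner = customers.merge(orders, on="customer_id", how="inner")  # only shared keys (2 and 3)
    left = customers.merge(orders, on="customer_id", how="left")    # whole left table plus matches
    full = customers.merge(orders, on="customer_id", how="outer")   # both tables in full

    # "Minus": rows of the left table that have no match in the right table
    minus = customers[~customers["customer_id"].isin(orders["customer_id"])]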
Production stage
Consists of:
Different platforms and databases. Internal, external, (un)structured data, inconsistent data, limited history
Extraction Transformation Load (ETL) stage
Stage that extracts data from operational / production stage. Transforms data to make it fit the warehouse. Then loads into warehouse.
Data warehouse (makeup)
Relational DBMS
Is of high quality, subject-oriented, integrated, time-variant and nonvolatile.
Data mart
Subset (or small warehouse) of a data warehouse to support a specific Business Unit
Metadata
Data about data. For example the location, meaning or origin of other data.
Business intelligence (front-end) application
Used for:
Querying and reporting
Data mining
Data visualization
Subject-oriented data
Focuses on the analysis of data for decision makers
Provides a simple view around a particular subject by excluding irrelevant / not useful data
Integrated data
Data is combined (integrated) with data from multiple heterogeneous data sources in a clean and consistent way
Time-variant data
The ability to store data as it existed at multiple points in time.
Non-volatile data
This means that data is not updated once it is in the data warehouse. Only new data is added (for example, a new point in time).
Six data warehouse architectures
- Direct BI on source systems
- Canned DW
- Independent data marts with bus architecture
- Enterprise DW
- Hub & Spoke (enterprise + data marts) - Most popular
- Federated DW
DW Development approaches
Bottom-up - Start with data marts for each BU, then integrate those
Top-down - Start with one large DW, then make smaller marts for each BU
Data Lake
Database that holds raw data in its native format until it is needed.
Features of DW (Data, structure, processing, security, Users)
Data Warehouse:
Data - Cleaned, aggregated
Structure - Structured, processed
Processing - Focus on write
Security - Mature
Users - Business professionals and data scientists
Features of Data Lake (Data, structure, processing, security, Users)
Data Lake:
Data - Raw
Structure - Unstructured
Processing - Focus on read
Security - Maturing
Users - Data scientists
Data converting
Converting data types into other types. For example, converting the integers 0 and 1 to the booleans true and false, or converting currencies.
Data cleansing
Validation and filtering of data
Integral load
In this case the ETL stage loads all the data (both changed and unchanged) into the DW at one time
Delta / Incremental load
Here the ETL stage only loads the changes since the previous load
Online Analytical Processing (OLAP)
Enables users to interactively analyze multi-dimensional data from multiple perspectives. It is an intuitive way to organize large amounts of data; the user acts as a sort of explorer. Data is structured as a cube.
Roll up (OLAP)
Aggregating measures to a higher dimension level (from quarter to year)
Drill down (OLAP)
Reverse of roll-up (e.g. from months to days)
Slice & dice (OLAP)
Selecting subsets of cells that satisfy a certain condition
Pivots (OLAP)
Rotates the data axes
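A rough pandas illustration of roll-up, drill-down, slice and pivot (an assumption: real OLAP engines operate on pre-built cubes, and all column names here are invented):

    import pandas as pd

    # Hypothetical fact data: one row per sale
    sales = pd.DataFrame({
        "year": [2023, 2023, 2024, 2024],
        "quarter": ["Q1", "Q2", "Q1", "Q2"],
        "region": ["EU", "US", "EU", "US"],
        "revenue": [100, 150, 120, 130],
    })

    roll_up = sales.groupby("year")["revenue"].sum()                  # quarter -> year
    drill_down = sales.groupby(["year", "quarter"])["revenue"].sum()  # back down to quarters
    slice_eu = sales[sales["region"] == "EU"]                         # slice: fix one dimension
    pivot = sales.pivot_table(values="revenue", index="region",
                              columns="year", aggfunc="sum")          # rotate the axes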
Lattice Cubes
The set of all cuboids (one for every combination of dimension aggregation levels), ordered from the most detailed base cuboid to the fully aggregated apex cuboid, forming a lattice.
OLAP Software basics
OLAP Software is based on:
Measures - the numerical data (the things you want to compute or aggregate)
Dimensions - the different perspectives on those measures, typically categorical data
Dimension hierarchies - the structure behind the dimensions
Star scheme
A table layout with a fact table (containing numeric data) in the middle, with dimension tables connected to it (containing categorical data).
This scheme limits the number of required SQL joins, increasing performance.
Snowflake scheme
Variant / refinement of the star scheme. Some dimensional data is normalized into smaller dimension tables.
Central fact table
The central table in a star or snowflake scheme. Multi-valued since all numeric data is stored within the fact table.
Dimension table
The surrounding tables in a star or snowflake scheme. They are single-valued and use meaningless surrogate keys.
Surrogate key
Primary keys that replace the original (operational) primary keys, in an effort to save storage space. The dimension table identified by a surrogate key has many columns and always a 1:M relation with the fact table.
Dimensions
The different perspectives on the measures, typically categorical data (e.g. time, location, product).
Conforming dimensions
Dimension tables that are shared between multiple fact tables.
Data mining
The process of discovering new valuable knowledge in databases. Can be both hypothesis (test a claim) and discovery (explore for knowledge) driven.
Draws on machine learning, statistics and databases.
Big data’s five V’s
Volume - Long and wide (many rows and columns)
Velocity - High speed
Variety - lots of different data: numbers, text, images, etc.
Veracity - Quality issues
Value - Should be valuable to the organization
Deduction
Example:
All cows give milk.
Betsie is a cow.
Therefore, Betsie gives milk.
Abduction
Example:
All cows give milk.
Betsie gives milk.
Therefore, Betsie is a cow.
This reasoning runs the wrong way around, of course: the conclusion is only a plausible hypothesis.
Induction
Example:
Betsie is a cow and gives milk.
Clarabelle is a cow and gives milk.
Therefore, all cows give milk.
Data mining with machine learning works in a similar way: it generalizes from examples.
Regression analysis
The analysis of statistical relationship between variables:
Y = alpha + beta * X + residual
R-square
A number between 0 and 1 that indicates the goodness of fit: 1 is perfect, 0 is the worst.
Ordinary least Squares (OLS)
Also referred to as linear regression; a method of estimation used in linear regression.
Minimizes the sum of squared errors between the predicted and actual values of Y.
Squares are used because otherwise positive and negative deviations would cancel each other out.
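A small NumPy sketch of OLS with one predictor (the data points are made up; alpha, beta and the R-square follow the definitions above):

    import numpy as np

    # Hypothetical data roughly following Y = alpha + beta * X + residual
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

    # Closed-form OLS estimates of slope and intercept
    beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    alpha = y.mean() - beta * x.mean()

    # R-square: 1 - residual sum of squares / total sum of squares
    residuals = y - (alpha + beta * x)
    r_squared = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)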
Regression model selection
Forward selection
Create a model with one independent variable and keep adding variables until additional ones only add noise.
Backward selection
Create a model with all the independent variables and remove them until you are left with only those that influence the dependent variable.
K-nearest neighbors
A classification method: based on a training set you build a model that can determine the class of new data. You also split a validation set from the training set to determine the value of K. After determining the best K, use the testing set as a final check.
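A minimal k-NN sketch with scikit-learn (the library, dataset and split sizes are assumptions, not part of the card):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    # Split off a test set, then split the remainder into training and validation sets
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

    # Pick K using accuracy on the validation set
    best_k = max(range(1, 16),
                 key=lambda k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val))

    # Final check on the held-out testing set
    final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    print(best_k, final_model.score(X_test, y_test))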
Euclidian distance
To calculate the distance between point A and B:
Square root of: (Xb-Xa)^2 + (Yb-Ya)^2
It is based on the Pythagorean theorem.
Manhattan distance
To calculate the distance between A and B:
|Xb - Xa| + |Yb - Ya|
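Both distance measures as tiny Python functions for 2-D points (purely illustrative):

    import math

    def euclidean(a, b):
        # sqrt((xb - xa)^2 + (yb - ya)^2)
        return math.sqrt((b[0] - a[0]) ** 2 + (b[1] - a[1]) ** 2)

    def manhattan(a, b):
        # |xb - xa| + |yb - ya|
        return abs(b[0] - a[0]) + abs(b[1] - a[1])

    print(euclidean((1, 1), (4, 5)))  # 5.0
    print(manhattan((1, 1), (4, 5)))  # 7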
Choosing value of K
If K is too small, the model will be sensitive to noise and not representative of the class.
If K is too large, the model may include points from other classes.
To find the best K, test a range of values on the validation set and pick the one that performs best.
Curse of dimensionality
With a high number of dimensions (many attributes), everything is 'far' apart. As a result there are no nearby points unless an extremely large amount of data is available (exponential in the number of dimensions). Splitting each of 10 dimensions in two already gives 2^10 = 1024 subcubes; you need more than 1024 data points to have, on average, one per subcube.
Naive Bayes formula
P(Y|X) = (P(X|Y) * P(Y)) / P(X)
OR
P(Y|X1, X2, ..., Xn) = P(X1, X2, ..., Xn|Y) * P(Y) / P(X1, X2, ..., Xn)
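A hand-rolled sketch of the computation for one binary class with two attributes (all probabilities are invented training-set estimates):

    # Hypothetical estimates from a training set
    p_yes, p_no = 9 / 14, 5 / 14                # priors P(Y)
    p_sunny_yes, p_sunny_no = 2 / 9, 3 / 5      # P(outlook = sunny | Y)
    p_high_yes, p_high_no = 3 / 9, 4 / 5        # P(humidity = high | Y)

    # Unnormalized posteriors: P(Y | X1, X2) is proportional to P(X1|Y) * P(X2|Y) * P(Y)
    score_yes = p_sunny_yes * p_high_yes * p_yes
    score_no = p_sunny_no * p_high_no * p_no

    # Normalizing the two scores replaces dividing by P(X1, X2)
    p_yes_given_x = score_yes / (score_yes + score_no)
    print(p_yes_given_x)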
Precision
The fraction of predictions for a class that are correct.
A precision of 80% means that 80% of the predictions of that class were correct.
Formula:
True positives / (true positives + false positives)
Recall
The fraction of items of a class that were correctly predicted as that class.
A recall of 80% means that 80% of the items of that class were found.
Formula:
True positives / (true positives + false negatives)
F1-Measure
Combination of precision and recall in one score
Formula: (2 * Precision * Recall) / (Precision + recall)
ROC Curve
The ROC curve is a plot of the recall (true positive rate) against the false positive rate (1 - specificity).
Decision Tree
A data-driven classification method that is easy to explain and can easily be transformed into a ruleset
Information Gain
A method to determine the best way to split data in a decision tree. Compare the impurity before the split (M0) with the weighted impurity after each candidate split (e.g. M12 or M34), computed with e.g. the Gini index; the split with the largest reduction in impurity is the best.
Gini Index
1 - (the probability of "yes")^2 - (the probability of "no")^2 (in general: 1 minus the sum of the squared class probabilities).
The lower the index, the better (purer) the split.
Entropy Measure
Similar to the Gini index, but each class proportion p(k) is multiplied by its base-2 logarithm: entropy = -sum over k of p(k) * log2(p(k)).
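A short sketch of both impurity measures for one candidate split (class counts are invented):

    import math

    def gini(counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    def entropy(counts):
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    # Impurity before the split vs. weighted impurity after the split
    parent = [5, 5]                  # 5 "yes", 5 "no"
    left, right = [4, 1], [1, 4]     # the two child nodes of a candidate split
    weighted = (5 / 10) * gini(left) + (5 / 10) * gini(right)
    gain = gini(parent) - weighted   # information gain of this split (higher is better)
    print(gini(parent), weighted, gain)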
Recall Formula
tp/(tp+fn)
Precision Formula
tp/(tp+fp)
Accuracy Formula
(tp+tn)/(tp+tn+fp+fn) or (tp+tn) / n
F1-Measure Formula
2*(Precision * Recall) / (Precision + Recall)
Error Rate Formula
(fp + fn) / N
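The metric formulas above in one short sketch, starting from hypothetical confusion-matrix counts:

    tp, tn, fp, fn = 40, 30, 10, 20                       # made-up counts
    n = tp + tn + fp + fn

    recall = tp / (tp + fn)                               # 0.67
    precision = tp / (tp + fp)                            # 0.80
    accuracy = (tp + tn) / n                              # 0.70
    error_rate = (fp + fn) / n                            # 0.30
    f1 = 2 * (precision * recall) / (precision + recall)  # 0.73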
Univariate
Statistical analysis of data consisting of observations on only a single characteristic or attribute. For example, the salaries of workers in an industry.
Multivariate
Statistical analysis of data consisting of observations on two or more characteristics or attributes. For example, the salaries and ages of workers in an industry.
Binary Split
Trees with only binary (two-way) splits are called binary trees
Multiway split
Trees with more than two-way splits are called multiway trees
Overfitting
A model that fits every point of the training data perfectly but fails to capture the underlying structure, resulting in poor predictive or classification power on testing data.
Underfitting
A model that does not fit the training data well enough, so that it is unable to capture the underlying (actual) structure.
Parametric
An analytical method that assumes the underlying data follow a normal distribution and have homogeneous sample variances.
Non-parametric
An analytical method that does NOT assume the underlying data follow a normal distribution and homogeneous sample variances.
Supervised training
A training method that requires the actual class of the training examples to be known, so the model can be trained to reproduce those classes on testing data.
Unsupervised training
A training method that does not require you to know the actual class for the training set.
Antecedent
Something that existed before, or logically precedes something else
Consequent
Something that exists after, or logically follows something else (following as a result / effect)
Item
A single attribute value pair. For example; Outlook = sunny
Itemset
All items occurring in a transaction or record. For example: Outlook = sunny, Humidity = high, PlayTennis = No
Frequent itemset
An itemset whose support is at least K (where K is predefined by the user)
Association rule
A rule that follows the IF (antecedent) - THEN (consequent) format.
For example: IF humidity = high AND play = no THEN windy = false AND outlook = sunny
Apriori Property
The property that every subset of a frequent itemset must itself be frequent. It underlies the Apriori algorithm, a seminal algorithm for mining frequent itemsets for Boolean association rules.
Example:
If (milk bread jam) is a frequent itemset, then (milk bread) is also a frequent itemset
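A tiny sketch of the property: if an itemset is frequent, every subset of it has at least the same support (the transactions are made up):

    from itertools import combinations

    transactions = [{"milk", "bread", "jam"}, {"milk", "bread"}, {"milk", "jam"}, {"bread"}]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    frequent = {"milk", "bread"}  # support 0.5 in these transactions
    # Apriori property: every non-empty subset is at least as frequent
    assert all(support(set(sub)) >= support(frequent)
               for r in range(1, len(frequent))
               for sub in combinations(frequent, r))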
Interesting association rules
Association rules can be interesting for two reasons:
Objective measures & Subjective measures
Objective association rule measure
An association rule is objectively interesting if it has sufficient:
Support
Confidence
Lift
Subjective association rule measure
An association rule is subjectively interesting if it:
Is unexpected
Is actionable
(Association rule) Support Formula
frequency (X, Y) / N
(Association rule) Confidence Formula
frequency (X, Y) / frequency (X)
(Association rule) Lift Formula
Support(X, Y) / (Support(X) * Support(Y)) OR Confidence / Prob(RHS) OR Prob(LHS and RHS) / (Prob(LHS) * Prob(RHS))
Lift value meaning
If Lift = 0, LHS and RHS never occur together.
If Lift = 1, LHS and RHS are independent.
If Lift > 1, there is a positive association between LHS and RHS (LHS is a strong indicator for RHS).
If Lift < 1, there is a negative association (the presence of LHS makes RHS less likely).
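The three measures computed for one hypothetical rule (milk -> bread) over made-up transactions:

    transactions = [{"milk", "bread"}, {"milk", "bread"}, {"milk"}, {"bread"}, {"jam"}]
    n = len(transactions)

    freq_xy = sum({"milk", "bread"} <= t for t in transactions)  # 2
    freq_x = sum("milk" in t for t in transactions)              # 3
    freq_y = sum("bread" in t for t in transactions)             # 3

    support = freq_xy / n                                        # 0.4
    confidence = freq_xy / freq_x                                # 0.67
    lift = support / ((freq_x / n) * (freq_y / n))               # 0.4 / 0.36 = 1.11 (> 1)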
Confusion Matrix order
True Negative, False Positive
False Negative, True Positive
Decision tree node meaning
Red node = Non acceptor
Blue node = Acceptor
Left = Non Acceptor
Right = Acceptor
Red node shows:
21
[20,1]
This means 21 records were classified as non-acceptor: 20 were truly non-acceptors (true negatives) and 1 was actually an acceptor (false negative).
Clustering
Unsupervised classification method where data is divided into natural groups such that objects in the same cluster are similar and objects in different clusters are dissimilar.
Clustering methods
Partitional algorithms:
Construct various partitions and then evaluate them by some criterion
Hierarchical algorithms:
Create a hierarchical decomposition of the objects using some criterion
Partitional Clustering Method
Pre-specify a desired number of clusters, i.e. 'K'
Assign a cluster to each object (K clusters)
Minimise sum of distances within clusters
Maximise sum of distances between clusters
In this course only K-means is considered
Centroid
A virtual object that is used to represent all the physical objects in a cluster
K-Means clustering
Partitional clustering approach
Each cluster is associated with a centroid (center point of the cluster)
Each point is assigned to the cluster with the closest centroid
Basic algorithm is very simple
K-Means clustering algorithm
- Select K points as the initial centroids (at random)
REPEAT - Form K clusters by assigning all points to the closest centroid
- Recompute the centroid of each cluster
UNTIL the centroids don’t change
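A minimal NumPy version of the algorithm above for 2-D points (the data, K and the choice of the first K points as initial centroids are arbitrary):

    import numpy as np

    def k_means(points, k, n_iter=100):
        centroids = points[:k].copy()  # initial centroids: here simply the first K points
        for _ in range(n_iter):
            # Assign every point to the cluster with the closest centroid
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = np.argmin(dists, axis=1)
            # Recompute each centroid as the mean of its cluster
            new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
            if np.allclose(new_centroids, centroids):  # stop when the centroids no longer change
                break
            centroids = new_centroids
        return labels, centroids

    points = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.8], [5.2, 4.8]])
    print(k_means(points, k=2))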
Hierarchical Clustering Method
Hierarchical decomposition of the objects using some criterion.
Begin with a matrix with the distances between all the object pairs. Then continue with bottom-up approach (agglomerative)
Bottom-Up (agglomerative) Approach
1. Put each item in its own cluster
REPEAT
2. Find the best (closest) pair of clusters and merge them into a new cluster
UNTIL all clusters are fused together
Single linkage distance
Determined by the distance of the two closest records (nearest neighbors) in the different clusters.
Tends to produce long chains / elongated clusters.
Complete linkage distance
Determined by the maximum distance between any two records in the different clusters.
Tends to produce spherical clusters with consistent diameter.
Centroid linkage distance
Calculated as the distance between the centroids of the different clusters.
Ward’s method
Starts with n clusters, each containing a single object. These n clusters are combined to make one cluster containing all objects. At each step, the process makes a new cluster that minimizes variance, measured by an index called E (Error sum of squares)
Dendrogram (Binary tree)
Is a treelike diagram that summarizes the process of clustering. The records are drawn on the X-axis (horizontal); the height of the vertical lines reflects the distance between records or clusters. Use a cutoff value to limit the number of clusters.
Average linkage distance
Calculated as the average distance between all pairs of records between the different clusters.
Is less affected by outliers.
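A SciPy sketch showing the linkage methods named above plus a distance cutoff (library choice and data are assumptions):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    points = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [9.0, 9.0]])

    # method can be "single", "complete", "average", "centroid" or "ward"
    Z = linkage(points, method="average")

    # Cut the dendrogram at a chosen distance to obtain flat clusters
    labels = fcluster(Z, t=2.0, criterion="distance")
    print(labels)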
Aspects to consider when validating clusters
Cluster interpretability
Cluster stability
Cluster separation
Number of clusters
Retrospective analytics
Descriptive analytics and diagnostic analytics
Methods: Database management, Data warehousing framework, OLAP databases & Dashboards
Prospective analytics
Predictive & Prescriptive analytics
Methods: Data Mining Process and all the python models
Static Typing
Variable type is checked at compile-time
The variable itself is defined as a certain type. All values of the variable need to be of that type.
Dynamic Typing
Variable type is checked at run-time
The variable is not defined as a certain type. The value of the variable is assigned a type
Interpreter
Program that directly executes instructions written in a programming language
Compiler
Program that transforms source code into a lower-level (machine) language
Inmon approach
Top-down, Complex, Subject- or data-driven
Enterprise-wide (atomic) DW, Feeds departmental DB’s
Kimball approach
Bottom-up, Simple, Process-oriented, Dimensional modelling. 'DW' is a collection of Data Marts.
Data marts model single business processes and share conformed dimensions.
Transient data source
Name for the supporting databases that store intermediate transformation results (ETL-stage)
Transform stage
Simple data conversions (time notation, etc)
Complex type conversions (categorical data, address formats)
Currency conversions
Language conversions
Dimension Hierarchy
The hierarchy within a dimension. For example, the time dimension: All, Year, Month, Day
Calculate number of cubes in a multi-dimensional database
(Number of levels per dimension + 1) ^ number of dimensions
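Worked example (made-up numbers): with 3 dimensions, each having 3 hierarchy levels below 'All', the cube contains (3 + 1)^3 = 64 sub-cubes.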
Star scheme: restating
Overwrite old values. Use with infrequent changes (e.g. a customer's address) or to correct errors. Kills history (previous values are overwritten).
Nominal Data
AKA Categorical data. Values themselves only serve as label. E.g. Sunny, Overcast, Rain
Ordinal Data
Imposes order, but no distance between values can be derived. E.g. Hot, Mild, Cold
Interval Data
Imposes order, and distance between values can be derived. However, there is no natural zero point. E.g. temperature in Fahrenheit or the current year.
Ratio Data
Imposes order, and distance between values can be derived. Also has a natural zero point. E.g. length, weight.
SEMMA
Sample, Explore, Modify, Model, Assess
CRISP-DM
Cross Industry Standard Process for Data Mining
Step 1: Business Understanding
Step 2: Data Understanding
Step 3: Data Preparation
(Steps 1-3 take about 85% of project time.)
Step 4: Model Building
Step 5: Testing and Evaluation
Step 6: Deployment
Laplace Smoothing
Prevents nullification of the Naive Bayes formula by preventing any conditional probability from being zero.
Formula: (Count(A, C) + 1) / (Count(C) + number of possible values of the attribute)
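Worked example (invented counts): if Outlook = sunny occurs 0 times among the 5 'no' records and Outlook has 3 possible values, the smoothed estimate is (0 + 1) / (5 + 3) = 0.125 instead of 0.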
Mean Absolute Error / Deviation (MAE)
(1/n) * ∑ |e(i)|
Mean Percentage Error (MPE)
100 * (1/n) * ∑( e(i) / y(i) )
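Both error measures as a short sketch over hypothetical actual and predicted values:

    actual = [100, 120, 80, 90]
    predicted = [110, 115, 70, 95]
    errors = [a - p for a, p in zip(actual, predicted)]   # e(i) = actual - predicted

    n = len(errors)
    mae = sum(abs(e) for e in errors) / n                       # mean absolute error = 7.5
    mpe = 100 * sum(e / a for e, a in zip(errors, actual)) / n  # mean percentage error
    print(mae, mpe)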