Week 4 Flashcards

Question 1

Q

What is Data Mining?

Answer

A

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.

Question 2

Q

What is the most critical ingredient for data mining?

Question 3

Q

What types of data can be used?

Answer

A

Structured and unstructured

Question 4

Q

What is often the data source?

Answer

A

A consolidated data warehouse

Question 5

Q

Who is often the data miner?

Answer

A

An end user

Question 6

Q

What is essential for data mining tools?

Answer

A

The capabilities and ease of use of the tools.

Question 7

Q

What are predictions in data mining?

Answer

A

Tell the nature of future occurrences of certain events based on what has happened in the past, such as predicting the winner of the Super Bowl or forecasting the absolute temperature of a particular day.

Question 8

Q

What are associations in data mining?

Answer

A

Find the commonly co-occurring groupings of things, such as beer and diapers going together in market-basket analysis.

Question 9

Q

What are clusters in data mining?

Answer

A

Identify natural groupings of things based on their known characteristics, such as assigning customers to different segments based on their demographics and past purchase behaviors.

Question 10

Q

What types of techniques are part of prediction?

Answer

A

Classification and Regression

Question 11

Q

What types of techniques are part of association?

Answer

A

Link analysis and Sequence analysis

Question 12

Q

What types of techniques are part of clustering?

Answer

A

Outlier analysis

Question 13

Q

What do DM and statistics each start with?

Answer

A

DM starts with a loosely defined discovery statement; statistics starts with a well-defined proposition and hypothesis.

Question 14

Q

What set of data do DM and statistics each use?

Answer

A

DM uses all existing data to discover novel patterns/relationships; statistics collects a sample of data to test the hypothesis.

Question 15

Q

What are measures of dispersion?

Answer

A

Degree of variation in a given variable
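A minimal sketch of three common dispersion measures (range, variance, standard deviation), using Python's standard statistics module. The sample values are made up for illustration, and the population forms (pvariance, pstdev) are an assumption; the card itself does not name specific measures.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: distance between the largest and smallest value.
value_range = max(values) - min(values)

# Population variance: mean squared deviation from the mean.
variance = statistics.pvariance(values)

# Population standard deviation: square root of the variance.
std_dev = statistics.pstdev(values)

print(value_range, variance, std_dev)
```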

Question 16

Q

What is regression used for?

Answer

A

Regression is used to characterize relationships between explanatory (input) and response (output) variables.

Question 17

Q

What does R² provide?

Answer

A

Information about the fit of the model: the higher the R², the more of the variability in the response variable is explained by the model.
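A small sketch of how R² can be computed for a one-variable linear regression fitted by ordinary least squares. The function names (fit_line, r_squared) and the data points are hypothetical, chosen only to illustrate the idea; R² is taken here as 1 minus the ratio of residual to total sum of squares.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variability in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical explanatory (xs) and response (ys) values.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(slope, intercept, r2)
```

An R² close to 1 means the line accounts for most of the variation in the response; a value near 0 means it explains little.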

Question 18

Q

What are the two types of statistics in business analytics?

Answer

A

Descriptive: Describing the data
Inferential: drawing inferences about the population based on sample data (regression)

Question 19

Q

What is supervised learning?

Answer

A

Uses labeled datasets to train algorithms to classify data or predict outcomes accurately. Because the correct outputs are known, the model can measure its accuracy and learn over time.

Question 20

Q

What is unsupervised learning?

Answer

A

Uses machine learning algorithms to analyze and cluster unlabeled data sets.

Question 21

Q

What is a characteristic of labeled data?

Answer

A

The values for input as well as output variables are known.

Question 22

Q

What is a simple split?

Answer

A

Split the data into two mutually exclusive sets: training and testing (for ANNs, also a validation set).

Question 23

Q

What is K-fold Cross Validation?

Answer

A

Split the data into k mutually exclusive subsets. Use each subset as testing while using the rest of the subsets as training.
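The procedure above can be sketched in a few lines of plain Python. The helper name k_fold_splits is hypothetical, and the sketch assigns items to folds in round-robin order without shuffling, which is an assumption made to keep the output predictable; in practice the data are usually shuffled first.

```python
def k_fold_splits(items, k):
    """Yield (train, test) pairs; each fold serves as the test set exactly once."""
    # Round-robin assignment gives k mutually exclusive subsets.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Training data is everything outside the current test fold.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


data = list(range(10))
splits = list(k_fold_splits(data, 5))

for train, test in splits:
    print("train:", train, "test:", test)
```

A model would be trained and evaluated once per pair, and the k test scores averaged to estimate its accuracy.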

Question 24

Q

What is classification?

Answer

A

Learning from past data to classify new data; the output variable is categorical.

Question 25

Q

What is the confusion matrix?

Answer

A

Primary source for accuracy estimation in classification.

Question 26

Q

How is the accuracy measured?

Answer

A

(TP + TN) / (TP + TN + FP + FN)

Question 27

Q

How is the True Positive Rate measured?

Answer

A

TP / (TP+FN)

Question 28

Q

How is the True Negative Rate measured?

Answer

A

TN / (TN+FP)

Question 29

Q

How is the precision measured?

Answer

A

TP / (TP+FP)
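The four measures from the confusion matrix can be computed together. This is a minimal sketch: the function name classification_metrics and the counts passed to it are hypothetical, and accuracy is taken as the correctly classified cases (TP + TN) divided by all cases.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, TPR, TNR, and precision from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,          # correct predictions over all cases
        "true_positive_rate": tp / (tp + fn),   # share of actual positives found
        "true_negative_rate": tn / (tn + fp),   # share of actual negatives found
        "precision": tp / (tp + fp),            # share of predicted positives that are right
    }


# Hypothetical counts for a two-class problem with 100 cases.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(name, round(value, 3))
```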

Question 30

Q

What are decision trees?

Answer

A

Supervised mining algorithms that employ a divide-and-conquer method.

Question 31

Q

On what 3 criteria do decision tree algorithms differ?

Answer

A

Splitting criteria
Stopping criteria
Pruning

Question 32

Q

What are the splitting criteria in decision trees?

Answer

A

Which variable, what value, etc.

Question 33

Q

What are the pruning criteria in decision trees?

Answer

A

Pre-pruning vs post-pruning

Question 34

Q

What is the Gini Index?

Answer

A

Measure of inequality in a distribution. It determines the purity of a specific class as a result of a decision to branch along a particular attribute/value.

Question 35

Q

What is often the best split?

Answer

A

The best split is often the one that increases the purity of the sets resulting from a proposed split.

Question 36

Q

What attribute/split combination is chosen to split the node?

Answer

A

The attribute/split combination that provides the smallest Gini split value is chosen to split the node.
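A short sketch of how a Gini split value might be computed for a binary split. It assumes the common formulation in which a node's Gini index is 1 minus the sum of squared class proportions, and the split value is the size-weighted average of the two child nodes; the function names and labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini index of one node: 0.0 means the node is pure."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_split(left, right):
    """Size-weighted Gini index of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# A split that separates the classes perfectly versus one that does not.
pure = gini_split(["yes", "yes"], ["no", "no"])
mixed = gini_split(["yes", "no"], ["yes", "no"])
print(pure, mixed)
```

The split with the smaller value (here the first one) leaves purer child nodes and would be preferred.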

Question 37

Q

What is cluster analysis?

Answer

A

Clustering is used for automatic identification of natural groupings of things. It learns the clusters of things from past data, then assigns new instances.

Question 38

Q

What is the output variable in cluster analysis?

Answer

A

There is no output variable

Question 39

Q

What do most cluster analysis methods use?

Answer

A

Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items.
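As an illustration, the sketch below uses Euclidean distance, which is one common choice and an assumption here, since the card does not name a specific measure. The helper nearest_center and the cluster centers are hypothetical; it assigns a new item to whichever center is closest.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_center(point, centers):
    """Index of the cluster center closest to the given point."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))


# Hypothetical cluster centers learned from past data.
centers = [(0.0, 0.0), (10.0, 10.0)]

# A new instance is assigned to the closest center.
assigned = nearest_center((1.0, 2.0), centers)
print(euclidean((0.0, 0.0), (3.0, 4.0)), assigned)
```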

Question 40

Q

What are the use cases of clustering?

Answer

A

Identify natural groupings of customers.
Identify rules for assigning new cases to classes for targeting/diagnostic purposes.
Provide characteristics, definition, and labeling of populations.
Decrease the size and complexity of problems for other data mining methods.
Identify outliers in a specific domain.

Question 41

Q

What is association rule mining?

Answer

A

Finds interesting relationships (affinities) between variables (items or events). There is no output variable, and it is a very popular DM method in business.

Question 42

Q

What is the input of association?

Answer

A

Input: the single point-of-sale transaction data

Question 43

Q

What is the output of association?

Answer

A

The most frequent affinities among items
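A rough sketch of the idea: count how often each pair of items appears in the same transaction and report the most frequent pair. The transactions are invented for illustration, and simple pair counting is a simplification, not a full association rule mining algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale transactions, one set of items per basket.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
]

# Count every unordered pair of items that co-occurs in a basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent affinity among items.
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```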
