Week 4 Flashcards

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.

1
Q

What is Data Mining?

A

The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.

2
Q

What is the most critical ingredient for data mining?

A

Data

3
Q

What types of data can be used?

A

Structured and unstructured

4
Q

What is often the data source?

A

A consolidated data warehouse

5
Q

Who often does the mining?

A

The end user

6
Q

What is essential for data mining tools?

A

The capabilities and ease of use of the tools.

7
Q

What are predictions in data mining?

A

Tell the nature of future occurrences of certain events based on what has happened in the past, such as predicting the winner of the Super Bowl or forecasting the absolute temperature of a particular day.

8
Q

What are associations in data mining?

A

Find the commonly co-occurring groupings of things, such as beer and diapers going together in market-basket analysis.

9
Q

What are clusters in data mining?

A

Identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their demographics and past purchase behaviors.

10
Q

What type of techniques are part of predictions?

A

Classification and Regression

11
Q

What type of techniques are part of Association?

A

Link analysis and Sequence analysis

12
Q

What type of techniques are part of Clustering?

A

Outlier analysis

13
Q

What do DM and statistics each start with?

A

DM starts with a loosely defined discovery statement; statistics starts with a well-defined proposition and hypothesis.

14
Q

What set of data do DM and statistics each use?

A

DM uses all existing data to discover novel patterns/relationships; statistics collects a sample of data to test the hypothesis.

15
Q

What are measures of dispersion?

A

Degree of variation in a given variable
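
A minimal sketch of a few common dispersion measures, using Python's standard `statistics` module. The sample values are hypothetical and not from the course material.

```python
import statistics

# Hypothetical sample of one numeric variable.
values = [4, 8, 15, 16, 23, 42]

# Range: distance between the largest and smallest value.
value_range = max(values) - min(values)

# Sample variance and standard deviation (n - 1 in the denominator).
variance = statistics.variance(values)
std_dev = statistics.stdev(values)

# Interquartile range: spread of the middle half of the data.
q1, _median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print(value_range, variance, std_dev, iqr)
```

Each measure summarizes how far the values spread out, in contrast to measures of centrality such as the mean.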

16
Q

What is regression used for?

A

Regression is used to characterize relationships between explanatory (input) and response (output) variables.

17
Q

What does R² provide?

A

Provides information about the fit of the model; the higher the R², the more variability is explained by the model.
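
As an illustration, the sketch below fits a simple one-variable linear regression by ordinary least squares and computes R² as 1 minus the ratio of residual to total sum of squares. The data points and the helper names `fit_line` and `r_squared` are made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variability in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical explanatory (input) and response (output) values.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
print(slope, intercept, r2)
```

Because these points lie close to a straight line, R² comes out just below 1; noisier data would give a lower value.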

18
Q

What are the two types of statistics in business analytics?

A

Descriptive: Describing the data
Inferential: Drawing inferences about the population based on sample data (e.g., regression)

19
Q

What is supervised learning?

A

Uses labeled datasets to train algorithms to classify data or predict outcomes accurately. The model can measure its accuracy and learn over time.

20
Q

What is unsupervised learning?

A

Uses machine learning algorithms to analyze and cluster unlabeled data sets.

21
Q

What is a characteristic of labeled data?

A

The values for input as well as output variables are known.

22
Q

What is a simple split?

A

Split the data into two mutually exclusive sets: training and testing (for ANN, also a validation set).
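
A minimal sketch of a simple split. The 70/30 ratio, the fixed seed, and the helper name `simple_split` are assumptions chosen for illustration; the card does not prescribe a ratio.

```python
import random


def simple_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records, then cut them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled records form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


train, test = simple_split(range(10))
print(len(train), len(test))
```

Every record lands in exactly one of the two sets, which is what "mutually exclusive" means here.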

23
Q

What is K-fold Cross Validation?

A

Split the data into k mutually exclusive subsets. Use each subset in turn for testing while using the remaining subsets for training.
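
The procedure can be sketched in a few lines. The helper name `k_fold_splits` is hypothetical, and this version assigns records to folds by position without shuffling, which a real experiment would usually do first.

```python
def k_fold_splits(records, k):
    """Yield (train, test) pairs, using each of the k folds as the test set once."""
    items = list(records)
    # Fold i takes every k-th record starting at position i.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


splits = list(k_fold_splits(range(10), k=5))
for train, test in splits:
    print(test, train)
```

With 10 records and k = 5, each round tests on 2 records and trains on the other 8, and every record is tested exactly once.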

24
Q

What is classification?

A

Learning from past data to classify new data; the output variable is categorical.

25
Q

What is the confusion matrix?

A

Primary source for accuracy estimation in classification.

26
Q

How is the accuracy measured?

A

(TP + TN) / (TP + TN + FP + FN)

27
Q

How is the True Positive Rate measured?

A

TP / (TP+FN)

28
Q

How is the True Negative Rate measured?

A

TN / (TN+FP)

29
Q

How is the precision measured?

A

TP / (TP+FP)
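
The accuracy, true positive rate, true negative rate, and precision formulas can be computed together from the four cells of a confusion matrix. The counts below are hypothetical, and `classification_metrics` is an illustrative helper, not part of any library.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four rates from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,          # correct predictions over all cases
        "true_positive_rate": tp / (tp + fn),   # share of actual positives found
        "true_negative_rate": tn / (tn + fp),   # share of actual negatives found
        "precision": tp / (tp + fp),            # share of predicted positives that are right
    }


# Hypothetical counts for a two-class problem.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

With these counts, accuracy is 0.85, the true positive rate is 0.8, and the true negative rate is 0.9. Note the parentheses around TP + TN in the accuracy formula: without them, only TN would be divided by the total.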

30
Q

What are decision trees?

A

Supervised mining algorithms that employ a divide-and-conquer method.

31
Q

On what 3 criteria do decision tree algorithms differ?

A

Splitting criteria
Stopping criteria
Pruning

32
Q

What are the splitting criteria in decision trees?

A

Which variable, what value, etc.

33
Q

What are the pruning criteria in decision trees?

A

Pre-pruning vs post-pruning

34
Q

What is the Gini Index?

A

Measure of inequality in a distribution. It determines the purity of a specific class as a result of a decision to branch along a particular attribute/value.

35
Q

What is often the best split?

A

The best split is often the one that increases the purity of the sets resulting from a proposed split.

36
Q

What attribute/split combination is chosen to split the node?

A

The attribute/split combination that provides the smallest Gini split value is chosen to split the node.
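
A small sketch of how candidate splits can be compared. It assumes the common Gini impurity definition (1 minus the sum of squared class proportions) and takes the Gini split value as the size-weighted average impurity of the resulting subsets. The labels and the two candidate splits are made up.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure set, higher for more mixed sets."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left, right):
    """Size-weighted average impurity of the two subsets produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical node with 5 "yes" and 5 "no" cases, and two candidate splits.
parent = ["yes"] * 5 + ["no"] * 5
split_a = (["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4)
split_b = (["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3)

print(gini(parent), gini_split(*split_a), gini_split(*split_b))
```

Split A yields purer subsets (about 0.32 versus about 0.48 for split B, down from 0.5 at the parent), so it would be the one chosen.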

37
Q

What is cluster analysis?

A

Clustering is used for automatic identification of natural groupings of things. It learns the clusters from past data, then assigns new instances to those clusters.

38
Q

What is the output variable in cluster analysis?

A

There is no output variable

39
Q

What do most cluster analysis methods use?

A

Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items.
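
As an example of a distance measure, the sketch below uses Euclidean distance (one common choice, assumed here) to compare every pair of items. The four two-dimensional points are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical items described by two numeric characteristics each.
points = {"A": (1.0, 1.0), "B": (1.5, 2.0), "C": (8.0, 8.0), "D": (9.0, 9.5)}

# Euclidean distance between every pair of items.
distances = {
    (p, q): math.dist(points[p], points[q])
    for p, q in combinations(points, 2)
}

# The closest pair is a natural candidate to end up in the same cluster.
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```

Here A and B are the closest pair and C and D the next closest, which matches the two groups visible in the coordinates.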

40
Q

What are the use cases of clustering?

A

Identify natural groupings of customers;
Identify rules for assigning new cases to classes for targeting/diagnostic purposes.
Provide characteristics, definition, labeling of populations
Decrease the size and complexity of problems for other data mining methods
Identify outliers in a specific domain.

41
Q

What is association rule mining?

A

Finds interesting relationships (affinities) between variables (items or events). There is no output variable, and it is a very popular DM method in business.

42
Q

What is the input of association?

A

Simple point-of-sale transaction data

43
Q

What is the output of association?

A

The most frequent affinities among items
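
A minimal sketch of the idea: count how often pairs of items occur together in point-of-sale baskets. The baskets are invented, and support and confidence are standard association-rule measures that these cards do not define.

```python
from collections import Counter
from itertools import combinations

# Hypothetical point-of-sale transactions (one set of items per basket).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]

# Count how many baskets contain each pair of items.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: share of all baskets that contain the pair.
support = {pair: count / len(transactions) for pair, count in pair_counts.items()}

# Confidence of "beer -> diapers": share of beer baskets that also contain diapers.
item_counts = Counter(item for basket in transactions for item in basket)
confidence_beer_to_diapers = pair_counts[("beer", "diapers")] / item_counts["beer"]

most_frequent_pair, _count = pair_counts.most_common(1)[0]
print(most_frequent_pair, support[most_frequent_pair], confidence_beer_to_diapers)
```

In this toy data, beer and diapers appear together in 3 of 5 baskets (support 0.6), and every basket with beer also contains diapers (confidence 1.0).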

44
Q
A