Chapter 11: Intro to Data Mining Flashcards

1
Q

With either CRISP-DM or SEMMA, it is important to fully understand which of the following aspects before preparing the data and selecting analysis techniques:
- the surrounding socioeconomic climate
- business goals
- underlying issues for the business
- political implications

A
  • the surrounding socioeconomic climate
  • business goals
  • underlying issues for the business
2
Q

What is a popular systematic approach to managing and conducting data mining projects?

A

Cross-Industry Standard Process for Data Mining (CRISP-DM) Methodology

3
Q

Why is the CRISP-DM methodology often preferred to other methodologies?

A

Because it emphasizes understanding business goals and objectives before preparing the data and choosing analysis techniques

4
Q

Data mining uses many kinds of computational algorithms to identify hidden patterns and relationships in data. For developing predictive models, one tends to employ ____ data mining techniques.

A

supervised

5
Q

What is the key distinction for supervised data mining techniques?

When are they best used?

A

target variable is identified

used for developing predictive models

6
Q

What is the key distinction for unsupervised data mining techniques?

When are they best used?

A

no target variable is identified

effective for data exploration, dimension reduction, and pattern recognition

7
Q

What are common applications of supervised data mining?

A
  • classification model (target variable is categorical; predict the class membership of new cases)
  • prediction model (target variable is numerical; predict the target value for a new case)
8
Q

What is the term used to describe computer systems that demonstrate human-like intelligence and cognitive functions, such as deduction, pattern recognition, and the interpretation of complex data?

A

artificial intelligence

9
Q

What are common applications of unsupervised data mining?

A
  • dimension reduction (convert high-dimensional data into data with a smaller number of variables)
  • pattern recognition (recognizing patterns in data using machine learning techniques)
10
Q

What are the 6 phases of the CRISP-DM methodology?

A
  1. business understanding
  2. data understanding
  3. data preparation
  4. modeling
  5. evaluation
  6. deployment
11
Q

____ measures gauge whether a group of observations are similar or dissimilar to one another

A

similarity

12
Q

What is one of the most widely used measures for evaluating similarity with numerical variables? It is defined as the length of a straight line between two observations.

A

Euclidean distance
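As an illustration of the "length of a straight line" definition, here is a minimal Python sketch. The function name and the two sample observations are made up for this example and do not come from the chapter.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two observations with numerical variables."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    # Square the difference on each variable, sum, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical observations measured on two numerical variables.
obs_1 = [3.0, 4.0]
obs_2 = [0.0, 0.0]
print(euclidean_distance(obs_1, obs_2))  # 5.0
```

A smaller distance means the two observations are more similar. In practice the variables are often standardized first so that no single variable dominates the sum.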

13
Q

Data ____ is the process of dividing a data set into a training, a validation, and, in some situations, an optional test data set.

A

partitioning

14
Q

What is the formula for the matching coefficient?

A

(Number of variables with matching outcomes)/(Total number of variables)
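The ratio can be checked with a small Python sketch. It assumes each observation is a list of categorical (here binary) values; the function name and sample data are hypothetical.

```python
def matching_coefficient(a, b):
    """Share of variables on which two observations have the same outcome."""
    if len(a) != len(b) or not a:
        raise ValueError("observations must be non-empty and equally long")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Hypothetical observations described by five binary variables.
obs_1 = [1, 0, 1, 1, 0]
obs_2 = [1, 0, 0, 1, 1]
print(matching_coefficient(obs_1, obs_2))  # 0.6 (3 of 5 variables match)
```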

15
Q

What are the 3 partitions created in data partitioning?

A
  • training set
  • validation set
  • optional test data set
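A minimal sketch of random partitioning in Python follows. The 50/30/20 split, the seed, and the function name are arbitrary choices for illustration only; the cards do not prescribe specific percentages.

```python
import random

def partition(rows, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split rows into training, validation, and (leftover) test sets."""
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # empty when the two fractions sum to 1
    return train, valid, test

train, valid, test = partition(list(range(100)))
print(len(train), len(valid), len(test))  # 50 30 20
```

Setting the two fractions so they sum to 1 yields an empty test set, which matches the card's description of the test set as optional.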
16
Q

Recall that data partitioning is the process of dividing a data set into a training, a validation, and an optional test data set. As a common practice, in the oversampling technique, which data set is oversampled?

A

training data set

17
Q

What is the formula for Jaccard’s coefficient?

A

(Number of variables with matching positive outcomes)/((Total number of variables)-(Number of variables with matching negative outcomes))
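The formula above can be sketched in Python as follows, assuming binary variables coded 1 (positive) and 0 (negative). The function name and sample observations are hypothetical.

```python
def jaccard_coefficient(a, b):
    """Matching positives divided by the variables that are not matching negatives."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    both_positive = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    both_negative = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    denominator = len(a) - both_negative
    if denominator == 0:
        raise ValueError("undefined when every variable is a matching negative")
    return both_positive / denominator

# Hypothetical observations described by five binary variables.
obs_1 = [1, 0, 1, 1, 0]
obs_2 = [1, 0, 0, 1, 1]
print(jaccard_coefficient(obs_1, obs_2))  # 0.5 (2 matching positives / (5 - 1))
```

Unlike the matching coefficient, this measure ignores cases where both observations are negative, which is useful when a shared absence carries little information.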

18
Q

What are two methods used to detect overfitting and provide objective assessment of the predictive performance of models? Select all that apply!

  • data mining
  • data partitioning
  • oversampling
  • cross-validation
A
  • data partitioning
  • cross-validation
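One common form of cross-validation is k-fold, sketched below in Python. The cards do not say which variant the chapter uses, so treat k-fold, the function name, and the interleaved fold assignment as assumptions made for illustration.

```python
def k_fold_indices(n, k):
    """Yield (training indices, validation indices) for each of k folds."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of observations")
    # Assign observation i to fold i % k (shuffle the data first in practice).
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid

for train, valid in k_fold_indices(n=10, k=5):
    print(len(train), valid)
```

Each observation is used for validation exactly once, so the model is always assessed on data it was not fitted to, which is what exposes overfitting.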
19
Q

A common practice in data partitioning is to partition some percent of the data into the training data set and some percent of the data into the validation data set. Which of the answers below is consistent with the percentage of data in the training data set and the percentage of data in the validation data set?

20
Q

The ____ technique involves intentionally selecting more samples from one class than from the other class or classes in order to adjust the class distribution of a data set.

A

oversampling
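A minimal Python sketch of one possible way to oversample: rows of the rare class are drawn again, with replacement, until the classes are balanced. Balancing to equal counts, the function name, and the sample rows are all assumptions for illustration; the chapter may use a different scheme.

```python
import random

def oversample(rows, label_of, target_class, seed=1):
    """Add randomly repeated target-class rows until both groups are the same size."""
    rng = random.Random(seed)
    target = [r for r in rows if label_of(r) == target_class]
    others = [r for r in rows if label_of(r) != target_class]
    if not target:
        raise ValueError("no rows of the target class to sample from")
    # Draw extra target-class rows with replacement to close the gap.
    extra = [rng.choice(target) for _ in range(max(0, len(others) - len(target)))]
    return others + target + extra

# Hypothetical training rows: (id, class) with 2 positives and 8 negatives.
rows = [("a", 1), ("b", 1)] + [(str(i), 0) for i in range(8)]
balanced = oversample(rows, label_of=lambda r: r[1], target_class=1)
print(sum(1 for r in balanced if r[1] == 1), sum(1 for r in balanced if r[1] == 0))  # 8 8
```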

21
Q

____ occurs when a predictive model is made overly complex to fit the quirks of given sample data. By making the model conform too closely to the sample data, its predictive power is compromised.

A

overfitting

22
Q

What is the name of the chart that shows the improvement that a predictive model provides over a random selection in capturing the target class cases?

A

cumulative lift chart

23
Q

Know the formulas for performance measures

A

In Class Notes

24
Q

What is the term for a table that summarizes classification outcomes obtained from the validation data set?

A

Confusion Matrix
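For a two-class problem, the four cells of the table can be tallied as in this Python sketch. The function name and the validation labels are invented for the example.

```python
def confusion_matrix(actual, predicted, target=1):
    """Count true/false positives and negatives for a two-class problem."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == target and p == target:
            counts["TP"] += 1  # target case correctly classified
        elif a == target:
            counts["FN"] += 1  # target case missed
        elif p == target:
            counts["FP"] += 1  # non-target case flagged as target
        else:
            counts["TN"] += 1  # non-target case correctly classified
    return counts

# Hypothetical validation-set labels and model predictions.
actual = [1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 0, 0]
print(confusion_matrix(actual, predicted))
# {'TP': 2, 'FN': 2, 'FP': 1, 'TN': 3}
```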

25
Q

When is the RMSE performance measure most desirable?

A

when large errors are particularly undesirable
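The sensitivity to large errors can be seen in a short Python sketch. RMSE here is the square root of the mean squared prediction error; the function name and the two sets of predictions are made up for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and equally long")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

actual = [0, 0, 0, 0]
# Both sets of predictions miss by a total of 4 units in absolute terms.
print(rmse(actual, [1, 1, 1, 1]))  # 1.0 (four small errors)
print(rmse(actual, [0, 0, 0, 4]))  # 2.0 (one large error)
```

Because each error is squared before averaging, one large miss raises RMSE more than several small misses of the same total size.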

26
Q

What is the name of the chart that shows the improvement that a predictive model provides over a random selection but presents the information in 10 equal-sized intervals (e.g., every 10% of the observations)?

A

decile-wise lift chart
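The values plotted in such a chart could be computed as in the Python sketch below. It assumes observations are ranked by a model score, that the number of observations is a multiple of 10, and that lift is the target-class rate in a decile divided by the overall rate. The function name and data are hypothetical.

```python
def decile_lifts(actual, scores):
    """Lift for each of 10 equal-sized groups, ranked from highest to lowest score."""
    n = len(actual)
    if n == 0 or n % 10 != 0 or len(scores) != n:
        raise ValueError("need equally long inputs with a multiple of 10 rows")
    total_targets = sum(actual)
    if total_targets == 0:
        raise ValueError("no target class cases in the data")
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda p: p[0], reverse=True)]
    size = n // 10
    lifts = []
    for d in range(10):
        group = ranked[d * size:(d + 1) * size]
        # (targets in decile / decile size) / (total targets / n)
        lifts.append(sum(group) * n / (size * total_targets))
    return lifts

# Hypothetical validation data: 20 cases, 4 in the target class (coded 1).
scores = [(20 - i) / 20 for i in range(20)]
actual = [1, 1, 1, 0, 1, 0] + [0] * 14
print(decile_lifts(actual, scores))
# [5.0, 2.5, 2.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

A lift above 1 in the first deciles means the model captures target class cases faster than random selection would.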