Chapter 11: Intro to Data Mining Flashcards
With either CRISP-DM or SEMMA, it is important to fully understand which of the following aspects before preparing the data and selecting analysis techniques:
- the surrounding socioeconomic climate
- business goals
- underlying issues for the business
- political implications
- the surrounding socioeconomic climate
- business goals
- underlying issues for the business
What is a popular systematic approach to managing and conducting data mining projects?
Cross-Industry Standard Process for Data Mining (CRISP-DM) Methodology
Why is the CRISP-DM methodology often preferred to other methodologies?
Because it emphasizes understanding business goals and objectives before preparing the data and choosing analysis techniques
Data mining uses many kinds of computational algorithms to identify hidden patterns and relationships in data. For developing predictive models, one tends to employ ____ data mining techniques.
supervised
What is the key distinction for supervised data mining techniques?
When are they best used?
target variable is identified
used for developing predictive models
What is the key distinction for unsupervised data mining techniques?
When are they best used?
no target variable is identified
effective for data exploration, dimension reduction, and pattern recognition
What are common applications of supervised data mining?
- classification model (target variable is categorical; predict the class membership of new cases)
- prediction model (target variable is numerical; predict the target value for a new case)
What is the term used to describe computer systems that demonstrate human-like intelligence and cognitive functions, such as deduction, pattern recognition, and the interpretation of complex data?
artificial intelligence
What are common applications of unsupervised data mining?
- dimension reduction (convert high-dimensional data into data with a smaller number of variables)
- pattern recognition (recognizing patterns in data using machine learning techniques)
What are the 6 phases of the CRISP-DM methodology?
- business understanding
- data understanding
- data preparation
- modeling
- evaluation
- deployment
____ measures gauge whether a group of observations are similar or dissimilar to one another
similarity
What is one of the most widely used measures for evaluating similarity with numerical variables? It is defined as the length of a straight line between two observations.
Euclidean distance
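Study sketch: the straight-line definition above can be computed as the square root of the summed squared differences across variables. The function name and the sample observations are hypothetical, chosen only for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two observations with numerical variables."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical observations, each described by two numerical variables.
obs_1 = (3.0, 4.0)
obs_2 = (0.0, 0.0)
distance = euclidean_distance(obs_1, obs_2)  # 5.0 (a 3-4-5 right triangle)
```

A smaller distance means the two observations are more similar. In practice the variables are often standardized first so that one variable's scale does not dominate, but that step is not shown here.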
Data ____ is the process of dividing a data set into a training, a validation, and, in some situations, an optional test data set.
partitioning
What is the formula for the matching coefficient?
(Number of variables with matching outcomes)/(Total number of variables)
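Study sketch: a minimal way to compute the matching coefficient for two observations described by categorical (here binary) variables. The function name and sample data are hypothetical.

```python
def matching_coefficient(a, b):
    """Share of variables on which two observations have the same outcome."""
    if len(a) != len(b) or not a:
        raise ValueError("observations must be non-empty and the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# Hypothetical observations with five binary variables each.
obs_a = [1, 0, 1, 1, 0]
obs_b = [1, 1, 1, 0, 0]
# Variables 1, 3, and 5 match, so the coefficient is 3/5.
score = matching_coefficient(obs_a, obs_b)  # 0.6
```

Note that matching 0s count as matches here, which is the main difference from Jaccard's coefficient.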
What are the 3 partitions created in data partitioning?
- training set
- validation set
- optional test data set
Recall that data partitioning is the process of dividing a data set into a training, a validation, and an optional test data set. As a common practice, in the oversampling technique, which data set is oversampled?
training data set
What is the formula for Jaccard’s coefficient?
(Number of variables with matching positive outcomes)/((Total number of variables)-(Number of variables with matching negative outcomes))
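Study sketch: the formula above applied to two binary observations, where 1 is treated as the positive outcome. The function name and data are hypothetical, and returning 0.0 when every variable is a matching negative is an assumption of this sketch (the formula itself is undefined in that case).

```python
def jaccard_coefficient(a, b, positive=1):
    """Matching positives divided by (total variables - matching negatives)."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    both_positive = sum(1 for x, y in zip(a, b) if x == positive and y == positive)
    both_negative = sum(1 for x, y in zip(a, b) if x != positive and y != positive)
    denominator = len(a) - both_negative
    if denominator == 0:
        return 0.0  # assumption: no positives anywhere, so report no similarity
    return both_positive / denominator

# Hypothetical observations with five binary variables each.
obs_a = [1, 0, 1, 1, 0]
obs_b = [1, 1, 1, 0, 0]
# Matching positives: 2 (variables 1 and 3); matching negatives: 1 (variable 5).
score = jaccard_coefficient(obs_a, obs_b)  # 2 / (5 - 1) = 0.5
```

For the same pair of observations the matching coefficient is 0.6, because it also counts the shared negative outcome as a match.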
What are two methods used to detect overfitting and provide objective assessment of the predictive performance of models? Select all that apply!
- data mining
- data partitioning
- oversampling
- cross-validation
- data partitioning
- cross-validation
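Study sketch: k-fold cross-validation is one common form of cross-validation. The data are split into k folds, and each fold takes a turn as the validation set while the remaining folds are used for training. The function below is a hypothetical helper that only produces the index splits; it does not shuffle the data or fit any model.

```python
def k_fold_indices(n_observations, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_observations:
        raise ValueError("k must be between 2 and the number of observations")
    indices = list(range(n_observations))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_observations // k + (1 if i < n_observations % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

# Hypothetical usage: 10 observations, 5 folds of 2 observations each.
folds = list(k_fold_indices(10, 5))
```

Averaging a performance measure over the k validation folds gives an assessment that does not depend on a single training/validation split.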
A common practice in data partitioning is to assign some percentage of the data to the training data set and the remaining percentage to the validation data set. Which of the answers below is consistent with common practice for the percentage of data in the training data set and the percentage in the validation data set?
60%; 40%
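Study sketch: a 60%/40% random partition into training and validation sets. The function name, the fixed seed, and the sample data are hypothetical; an optional test set would be carved out the same way with a second cut point.

```python
import random

def partition(observations, train_fraction=0.6, seed=1):
    """Randomly split observations into (training_set, validation_set)."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(observations)  # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data set of 100 observations (here just ID numbers).
data = list(range(100))
training_set, validation_set = partition(data)  # 60 and 40 observations
```

Shuffling before cutting matters when the rows are stored in a meaningful order (for example, sorted by date or by class).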
The ____ technique involves intentionally selecting more samples from one class than from the other class or classes in order to adjust the class distribution of a data set.
oversampling
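Study sketch: one simple way to oversample is to draw extra copies of the rare class, with replacement, until the classes are balanced. This is an assumption of the sketch, not the only way to adjust the class distribution. The function name, the "target" key, and the data are hypothetical. Following the common practice noted in the partitioning cards, it would be applied to the training data set only.

```python
import random

def oversample_minority(training_rows, label_key, minority_label, seed=1):
    """Return the training rows plus random duplicates of minority-class rows,
    so that the minority class is as large as the rest of the data."""
    rng = random.Random(seed)
    minority = [r for r in training_rows if r[label_key] == minority_label]
    majority = [r for r in training_rows if r[label_key] != minority_label]
    if not minority:
        raise ValueError("no rows with the minority label were found")
    extra_needed = max(0, len(majority) - len(minority))
    extras = [rng.choice(minority) for _ in range(extra_needed)]
    return majority + minority + extras

# Hypothetical training data: 90 cases of class 0 and 10 cases of class 1.
training = [{"target": 0} for _ in range(90)] + [{"target": 1} for _ in range(10)]
balanced = oversample_minority(training, "target", minority_label=1)
```

The validation set is left at its original class distribution so that performance measures reflect the data the model will actually see.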
____ occurs when a predictive model is made overly complex to fit the quirks of given sample data. By making the model conform too closely to the sample data, its predictive power is compromised.
overfitting
What is the name of the chart that shows the improvement that a predictive model provides over a random selection in capturing the target class cases?
cumulative lift chart
Know the formulas for performance measures
In Class Notes
What is the term for a table that summarizes classification outcomes obtained from the validation data set?
Confusion Matrix
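Study sketch: building the four cells of a confusion matrix for a binary target from actual and predicted classes in the validation set, then deriving accuracy from them. The function name and the sample labels are hypothetical.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary target."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must be the same length")
    cells = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            cells["TP"] += 1  # target class correctly predicted
        elif a == positive:
            cells["FN"] += 1  # target class missed
        elif p == positive:
            cells["FP"] += 1  # non-target case flagged as target
        else:
            cells["TN"] += 1  # non-target class correctly predicted
    return cells

# Hypothetical validation-set outcomes.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
cm = confusion_matrix(actual, predicted)  # TP=3, FN=1, FP=1, TN=3
accuracy = (cm["TP"] + cm["TN"]) / len(actual)  # 6/8 = 0.75
```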
When is the RMSE performance measure most desirable?
when large errors are particularly undesirable
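Study sketch: why RMSE suits cases where large errors are especially undesirable. Because errors are squared before averaging, one large miss raises RMSE more than several small misses of the same total size. Mean absolute error (MAE) is included only as a point of comparison and is not part of the card above; the function names and numbers are hypothetical.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square the errors, average, take the root."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: average of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 10, 10, 10]
many_small_errors = [12, 8, 12, 8]    # every prediction is off by 2
one_large_error = [10, 10, 10, 18]    # three exact, one off by 8

# Both sets of predictions have the same MAE (2.0) ...
mae_small, mae_large = mae(actual, many_small_errors), mae(actual, one_large_error)
# ... but RMSE is 2.0 for the small errors and 4.0 for the single large error.
rmse_small, rmse_large = rmse(actual, many_small_errors), rmse(actual, one_large_error)
```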
What is the name of the chart that shows the improvement that a predictive model provides over a random selection but presents the information in 10 equal-sized intervals (e.g., every 10% of the observations)?
decile-wise lift chart
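Study sketch: the values plotted in a decile-wise lift chart can be computed by ranking validation cases by predicted score, cutting them into 10 equal-sized groups, and dividing each group's rate of target-class cases by the overall rate. The function name and data are hypothetical, and for simplicity the sketch assumes the number of observations is a multiple of 10.

```python
def decile_lift(actual, scores, positive=1):
    """Lift for each of 10 equal-sized groups, ordered from highest score down."""
    n = len(actual)
    if n == 0 or n != len(scores) or n % 10 != 0:
        raise ValueError("sketch assumes equal lengths and a multiple of 10 cases")
    overall_rate = sum(1 for a in actual if a == positive) / n
    if overall_rate == 0:
        raise ValueError("no target-class cases, so lift is undefined")
    # Rank the actual classes by predicted score, highest first.
    ranked = [a for _, a in sorted(zip(scores, actual),
                                   key=lambda pair: pair[0], reverse=True)]
    size = n // 10
    lifts = []
    for d in range(10):
        group = ranked[d * size:(d + 1) * size]
        group_rate = sum(1 for a in group if a == positive) / size
        lifts.append(group_rate / overall_rate)
    return lifts

# Hypothetical validation set: 20 cases already listed from highest to lowest score.
scores = [i / 20 for i in range(20, 0, -1)]
actual = [1, 1, 1, 0, 1, 0] + [0] * 14   # 4 target-class cases out of 20
lifts = decile_lift(actual, scores)
# The top decile holds 2 target cases out of 2, versus 20% overall, so its lift is 5.
```

A lift above 1 in the first deciles means the model captures target-class cases better than random selection. The cumulative lift chart conveys the same comparison as a running total rather than in 10% intervals.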