Data Mining For Business Intelligence Book Flashcards
List the steps of the Data Modeling Process
1. Define purpose
2. Obtain data
3. Explore and clean data
4. Determine data modeling task
5. Choose data modeling methods
6. Apply methods, select final model
7. Evaluate performance
8. Deploy
Classification
Classification is perhaps the most basic form of data analysis. The recipient of an offer can respond or not respond. An applicant for a loan can repay on time, repay late, or declare bankruptcy. A credit card transaction can be normal or fraudulent.
A common task in data mining is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be.
Prediction
Prediction is similar to classification, except that we are trying to predict the value of a numerical variable (e.g., amount of purchase) rather than a class (e.g., purchaser or nonpurchaser).
The term prediction in this book refers to the prediction of the value of a continuous variable. (Sometimes in the data mining literature, the term estimation is used to refer to the prediction of the value of a continuous variable, and prediction may be used for both continuous and categorical data.)
Association Rules
Large databases of customer transactions lend themselves naturally to the analysis of associations among items purchased, or “what goes with what.”
Association rules, or affinity analysis, can then be used in a variety of ways. For example, grocery stores can use such information after a customer’s purchases have all
Predictive Analytics
Classification, prediction, and to some extent, affinity analysis constitute the analytical methods employed in predictive analytics.
Data Reduction
Sensible data analysis often requires distillation of complex data into simpler data. Rather than dealing with thousands of product types, an analyst might wish to group them into a smaller number of groups. This process of consolidating a large number of variables (or cases) into a smaller set is termed data reduction.
Data Exploration
Unless our data project is very narrowly focused on answering a specific question determined in advance (in which case it has drifted more into the realm of statistical analysis than of data mining), an essential part of the job is to review and examine the data to see what messages they hold, much as a detective might survey a crime scene.
Here, full understanding of the data may require a reduction in its scale or dimension to allow us to see the forest without getting lost in the trees. Similar variables (i.e., variables that supply similar information) might be aggregated into a single variable incorporating all the similar variables. Analogously, records might be aggregated into groups of similar records.
Data Visualization
Another technique for exploring data to see what information they hold is through graphical analysis. This includes looking at each variable separately as well as looking at relationships between variables.
For numerical variables, we use histograms and boxplots to learn about the distribution of their values, to detect outliers (extreme observations), and to find other information that is relevant to the analysis task.
For categorical variables, we use bar charts. We can also look at scatterplots of pairs of numerical variables to learn about possible relationships, the type of relationship, and, again, to detect outliers.
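As a rough illustration of what a boxplot reports for a numerical variable, the sketch below computes the quartiles and flags extreme observations. The 1.5 × IQR cutoff is a common boxplot convention assumed here, not something the flashcards specify, and the data values and the function name `boxplot_summary` are made up for the example.

```python
from statistics import quantiles

def boxplot_summary(values):
    """Return quartiles, fences, and outliers, roughly as a boxplot would show them."""
    q1, median, q3 = quantiles(values, n=4)  # three cut points: Q1, median, Q3
    iqr = q3 - q1
    # Assumed convention: points beyond 1.5 * IQR from the box count as outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical purchase amounts with one extreme observation.
amounts = [12, 15, 14, 10, 13, 16, 11, 95]
summary = boxplot_summary(amounts)
print(summary["outliers"])
```

A plotting library would draw the same quantities; computing them directly just makes explicit which observations a boxplot would mark as extreme.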
Supervised learning algorithms
Algorithms used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known.
Simple linear regression analysis is an example of supervised learning. A regression line is drawn to minimize the sum of squared deviations between the actual Y values and the values predicted by this line. The regression line can now be used to predict Y values for new values of X for which we do not know the Y value.
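A minimal sketch of this idea, using the standard closed-form least-squares formulas for one predictor. The data values and the function names `fit_line` and `predict` are hypothetical, chosen only for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) of the line minimizing the sum of squared deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(intercept, slope, x_new):
    """Predict Y for a new X whose Y value is unknown."""
    return intercept + slope * x_new

# Made-up training data in which Y is known for every X.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = fit_line(xs, ys)
y_new = predict(b0, b1, 6.0)  # prediction for a new case
print(b0, b1, y_new)
```

The "learning" happens in `fit_line`, which needs the known Y values; `predict` then applies the fitted line to cases where Y is not known.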
Unsupervised learning algorithms
Algorithms used where there is no outcome variable to predict or classify. Hence, there is no “learning” from cases where such an outcome variable is known. Association rules, dimension reduction methods, and clustering techniques are all unsupervised learning methods.
Steps in Data Mining
- Develop an understanding of the purpose of the data mining project.
- Obtain the dataset to be used in the analysis.
- Explore, clean, and preprocess the data.
- Reduce the data, if necessary, and (where supervised training is involved) separate them into training, validation, and test datasets.
- Determine the data mining task (classification, prediction, clustering, etc.).
- Choose the data mining techniques to be used (regression, neural nets, hierarchical clustering, etc.).
- Use algorithms to perform the task.
- Interpret the results of the algorithms.
- Deploy the model.
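The partitioning step in the list above can be sketched as a random split of the records. The 50/30/20 proportions, the fixed seed, and the function name `partition` are assumptions made for illustration; the flashcards do not prescribe particular fractions.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation, and test sets."""
    shuffled = list(records)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains
    return train, valid, test

# Hypothetical dataset: 100 record ids.
records = list(range(100))
train, valid, test = partition(records)
print(len(train), len(valid), len(test))
```

Each record lands in exactly one partition, so models fitted on the training set can be compared on the validation set and the chosen model assessed on the test set.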
SEMMA
Sample Explore Modify Model Assess
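Three common measures of prediction error
Average error: simply the average of the residuals (errors). Because positive and negative errors cancel each other, it shows whether predictions are, on average, too high or too low rather than how large a typical error is.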
The RMS error (root-mean-squared error) is perhaps the most useful measure of all. It takes the square root of the average squared error; thus, it gives an idea of the typical error (whether positive or negative) in the same scale as that used for the original data.
The total sum of squared errors adds up the squared errors, so whether an error is positive or negative, it contributes just the same. However, this sum does not yield information about the size of the typical error.
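A short sketch computing the three measures on made-up numbers. The residual is taken here as actual minus predicted, which is an assumed sign convention, and the function name `error_measures` is hypothetical.

```python
import math

def error_measures(actual, predicted):
    """Return the average error, RMS error, and total sum of squared errors."""
    # Assumed convention: residual = actual value minus predicted value.
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    sse = sum(e ** 2 for e in residuals)
    return {
        "average_error": sum(residuals) / n,
        "rms_error": math.sqrt(sse / n),
        "sse": sse,
    }

# Made-up values: every prediction is off by exactly 1, in alternating directions.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]

measures = error_measures(actual, predicted)
print(measures)
```

In this example the average error is 0 because the positive and negative residuals cancel, while the RMS error of 1 reflects the typical size of an error in the original units. The sum of squared errors is 4, a figure that grows with the number of records and so says little about a typical error on its own.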