Test 1 ISYS 4293 Flashcards
Business Intelligence and Data Mining
Data mining is a collection of knowledge-discovery technologies used to perform Business Intelligence in order to support an organization’s decision-making
Cross-Industry Standard Process for Data Mining (CRISP-DM)
The standard process model for how data mining is carried out
(1) Business Problem Understanding
-Define business requirements and objectives
-Translate objectives into data mining problem definition
-Prepare initial strategy to meet objectives
(2) Data Understanding Phase
-Collect data
-Assess data quality
-Perform exploratory data analysis (EDA)
(3) Data Preparation Phase
-Cleanse, prepare, and transform data set
-Prepares for modeling in subsequent phases
-Select cases and variables appropriate for analysis
(4) Modeling Phase
-Select and apply one or more modeling techniques
-Calibrate model settings to optimize results
-Additional data preparation may be required at this stage
(5) Evaluation Phase
-Evaluate one or more models for effectiveness
-Determine whether defined objectives are achieved
-Make decision regarding data mining results before deploying to field
(6) Deployment Phase
-Make use of models created
-Simple deployment: generate report
-Complex deployment: implement additional data mining effort in another department
-In business, customer often carries out deployment based on model
How many data mining tasks?
There are 6 data mining tasks
Data Mining Task: Description
-Describes general patterns and trends
-Easy to interpret and explain
-Transparent Models
-Pictures and #’s
-E.g. Scatterplots, Descriptive Stats
Data Mining Task: Estimation
-Target Variable = Numerical
-Uses numerical or categorical predictor (IV) values to approximate a numerical target variable (DV)
-Ex: Estimate a student’s Graduate GPA from their Undergrad GPA
-E.g. Correlation, Linear Regression
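A minimal sketch of the estimation task above using scikit-learn's LinearRegression (assumed to be installed); the GPA values are made up for illustration:

```python
# Estimation sketch: approximate graduate GPA (numeric target)
# from undergraduate GPA (numeric predictor). Values are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

undergrad_gpa = np.array([[2.8], [3.1], [3.4], [3.7], [3.9]])  # predictor (IV)
grad_gpa = np.array([3.0, 3.2, 3.5, 3.7, 3.8])                 # numeric target (DV)

model = LinearRegression().fit(undergrad_gpa, grad_gpa)
print(model.predict([[3.5]]))  # estimated graduate GPA for a 3.5 undergrad GPA
```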
Data Mining Task: Classification
-Target variable (DV) = categorical
-Examples:
Simple vs Complex tasks
Fraudulent card transactions
Income brackets(ex. high, middle, low)
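A minimal classification sketch, assuming scikit-learn is available; the transaction values and labels below are made up for illustration:

```python
# Classification sketch: flag transactions as fraudulent or legitimate
# (categorical target). All values are made up.
from sklearn.linear_model import LogisticRegression

# [transaction amount, card physically present (1/0)]
X = [[20, 1], [950, 0], [35, 1], [1200, 0], [15, 1], [880, 0]]
y = ["legit", "fraud", "legit", "fraud", "legit", "fraud"]  # categorical DV

clf = LogisticRegression().fit(X, y)
print(clf.predict([[700, 0]]))  # predicted class label for a new transaction
```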
Data Mining Task: Prediction
-Results lie in the future
-There is a time component in this task
-Ex: What is the probability of Razorbacks winning a game with a particular combination of player profiles?
Data Mining Task: Association
-Finding attributes of data that go together
-Profiling relationships between two or more attributes
-Understand consequent behaviors based on prior (antecedent) behaviors
-Ex: Supermarkets use affinity analysis to see what items are purchased together
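A minimal affinity-analysis sketch, assuming the mlxtend package is available; the basket data is made up for illustration:

```python
# Association sketch: find items that are purchased together.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot basket data: each row is a transaction (made-up supermarket example)
baskets = pd.DataFrame({
    "bread":  [True, True, False, True, True],
    "butter": [True, True, False, False, True],
    "beer":   [False, False, True, True, False],
})

frequent = apriori(baskets, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```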
Data Mining Task: Clustering
-no target variables
-segmentation of data
-Ex: Focused marketing campaigns
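A minimal clustering sketch, assuming scikit-learn is available; there is no target variable, only segmentation, and the customer values are made up:

```python
# Clustering sketch: segment customers with no target variable.
from sklearn.cluster import KMeans

# [annual spend, store visits per month] for six customers (made-up values)
customers = [[200, 2], [220, 3], [900, 12], [950, 10], [180, 1], [1000, 11]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(km.labels_)  # segment assigned to each customer, e.g. [0 0 1 1 0 1]
```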
Data mining Task: Learning Types
Supervised and Unsupervised
Supervised
-Have a target variable
-Task:
Classification(Categorical Target Variable)
Estimation (Numeric Target Variable)
Description
Prediction
Unsupervised
-No target variable
-Task:
Association
Clustering
Fallacy 1:
-Set of tools can be turned loose on data repositories
-Finds answers to all business problems
Reality 1:
-No data mining tool automatically solves business problems
-Rather, data mining is a process (CRISP-DM)
-Integrates into overall business objectives
Fallacy 2:
-Data mining process is autonomous
-Requires little oversight
Reality 2:
-Requires significant intervention during every phase
-Even after model deployment, models require ongoing updates
-Continuous evaluative measures monitored by analysts
Fallacy 3:
-Data mining quickly pays for itself
Reality 3:
-Return rates vary
-Depending on startup, personnel, data preparation costs, etc.
Fallacy 4:
-Data mining software is easy to use
Reality 4:
-Ease of use varies across projects
-Analysts must combine subject-matter expertise with knowledge of the specific problem domain
Fallacy 5:
-Data mining identifies causes of business problems
Reality 5:
-Knowledge discovery process uncovers patterns of behavior
-Humans interpret results and identify causes
Fallacy 6:
-Data mining automatically cleans data in databases
Reality 6:
-Data mining often uses data from legacy systems
-Data possibly not examined or used in years
-Organizations starting data mining efforts confronted with huge data preprocessing task
Fallacy 7:
-Data mining will always yield positive results
Reality 7:
-Positive results are not guaranteed
-Can sometimes provide actionable results and improve decisions, but not always
Data preparation
About 60% of the effort in the data mining process
Data Cleaning
-Replacing missing values
-Normalization: converting variables to a standardized scale
-Testing for Normality
-Dummy Variables
-Outliers
Why Preprocess data
-Raw data is often incomplete and noisy
-Data often comes from legacy databases where values are missing or no longer relevant
-Data may be in a form not suitable for data mining; obsolete fields; outliers
Three Alternative Methods for Replacing Missing Values
-Replace Missing Values with User-defined Constant
-Replace Missing Values with Mode or Mean/Median
-Replace Missing Values with Random Values
Replace Values with User-Defined Constant
-Missing numeric values replaced with 0.0
-Missing categorical values replaced with "Missing"
Replace Missing Values with Mode or Mean/Median
-Mode for categorical field
-Mean/Median for continuous field
Replace Missing Values with Random Values
-Values randomly taken from underlying distribution
-Method superior compared to mean substitution
-Measures of location and spread remain closer to original
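A sketch of the three replacement methods in pandas; the column names and data values are made up for illustration:

```python
# Missing-value replacement sketch: constant, mean/mode, and random draws.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [52000, np.nan, 61000, np.nan, 75000],
                   "region": ["N", "S", np.nan, "S", "N"]})

# 1. User-defined constant
df["income_const"] = df["income"].fillna(0.0)
df["region_const"] = df["region"].fillna("Missing")

# 2. Mean/median for continuous fields, mode for categorical fields
df["income_mean"] = df["income"].fillna(df["income"].mean())
df["region_mode"] = df["region"].fillna(df["region"].mode()[0])

# 3. Random values drawn from the observed (non-missing) distribution
rng = np.random.default_rng(0)
observed = df["income"].dropna().to_numpy()
df["income_rand"] = df["income"].apply(
    lambda v: rng.choice(observed) if np.isnan(v) else v)
print(df)
```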
Data Transformation: Normalization
-Standardizes the scale of effect each variable has on the results by putting the mean and variance (or range) of every variable on a comparable footing
-(Numeric field values should be normalized)
Min-Max Normalization
-Determines how much greater field value is than minimum value for field
-Scales this difference by field’s range
-X* stands for “min-max normalized X”
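In symbols, matching the X* notation above and the min-max equation quoted later in these cards: X* = (X − min(X)) / (max(X) − min(X))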
Z-score Standardization
-Widely used in statistical analysis
-Takes the difference between the field value and the field mean
-Scales this difference by the field's standard deviation
-Values typically fall roughly in the range [-3, 3]
-Data values equal to field’s mean have z-score Standardization value = 0
-Data values that lie above the mean have positive z-score Standardization values
In Z-score Standardization:
-Data values equal to field’s mean
have z-score Standardization value = 0
In Z-score Standardization:
-Data values that lie above the mean
have positive z-score Standardization values
In Z-score Standardization:
-Data Values that lie below mean
Have negative z-score Standardization Values
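A short sketch of both rescaling methods in NumPy (assumed available); the data values are made up for illustration:

```python
# Normalization sketch: min-max normalization and z-score standardization.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

minmax = (x - x.min()) / (x.max() - x.min())   # min-max: rescaled to [0, 1]
zscore = (x - x.mean()) / x.std(ddof=1)        # z-score: mean 0, roughly in [-3, 3]

print(minmax)
print(zscore)  # values above the mean are positive, below the mean are negative
```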
Normality
Transform a variable so that its distribution is closer to normal without changing its basic information
Data Transformation: Normality
Common transformations:
-Natural log: ln(bank)
-Square root: √bank
-Inverse square root: 1/√bank
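A sketch applying the three transformations with NumPy; the field name bank mirrors the cards above and the values are made up:

```python
# Normality-transformation sketch on a right-skewed field.
import numpy as np

bank = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 40.0])  # made-up right-skewed values

log_bank      = np.log(bank)        # natural log
sqrt_bank     = np.sqrt(bank)       # square root
inv_sqrt_bank = 1 / np.sqrt(bank)   # inverse square root
print(log_bank, sqrt_bank, inv_sqrt_bank)
```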
Right-skewed data
mean > median; skewness is positive
Left-skewed data
mean < median; skewness is negative
Symmetric data
mean = median = mode; skewness is zero
Outliers
-values that lie near extreme limits of data range
-Outliers may represent errors in data entry
Z-score Standardization
sensitive to outliers
Min-Max normalization
sensitive to variation
Interquartile Range (IQR)
-Used to identify Outliers
-Robust statistical method and less sensitive to presence of outliers
-measure of variability
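A sketch of IQR-based outlier detection using the usual 1.5 × IQR fences, assuming NumPy is available; the data values are made up:

```python
# IQR outlier-detection sketch.
import numpy as np

x = np.array([12, 14, 15, 15, 16, 17, 18, 19, 60])

q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
iqr = q3 - q1                         # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(outliers)  # e.g. [60]
```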
Hypothesis
A statement or claim about a parameter
Null Hypothesis
Represents the assumed (status quo) value of the parameter
Alternative Hypothesis
Represents the alternative claim about the parameter's value
Statistical Inference
Methods for estimating and testing hypotheses about population characteristics based on information contained in a sample
A data analyst meets with superiors to discuss
whether to use kNN or Association on the data
Modeling Phase (still discussing which model to use)
The Chief Analyst meets with the CIO, who says that she would like to investigate and scope out how analytics can be used in HR hiring projects.
Business Understanding Phase (note the phrase "investigate and scope out")
Estimate the amount of money a randomly chosen family of 4 will spend shopping at a given time and date.
Estimation or prediction (consider the target variable: categorical or continuous, numerical or non-numerical)
Forecast the stock price of Microsoft for next year?
estimation or prediction
What does this equation represent?
zᵢ = (xᵢ − x̄) / s
The z-score equation, where:
zᵢ = z-score
xᵢ = observed value
x̄ = sample mean
s = sample standard deviation
Simple linear regression equation
y = β₀ + β₁x + ε
What is the use of standardizing variables?
– Automatically remove outliers in the variables
– Convert variables to the same scale *
– Helps in computing IQR
– Make interpretation of the results easier
When Handling Missing Data, one could,
– Replace Missing Values with User-defined Constant
– Replace Missing Values with Mode or Mean/Median
– Replace Missing Values with Random Values
– All of the above *
The IQR method is more robust than the Z-score method for outlier detection; however, it is highly sensitive to the mean and standard deviation.
- True
- False * (note "highly sensitive": the IQR method uses quartiles, not the mean and standard deviation)
- It depends on the context
- It depends on the observations
- Only 3 and 4
Which is more sensitive to outliers: IQR or z-score?
Z-score
In data mining tasks, one could reduce the margin of error by…
– Reducing the sample size
– Increasing the sample size *
– Changing the standard deviation
– Keeping the sample size constant
Normalization of the data can be done using
None of the above (the min-max equation): x′ = (x − x_min) / (x_max − x_min)
In Forward Regression, you start with all variables of interest in the model and then, at each step, the least significant variable is dropped if its p-value is above a pre-set level (α = .05 or .10)
False (this describes Backward Elimination; Forward Regression starts with no variables and adds the most significant variable at each step)
Before running a k-nearest neighbor model it is required to
set the number of neighbors to compare instances to
In k-nearest neighbor, distance for a categorical variable can be computed by
A different distance function, e.g., one that returns 0 when the two category values match and 1 when they differ (see the sketch below)
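A minimal sketch tying the two kNN cards above together, assuming scikit-learn is available: the number of neighbors (k) is set before fitting, and a categorical attribute gets its own "different from" distance. All values are made up for illustration:

```python
from sklearn.neighbors import KNeighborsClassifier

def categorical_distance(a, b):
    """'Different from' distance: 0 if the category values match, 1 otherwise."""
    return 0 if a == b else 1

# Numeric-only kNN: the number of neighbors (k) must be chosen before fitting.
X = [[1.0], [1.2], [3.5], [3.7], [3.9]]
y = ["A", "A", "B", "B", "B"]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

print(knn.predict([[3.6]]))                  # -> ['B']
print(categorical_distance("red", "blue"))   # -> 1
```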
Choose appropriate fit statistics for estimation model selection:
– Misclassification
– Gini Coeff
– Average Squared Error *
– Schwarz’s Bayesian Criterion *
– Average Profit/Loss
– Log Likelihood *
Appropriate fit statistics for rankings model selection:
– ROC Index *
– Gini Coefficient *
Choose appropriate fit statistics for decision model selection:
– Misclassification *
– Gini Coeff
– ASE
– MSE
– Average Profit/Loss *
– Log Likelihood
– Kolmogorov-Smirnov statistic *
Which is true when modeling a Decision Tree?
– Each variable is evaluated at each node to determine the splitting variable
– The same variable may be used for splitting at different locations in the Decision Tree
– CART (Phi) / information gain criteria can be used for selecting candidate splits
– If not pruned, a stopping criterion in creating a Decision Tree is when the tree reaches the leaf nodes
– All of the above *
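A minimal Decision Tree sketch, assuming scikit-learn is available; the data values are made up, and the entropy criterion stands in for information gain:

```python
# Decision Tree sketch: every variable is evaluated at each node for a split,
# using an impurity criterion ("entropy" here gives information-gain splits).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 0], [40, 1], [52, 1], [33, 0], [45, 1], [29, 0]]  # [age, owns_home]
y = ["no", "yes", "yes", "no", "yes", "no"]                  # made-up labels

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(tree, feature_names=["age", "owns_home"]))
```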
Categorical data
- Labels or names used to identify an attribute of each element
- Generally qualitative
- Nominal or ordinal
Quantitative data
- indicates how many or how much
- Either discrete or continuous
The sum of the differences between xᵢ and x̄, Σ(xᵢ − x̄), equals
0
Variance
- Measures how far a set of numbers is spread out from its average value
- Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
Overfitting
-When your model memorizes the exact training data but fails to learn the underlying pattern
-Fits the training data too closely
Underfitting
-When your model is too simple
-Model isn't complex enough to capture the pattern in the training data
continuous variable
use the two-sample t test for the difference in means
flag variable
use the two-sample Z test for the difference in proportions
multinomial variable
use the test for the homogeneity of proportions
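A sketch mapping each variable type above to its test, assuming SciPy and statsmodels are available; all counts and values are made up for illustration:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Continuous variable: two-sample t test for the difference in means
a = np.array([5.1, 4.8, 5.6, 5.0])
b = np.array([4.2, 4.5, 4.1, 4.6])
print(stats.ttest_ind(a, b))

# Flag (binary) variable: two-sample Z test for the difference in proportions
print(proportions_ztest(count=[30, 45], nobs=[200, 210]))

# Multinomial variable: chi-square test for the homogeneity of proportions
table = np.array([[20, 30, 50], [35, 25, 40]])
print(stats.chi2_contingency(table))
```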
CART goodness-of-split equation:
Φ(s|t) = 2 · P_L · P_R · Σ_j |P(j|t_L) − P(j|t_R)|
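A small sketch computing Φ(s|t) for one candidate split directly from the formula above; the class counts are made up for illustration:

```python
def goodness_of_split(left_counts, right_counts):
    """Phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)| for one split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    p_l = n_left / (n_left + n_right)   # proportion of records sent left
    p_r = n_right / (n_left + n_right)  # proportion of records sent right
    total = sum(abs(l / n_left - r / n_right)
                for l, r in zip(left_counts, right_counts))
    return 2 * p_l * p_r * total

# Two classes: left child has 8 / 2 records per class, right child has 3 / 7
print(goodness_of_split([8, 2], [3, 7]))  # 2 * 0.5 * 0.5 * (0.5 + 0.5) = 0.5
```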