Test 1 ISYS 4293 Flashcards
Business Intelligence and Data Mining
Data mining is a collection of knowledge-discovery technologies used to perform Business Intelligence in order to support an organization’s decision-making
Cross-Industry Standard Process for Data Mining (CRISP-DM)
The standard six-phase process for carrying out a data mining project
(1) Business Problem Understanding
-Define business requirements and objectives
-Translate objectives into data mining problem definition
-Prepare initial strategy to meet objectives
(2) Data Understanding Phase
-Collect data
-Assess data quality
-Perform exploratory data analysis (EDA)
(3) Data Preparation Phase
-Cleanse, prepare, and transform data set
-Prepares for modeling in subsequent phases
-Select cases and variables appropriate for analysis
(4) Modeling Phase
-Select and apply one or more modeling techniques
-Calibrate model settings to optimize results
-If necessary, additional data preparation may be required
(5) Evaluation Phase
-Evaluate one or more models for effectiveness
-Determine whether defined objectives are achieved
-Make decision regarding data mining results before deploying to field
(6) Deployment Phase
-Make use of models created
-Simple deployment: generate report
-Complex deployment: implement additional data mining effort in another department
-In business, customer often carries out deployment based on model
How many data mining tasks?
6 data mining tasks: Description, Estimation, Classification, Prediction, Association, Clustering
Data Mining Task: Description
-Describes general patterns and trends
-Easy to interpret and explain
-Transparent Models
-Pictures and numbers
-E.g. Scatterplots, Descriptive Stats
Data Mining Task: Estimation
-Target Variable = Numerical
-Uses numerical or categorical predictor (IV) values to approximate the value of a numerical target variable (DV)
-Ex: Estimate a student’s Graduate GPA from their Undergrad GPA
-E.g. Correlation, Linear Regression
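The GPA example can be sketched as a one-predictor linear regression. This is a minimal illustration only: the function name fit_line and the GPA values are made up for the example and are not real student data.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical values chosen only to illustrate the arithmetic.
undergrad_gpa = [2.0, 3.0, 4.0]   # predictor (IV)
graduate_gpa = [2.5, 3.0, 3.5]    # numerical target (DV)

slope, intercept = fit_line(undergrad_gpa, graduate_gpa)

# Estimate the graduate GPA of a student with a 3.4 undergrad GPA.
estimate = intercept + slope * 3.4
print(round(estimate, 2))
```

The target is numerical, which is what makes this an estimation task rather than a classification task.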
Data Mining Task: Classification
-target variables (DV’s) = categorical
-Examples:
Simple vs Complex tasks
Fraudulent card transactions
Income brackets (e.g., high, middle, low)
Data Mining Task: Prediction
-Results lie in the future
-There is a time component in this task
-Ex: What is the probability of Razorbacks winning a game with a particular combination of player profiles?
Data Mining Task: Association
-Finding attributes of data that go together
-Profiling relationships between two or more attributes
-Understand consequent behaviors based on prior (antecedent) behaviors
-Ex: Supermarkets use affinity analysis to see what items are purchased together
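A small sketch of how "purchased together" is usually quantified. Support and confidence are the standard affinity-analysis measures, although these cards do not name them. The baskets and function names below are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Rule "diapers -> beer": how often the pair occurs, and how often
# beer appears given that diapers were bought.
print(support(baskets, {"diapers", "beer"}))
print(confidence(baskets, {"diapers"}, {"beer"}))
```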
Data Mining Task: Clustering
-no target variables
-segmentation of data
-Ex: Focused marketing campaigns
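Segmentation without a target variable can be sketched with k-means, one common clustering algorithm. The cards do not name a specific algorithm, so this choice is an assumption, and the spending figures and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Basic k-means on a single numeric variable, starting from the
    given centers. Returns (final centers, groups of values)."""
    centers = list(centers)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (unchanged if empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical annual spend per customer; note there is no target variable.
annual_spend = [120, 150, 130, 900, 950, 1000]
centers, segments = kmeans_1d(annual_spend, centers=[100, 500])
print(segments)
```

Each resulting segment could then receive its own marketing campaign.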
Data Mining Task: Learning Types
Supervised and Unsupervised
Supervised
-Have a target variable
-Task:
Classification (Categorical Target Variable)
Estimation (Numeric Target Variable)
Description
Prediction
Unsupervised
-No target variable
-Task:
Association
Clustering
Fallacy 1:
-Set of tools can be turned loose on data repositories
-Finds answers to all business problems
Reality 1:
-No automatic data mining tools solve problems
-Rather, data mining is a process (CRISP-DM)
-Integrates into overall business objectives
Fallacy 2:
-Data mining process is autonomous
-Requires little oversight
Reality 2:
-Requires significant intervention during every phase
-After model deployment, new models require updates
-Continuous evaluative measures monitored by analysts
Fallacy 3:
-Data mining quickly pays for itself
Reality 3:
-Return rates vary
-Depending on startup, personnel, data preparation costs, etc.
Fallacy 4:
-Data mining software is easy to use
Reality 4:
-Ease of use varies across projects
-Analysts must combine subject matter knowledge with an understanding of the specific problem domain
Fallacy 5:
-Data mining identifies causes of business problems
Reality 5:
-Knowledge discovery process uncovers patterns of behavior
-Humans interpret results and identify causes
Fallacy 6:
-Data mining automatically cleans data in databases
Reality 6:
-Data mining often uses data from legacy systems
-Data possibly not examined or used in years
-Organizations starting data mining efforts confronted with huge data preprocessing task
Fallacy 7:
-Data mining will always yield positive results
Reality 7:
-Positive results are not guaranteed
-Can sometimes provide actionable results and improve decisions, but not always
Data preparation
Accounts for 60% of the effort in the data mining process
Data Cleaning
-Replacing missing values
-Normalization, converting variables to standardized scale
-Testing for Normality
-Dummy Variables
-Outliers
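Two of the cleaning steps above, normalization and dummy variables, can be sketched as follows. The function names and sample values are hypothetical, and dropping the first category as the reference level is one common convention, not something these cards specify.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def min_max(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def dummy_variables(values):
    """One 0/1 flag per category. The first category in sorted order is
    the reference level and gets no flag of its own."""
    categories = sorted(set(values))[1:]
    return [{c: int(v == c) for c in categories} for v in values]

# Hypothetical sample fields.
ages = [20, 30, 40]
regions = ["north", "south", "west", "south"]

print(min_max(ages))
print(z_scores(ages))
print(dummy_variables(regions))
```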
Why Preprocess data
-Raw data may often be incomplete, noisy
-Data often from legacy databases where values are missing or irrelevant
-Data in form not suitable for data mining; Obsolete fields; Outliers
Three Alternate Methods For Replacing Data
-Replace Missing Values with User-defined Constant
-Replace Missing Values with Mode or Mean/Median
-Replace Missing Values with Random Values
Replace Missing Values with User-Defined Constant
-Missing numeric values replaced with 0.0
-Missing categorical values replaced with “Missing”
Replace Missing Values with Mode or Mean/Median
-Mode for categorical field
-Mean/Median for continuous field
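A minimal sketch of this replacement rule, assuming missing entries are represented as None. The helper name fill_missing and the sample fields are hypothetical.

```python
from statistics import mean, median, mode

def fill_missing(values, how):
    """Replace None entries with the mean, median, or mode of the
    observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[how](observed)
    return [fill if v is None else v for v in values]

# Hypothetical fields: one continuous, one categorical.
incomes = [40, None, 50, 90]
colors = ["red", "blue", None, "red"]

print(fill_missing(incomes, "mean"))    # continuous field
print(fill_missing(incomes, "median"))  # continuous field
print(fill_missing(colors, "mode"))     # categorical field
```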
Replace Missing Values with Random Values
-Values randomly taken from underlying distribution
-Method superior compared to mean substitution
-Measures of location and spread remain closer to original
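A sketch of random replacement, assuming missing entries are None and that the underlying distribution is approximated by the values actually observed in the field. The helper name fill_random and the sample values are hypothetical.

```python
import random

def fill_random(values, seed=None):
    """Replace None entries with values drawn at random from the
    observed (non-missing) values of the same field."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Hypothetical field with two missing entries.
weights = [150, None, 180, None, 210]
filled = fill_random(weights, seed=1)
print(filled)
```

Because the replacements are spread across the observed values instead of piling up at the mean, the spread of the field is less likely to shrink.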