Lecture 5 (Data Mining) Flashcards
Why Data Mining?
More intense competition
Recognition of the value in data sources
Availability of quality data on customers, vendors, transactions
Definition of Data Mining?
The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases.
Data Mining Characteristics and Objectives?
The source of data for DM is often a consolidated data warehouse
The DM environment is usually a client-server or a Web-based information system architecture
Data is the most critical ingredient for DM, and it may include soft/unstructured data
The miner is the end user
How does data mining work?
DM extracts patterns from data
Types of patterns in data mining?
Association
Prediction
Cluster
Sequential
Association methods?
Market-basket
Link analysis
Sequence analysis
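Worked sketch for market-basket analysis: a rule such as "bread -> milk" is usually scored by its support and confidence. The baskets and the helper names (`count`, `support`, `confidence`) below are made up for illustration and are not from the lecture.

```python
# Hypothetical toy transactions; each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, baskets) / count(antecedent, baskets)


print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
```

Here 3 of 5 baskets contain both bread and milk (support 0.6), and 3 of the 4 bread baskets also contain milk (confidence 0.75).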
Prediction methods?
Classification
Regression
Time Series
Segmentation methods?
Clustering
Outlier analysis
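Worked sketch for clustering: one common way to segment data is k-means, shown here in one dimension. The lecture does not name a specific algorithm, so k-means, the sample values, and the starting centroids are all assumptions chosen for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Minimal 1-D k-means: assign each value to its nearest centroid,
    then move each centroid to the mean of its assigned values."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ended up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical data with two obvious groups, and two assumed starting centroids.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(values, [1.0, 12.0])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
```

No labels are supplied; the two segments emerge from the distances alone.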
Supervised Learning problems?
Classification
- The domain of the target is finite and categorical
- A classifier must assign a class to an unseen example
Regression
- The target attribute is continuous (it can take infinitely many numeric values)
- The goal is to fit a model that learns the output target attribute as a function of the input attributes
Time Series Analysis
- Making predictions in time
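Worked sketch for regression: fitting the target as a function of one input attribute with ordinary least squares. The function names `fit_line` and `predict` and the toy data are illustrative assumptions, not part of the lecture.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input attribute: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Apply the fitted line to an unseen input value."""
    return slope * x + intercept


# Hypothetical labelled examples generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)                # 2.0 1.0
print(predict(5.0, slope, intercept))  # 11.0
```

The known `ys` act as supervision. A classifier would follow the same train-then-predict shape, but with a categorical target.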
Unsupervised Learning Problems?
Clustering
Association Rules
Pattern Mining
- It is adopted as a more general term than frequent pattern mining or association mining
Outlier Detection
- It is the process of finding data examples whose behaviour is very different from what is expected
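Worked sketch for outlier detection: one simple approach flags values that lie many standard deviations from the mean (a z-score rule). The rule, the threshold of 2.0, and the sample readings are assumptions for illustration; the lecture does not prescribe a method.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one value far from the rest.
readings = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 50.0]
print(zscore_outliers(readings))  # [50.0]
```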
Data Mining Applications?
Customer Relationship Management
Banking and Other Financial
Retailing and Logistics
Manufacturing and Maintenance
Brokerage and Securities Trading
Insurance
Computer Hardware and Software
Science and Engineering
Government and Defense
Homeland security and law enforcement
Travel, entertainment, sports
Healthcare and medicine
Sports, virtually everywhere
Customer Relationship Management?
Maximize return on marketing campaigns
Improve customer retention
Maximize customer value
Identify and treat most valued customers
Banking and Other Financial?
Automate the loan application process
Detecting fraudulent transactions
Maximize customer value
Optimizing cash reserves with forecasting
Retailing and Logistics?
Optimize inventory levels at different locations
Improve the store layout and sales promotions
Optimize logistics by predicting seasonal effects
Minimize losses due to limited shelf life
Manufacturing and Maintenance?
Predict/prevent machinery failures
Identify anomalies in production systems to optimize the use of manufacturing capacity
Discover novel patterns to improve product quality
Brokerage and Securities Trading?
Predict changes in certain bond prices
Forecast the direction of stock fluctuations
Assess the effect of events on market movements
Identify and prevent fraudulent activities in trading
Insurance?
Forecast claim costs for better business planning
Determine the optimal rate plans
Optimize marketing to specific customers
Identify and prevent fraudulent claim activities
Data mining process?
A manifestation of best practices
A systematic way to conduct DM projects
Moving from Art to Science for DM projects
Everybody has a different vision
Most common standard processes of Data Mining?
CRISP-DM
SEMMA
KDD
CRISP-DM?
Cross Industry Standard Process for Data Mining
Proposed in the 1990s by a European consortium
Steps of CRISP-DM?
Business Understanding
Data Understanding
Data Preparation
Model Building
Testing and Evaluation
Deployment
SEMMA?
Sample
Explore
Modify
Model
Assess
KDD?
Knowledge Discovery in Databases
Steps to KDD?
Data selection
Data cleaning
Data transformation
Data mining
Internalization
Examples of Classification Task?
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Classification Techniques?
Decision tree based methods
Rule-based methods
Neural Networks
Naive Bayes and Bayesian Belief Networks
Support Vector Machines
Pros of KNN?
Simple
Flexible
Excellent performance on a wide range of tasks
Cons of KNN?
Time consuming: each prediction compares the query against all n training points
Memorization, not learning.
No insight into the domain
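Worked sketch for KNN: a minimal k-nearest-neighbours classifier using Euclidean distance and majority vote. The training points, the labels, and the name `knn_predict` are illustrative assumptions.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs. There is no training step:
    the "model" is just the stored data, and every prediction measures
    the distance to all n training points.
    """
    neighbours = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training set with two well-separated classes.
train = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"),
    ((7.0, 5.5), "B"),
    ((6.5, 7.0), "B"),
]

print(knn_predict(train, (1.2, 1.4)))  # A
print(knn_predict(train, (6.4, 6.1)))  # B
```

The sketch shows both sides of the card: it takes only a few lines (simple, flexible), but it keeps the whole training set around and yields no explicit rules about the domain.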
Assessment Methods for Classification?
Predictive accuracy
- Hit rate
Speed
- Model building versus predicting/usage speed
Robustness
Scalability
Interpretability
In classification problems, the primary source for accuracy estimation is the?
Confusion Matrix
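Worked sketch for the confusion matrix: for a two-class problem it counts true/false positives and negatives, and predictive accuracy (hit rate) is the share of correct predictions. The labels, the example predictions, and the helper names are made up for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN, TN for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


def accuracy(counts):
    """Hit rate: correctly classified examples divided by all examples."""
    total = sum(counts.values())
    return (counts["TP"] + counts["TN"]) / total


# Hypothetical true labels and classifier outputs for eight transactions.
actual    = ["fraud", "legit", "legit", "fraud", "legit", "legit", "fraud", "legit"]
predicted = ["fraud", "legit", "fraud", "legit", "legit", "legit", "fraud", "legit"]

counts = confusion_counts(actual, predicted, positive="fraud")
print(counts)            # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 4}
print(accuracy(counts))  # 0.75
```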