W04 Unsupervised Learning Flashcards
Applications of Data Mining
Product Placement
Fraud Detection
Traffic Forecasts
Web Optimization
Customer Relationship Management
Data Mining Methods
pre-processing, description and reduction
visualization, correlation, association rule learning to show relationships
explanation through regression
classification, discriminant analysis
anomaly recognition
prognosis
segmentation through cluster analysis
Terminology of a Dataset table
Attribute (columns)
Instance (rows)
Evaluating Data Mining
Split?
Training Set: generate model from the data
Validation Set: validate and improve the model
Test Set: test applicability and performance of model
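A minimal sketch of such a three-way split, assuming a 60/20/20 ratio and a fixed shuffle seed. Both values, and the function name `split_dataset`, are illustrative choices and not taken from the flashcards.

```python
import random


def split_dataset(instances, train=0.6, validation=0.2, seed=0):
    """Shuffle the instances, then cut them into training, validation and test sets.

    The fractions and the seed are illustrative assumptions.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_validation]
    test_set = shuffled[n_train + n_validation:]  # whatever remains
    return training_set, validation_set, test_set


# Hypothetical data: 100 instances identified by their row number.
training_set, validation_set, test_set = split_dataset(range(100))
print(len(training_set), len(validation_set), len(test_set))
```

The test set is only touched once the model has been tuned on the validation set, so that it stays an unbiased check.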
Supervised Learning
For all sets, solution is known a priori
Unsupervised Learning
No structural solution known yet (clustering, association rules)
Clustering
what?
Conditions?
Applications?
unsupervised segmentation of an instance set based on attributes
instances belong to different segments
described by multi-dimensional attributes
attribute values can be quantified
Instance segments are not known a priori but possibility of segmentation is expected
Kinds of clusters
unique (one instance, one cluster)
overlapping (one instance, multiple clusters)
probabilistic (one instance, one probability per cluster)
Hierarchical Clustering
agglomerative: from n clusters to one cluster
divisive: from one cluster to n clusters
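The agglomerative direction can be sketched as follows. This is a minimal illustration that assumes single linkage (cluster distance = distance of the two closest instances) and Euclidean distance. The flashcards do not name a linkage criterion, so both are assumptions, as is the function name `agglomerate`.

```python
from itertools import combinations
from math import dist


def agglomerate(instances):
    """Start with one cluster per instance and merge the two closest
    clusters until a single cluster remains. Returns the merge history."""
    clusters = [[p] for p in instances]
    merges = []
    while len(clusters) > 1:
        # Single linkage (assumption): compare clusters by their closest pair.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: min(
                dist(p, q) for p in clusters[ij[0]] for q in clusters[ij[1]]
            ),
        )
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional instances.
merges = agglomerate([(0,), (1,), (10,)])
print(merges)
```

Cutting the merge history at any step yields a flat clustering with the corresponding number of clusters.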
Offline vs Online Clustering
Offline: complete set of instances known prior to clustering
Online: instances are added iteratively
- growing datasets
- more efficient
- streaming clustering processes only latest instances
k-Means Clustering Algorithm
- unique; flat; offline
- k-means++: spread-out initial centroids (distance-weighted sampling instead of uniform random)
- best k?
- variations: distance measures
1 choose k, the number of desired clusters
2 randomly select k cluster centroids
3 assign all instances to nearest cluster centroid
4 calculate means for attributes of instances in one cluster
5 set means as new centroids
6 assign all instances to the nearest centroid again
if 6 and 3 give the same result: convergence! else go to 4
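The numbered steps can be sketched in plain Python. This is a minimal illustration, assuming squared Euclidean distance, instances given as tuples, and a cap on iterations. The helper names, the seed and the sample points are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(instances, centroids):
    """Steps 3 and 6: index of the nearest centroid for every instance."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in instances
    ]


def k_means(instances, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    centroids = rng.sample(instances, k)           # step 2: random initial centroids
    labels = assign(instances, centroids)          # step 3
    for _ in range(max_iterations):
        for c in range(k):                         # steps 4 and 5
            members = [p for p, l in zip(instances, labels) if l == c]
            if members:                            # an empty cluster keeps its centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        new_labels = assign(instances, centroids)  # step 6
        if new_labels == labels:                   # same assignment: convergence
            break
        labels = new_labels
    return centroids, labels


# Hypothetical data: two well-separated groups of three instances each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Which cluster gets index 0 depends on the random initial centroids, so only the grouping is meaningful, not the label numbers.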
Cluster Evaluation Criteria
- Elbow (compute gain in explained variance per increase of k; plot and look for elbow)
- Dunn (inter-cluster distance vs intra-cluster distance)
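Both criteria can be computed from a finished clustering. The sketch below assumes Euclidean distance, uses the common Dunn definition (smallest distance between instances of different clusters divided by the largest cluster diameter), and uses the within-cluster sum of squares as the unexplained variance for the elbow plot. The function names and the sample clusters are hypothetical.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Higher is better: clusters are far apart and compact.

    Assumes at least two clusters and at least one cluster with two instances.
    """
    inter = min(
        dist(p, q) for a, b in combinations(clusters, 2) for p in a for q in b
    )
    intra = max(dist(p, q) for c in clusters for p, q in combinations(c, 2))
    return inter / intra


def within_cluster_sum_of_squares(clusters):
    """Unexplained variance; plot it against k and look for the elbow."""
    total = 0.0
    for c in clusters:
        centroid = tuple(sum(v) / len(c) for v in zip(*c))
        total += sum(dist(p, centroid) ** 2 for p in c)
    return total


# Hypothetical result for k = 2, compared with putting everything in one cluster.
two_clusters = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
one_cluster = [two_clusters[0] + two_clusters[1]]

dunn = dunn_index(two_clusters)
wcss_k1 = within_cluster_sum_of_squares(one_cluster)
wcss_k2 = within_cluster_sum_of_squares(two_clusters)
print(dunn, wcss_k1, wcss_k2)
```

A large drop from `wcss_k1` to `wcss_k2` followed by only small drops for higher k would mark k = 2 as the elbow.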
Before and After Clustering
before: pre-process data
- relevant attributes
- dependencies
- normalize and weight values
cluster:
- various number of clusters
- different initial seeds
- different distance measures and algorithms
process results:
- label data
- visualize clusters
- predict cluster membership
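The pre-processing step of normalizing and weighting values can be sketched as below, assuming min-max scaling to the range 0 to 1 followed by a per-attribute weight. The scaling method, the weights and the function name are assumptions, since the flashcards do not specify them.

```python
def normalize_and_weight(instances, weights):
    """Scale every attribute to [0, 1] (min-max), then multiply by its weight."""
    columns = list(zip(*instances))
    lows = [min(column) for column in columns]
    # A constant attribute has span 0; use 1 to avoid dividing by zero.
    spans = [(max(column) - min(column)) or 1 for column in columns]
    return [
        tuple(w * (v - lo) / span for v, lo, span, w in zip(p, lows, spans, weights))
        for p in instances
    ]


# Hypothetical instances with two attributes on very different scales;
# the second attribute is weighted twice as strongly as the first.
scaled = normalize_and_weight([(0, 100), (5, 200), (10, 300)], weights=(1, 2))
print(scaled)
```

Without this step, the attribute with the largest numeric range would dominate every distance calculation.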
Association Rules
Derive rules between instance attributes based on sufficient observations, e.g. for market basket analyses
no a priori assumptions needed. evaluate via confidence, support, lift, expected confidence
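The four measures can be sketched for a rule A -> B using their usual definitions: support is the share of transactions containing A and B, confidence is support(A and B) / support(A), expected confidence is support(B), and lift is confidence / expected confidence. The transactions below are made up for illustration.

```python
# Hypothetical market baskets, one set of articles per transaction.
transactions = [
    {"shoes", "socks"},
    {"shoes", "socks", "laces"},
    {"shoes"},
    {"socks"},
    {"laces"},
]


def support(items):
    """Share of transactions that contain all the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent):
    """Support, confidence, expected confidence and lift for antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent))
    confidence = both / support(antecedent)
    expected_confidence = support(consequent)
    return {
        "support": both,
        "confidence": confidence,
        "expected_confidence": expected_confidence,
        "lift": confidence / expected_confidence,
    }


metrics = rule_metrics({"shoes"}, {"socks"})
print(metrics)
```

A lift above 1 means the consequent is bought more often together with the antecedent than it is overall.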
Market Basket Analyses
transaction data logged at checkout
record buy instance, branch, article, quantity, date
lots of data
what articles are bought together?
- e.g. shoes and socks