4 - Unsupervised learning: clustering and association rules
What is data mining?
- methods to extract information from large data sets using algorithms
- considers data where each (denormalized) entry redundantly describes all attributes of an instance
What is an algorithm?
A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer
Some applications of data mining
CRM
- profiles, cross-selling, bundling
Web Optimisation
- click paths, search
Traffic forecasts
Fraud detection
Product Placement
Data mining methods
- data pre-processing, description and reduction
- search for explanations through regression
- segmentation through cluster analysis
- reveal relationships through visualization, correlation, association rule learning
- classification, discriminant analysis
- anomaly detection
- prognosis
Data Mining Methods
Components from …
traditional statistics and data analysis
- e.g. regression, clustering, factor analysis
artificial intelligence
- e.g. machine learning, artificial neural networks, evolutionary algorithms
pattern recognition
- e.g. artificial neurons as identifiers
database theory and practice
- e.g. association analysis, OLAP macros
What is an attribute?
provides information on one specific characteristic of each instance
e.g. the column stating the patient's name for each recorded visit
What is an instance?
One specific entry in the data set
e.g. the row describing one patient's specific visit
What is a data set?
All the data that you plan to analyse
e.g. all patients who visited this hospital during the last 5 years
Denormalized: one long table
Evaluating Data Mining
To evaluate the outcome of data mining efforts, we usually split the data set into three parts:
Training set:
- Use an algorithm to generate a model from the data
Validation set:
- Validate and improve the model
Test set:
- Test the applicability and performance of the model
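As a minimal sketch (the 60/20/20 ratios and the toy data are assumptions, not prescribed by the cards), such a three-way split can be produced by applying scikit-learn's train_test_split twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 instances with 5 attributes and a known label each.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Split off 40% first, then halve it into validation and test sets (60/20/20).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
```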
Supervised learning
For training, validation and test sets, the solution is known a priori
e.g. classification
Unsupervised learning
There is no structured solution (yet)
e.g. clustering, association rules
What is clustering and when is it applied?
Unsupervised segmentation of an instance set based on attributes
Applied when instance segments are not known a priori, but segmentation is expected to be possible
Clustering
Conditions
- instances do belong to different segments
- instances are described by multi-dimensional attributes
- attribute values can be quantified, normalized or weighted
Clustering
Unique
Every instance belongs to exactly one cluster
e.g. bank note identification
Clustering
Overlapping
One instance can belong to several clusters
e.g. expert profiles
Clustering
Probabilistic
Each instance has a certain probability of belonging to each cluster
e.g. customer segments
What is hierarchical clustering?
Clusters on lower aggregation levels are part of clusters on higher aggregation levels
e.g. biological classification by family, genus, species
Hierarchical clustering
Agglomerative clustering
Initially, every instance represents an individual cluster
-> larger clusters emerge through agglomeration
Hierarchical clustering
Divisive clustering
Initially, the whole instance set is a single cluster
-> smaller clusters emerge through division
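A short sketch of agglomerative hierarchical clustering with SciPy; the toy data, the ward linkage and the cut into three flat clusters are assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)  # toy data

# Agglomerative: every instance starts as its own cluster and the two
# closest clusters are merged repeatedly ("ward" merges so that the
# within-cluster variance grows as little as possible).
Z = linkage(X, method="ward")

# Cut the hierarchy to obtain, e.g., 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```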
Offline vs. online clustering
Offline:
- a complete set of instances is provided before clustering starts
Online:
- instances are added iteratively; clustering can start once at least three instances are available
Offline vs. online clustering
Online additional info
- the human brain continuously clusters “online”
- growing data sets, e.g. describing customer choices in online retailing, have to be clustered online
- repeating offline algorithms on growing data sets is computationally costly
- streaming clustering processes only the latest n instances
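A naive sketch of the "latest n instances" idea: keep a sliding window and re-cluster it on every arrival. The window size, k and the per-arrival re-fit are assumptions; a production system would use an incremental algorithm instead:

```python
from collections import deque

import numpy as np
from sklearn.cluster import KMeans

window = deque(maxlen=50)  # keep only the latest n = 50 instances

def add_instance(x, k=3):
    """Naive streaming clustering: re-cluster the sliding window on each arrival."""
    window.append(x)
    if len(window) >= k:  # need at least k instances before clustering can start
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(np.array(window))
    return None
```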
k-Means clustering Algorithm
Attributes
unique, flat, offline
k-Means clustering algorithm
Steps
1. Let k be the number of desired clusters
2. Randomly select k cluster centroids
3. Assign all instances to the nearest cluster centroid
4. Calculate means for the attributes of the instances in each cluster
5. Set the means as new cluster centroids
6. Assign all instances to the nearest cluster centroid
7. If step 6 leads to the same result as step 3: successful exit; else: go to step 4 and calculate new means
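A minimal NumPy sketch of these steps, assuming numeric attributes and Euclidean distance (the empty-cluster edge case is not handled):

```python
import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k instances as initial cluster centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Steps 3/6: assign every instance to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4/5: per-cluster attribute means become the new centroids.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 7: stop once centroids no longer move (i.e. the assignments repeat).
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids

labels, centroids = k_means(np.random.rand(100, 2), k=3)
```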
k-Means Clustering algorithm
Improvements, variations, problem
Improvements:
Do not choose initial centroids purely at random, but spread them out proportionally to their squared distance from already chosen centroids: k-means++
Variations:
Diverse distance measures depending on attribute types
Problem:
What is the best k?
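A brief scikit-learn sketch of k-means++ seeding (the toy data and k=3 are assumptions; in scikit-learn, k-means++ is the default initialization):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)  # toy data

# k-means++ seeding spreads the initial centroids out instead of
# drawing them uniformly at random.
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)
print(model.cluster_centers_)
```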
Cluster evaluation criteria
Which different indicators are there?
Elbow, S_Dbw, Silhouette, Calinski-Harabasz, Dunn
Cluster evaluation criteria
Elbow
How to:
- Compute gain in explained variance per increase of k, plot and look for “elbow”
Pro:
- accessible input data
- robust against noise in the data
Contra:
- does not consider special data set features
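A minimal sketch of the elbow heuristic using scikit-learn's inertia_ (the within-cluster sum of squares) as the variance measure; the toy data and the k range are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)  # toy data; real attributes would be normalized first

# Inertia drops as k grows; plot it and look for the "elbow" where
# additional clusters stop explaining much extra variance.
for k in range(1, 11):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 2))
```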
Cluster evaluation criteria
S_Dbw
How to:
- measures dispersion within clusters and density between clusters
Pro:
- robust against most data defects
Contra:
- cannot be computed for some parametrizations
Cluster evaluation criteria
Silhouette
How to:
- measures how similar each data point is to other data points in the same cluster and in other clusters
Pro:
- common and well-grounded in research
Contra:
- not robust against sub clusters in data
- complex input parameters
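A short sketch with scikit-learn's silhouette_score (toy data and k=3 are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 4)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all instances: close to +1 = well separated,
# around 0 = on a cluster border, negative = probably misassigned.
print(silhouette_score(X, labels))
```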
Cluster evaluation criteria
Calinski-Harabasz
How to:
- weighted ratio of the between-cluster sum-of-squares and the within-cluster sum-of-squares
Pro:
- accessible input data
- robust against noise in the data
- weights based on dimensions of the data
Contra:
- misleading for skewed data
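A short sketch with scikit-learn's calinski_harabasz_score (toy data and k=3 are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(200, 4)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Higher scores indicate denser, better-separated clusters.
print(calinski_harabasz_score(X, labels))
```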
Cluster evaluation criteria
Dunn
How to:
- measures the inter-cluster distance (should be large) and the intra-cluster distance (should be small)
Pro:
- measure for separation and scattering of data
Contra:
- computationally inefficient for large and noisy data sets
- not robust against skewed data distributions
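The Dunn index has no scikit-learn implementation; a naive sketch assuming Euclidean distance (its O(n²) pairwise distances also illustrate why the index is costly on large data sets; the placeholder labels are an assumption):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Smallest distance between instances of different clusters (separation, large is good).
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters) for b in clusters[i + 1:])
    # Largest distance within any single cluster (diameter, small is good).
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_sep / max_diam

X = np.random.rand(100, 2)
labels = np.random.randint(0, 3, size=100)  # placeholder labels for illustration
print(dunn_index(X, labels))
```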
Clustering: before and after
Pre-process data:
- select relevant attributes
- test for dependencies
- normalize and weight attribute values
Cluster:
- try out various numbers of clusters
- consider different initial seeds
- consider different distance measures and algorithms
Process results:
- label data
- visualize clusters
- predict cluster adherence
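A minimal end-to-end sketch of this pipeline; the toy data, the standard scaler and k=3 are assumptions standing in for the choices above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)  # toy data standing in for selected, relevant attributes

# Pre-process: normalize attribute values so no attribute dominates the distance.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Cluster: in practice, try several values of k, seeds and distance measures.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Process results: label the data and predict cluster adherence of new instances.
labels = model.labels_
new_instance = np.random.rand(1, 4)
print(model.predict(scaler.transform(new_instance)))
```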
Association rule learning
- derives rules between instance attributes based on sufficient observations, e.g. for market basket analyses
- no a priori assumptions needed
- use rules that have the best evaluation: confidence, support, lift, expected confidence
Association rule
Definition
A->B includes a set of attribute values A in the antecedent and a set of attribute values B in the consequent, where A and B are disjoint
- Association rules are if-then conditions that are probability- rather than logic-based
Support
Definition
Share of instances in the data set that fulfill the rule
- measures the relevance of a rule
Confidence
Definition
Share of instances that include attribute values A and B, relative to the set of instances that fulfill the antecedent A
- measures the accuracy of a rule
Lift
Definition
Ratio of the observed support to that expected if A and B were independent
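A plain-Python sketch of the three measures; the baskets are made-up data and {milk} -> {bread} is a hypothetical rule chosen for illustration:

```python
# Toy market baskets illustrating support, confidence and lift
# for the rule {milk} -> {bread}.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]

A, B = {"milk"}, {"bread"}
n = len(baskets)

support_A = sum(A <= b for b in baskets) / n          # share containing the antecedent
support_B = sum(B <= b for b in baskets) / n          # share containing the consequent
support_AB = sum((A | B) <= b for b in baskets) / n   # share fulfilling the whole rule

confidence = support_AB / support_A                   # accuracy of the rule
lift = support_AB / (support_A * support_B)           # observed vs. independence

print(f"support={support_AB:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```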
Conclusion: Association Rule Learning
Rules can describe combinations of attribute values:
- browsed websites
- bought items
- typed keywords
Simple, easy-to-understand rules
Challenge: high number of potential rules
Solution: filtering, by
- support and confidence
- templates (“rule has to include [milk]”)
- visualization: support diagrams, rule graphs