4 - Unsupervised learning: clustering and association rules Flashcards
What is data mining?
- methods to extract information from large data sets using algorithms
- considers data where each (denormalized) entry redundantly describes all attributes of an instance
What is an algorithm?
A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer
Some applications of data mining
CRM
- profiles, cross-selling, bundling
Web Optimisation
- click paths, search
Traffic forecasts
Fraud detection
Product Placement
Data mining methods
- data pre-processing, description and reduction
- search for explanations through regression
- segmentation through cluster analysis
- reveal relationships through visualization, correlation, association rule learning
- classification, discriminant analysis
- anomaly recognition
- prognosis
Data Mining Methods
Components from …
traditional statistics and data analysis
- e.g. regression, clustering, factor analysis
artificial intelligence
- e.g. machine learning, artificial neural networks, evolutionary algorithms
pattern recognition
- e.g. artificial neurons as identifiers
database theory and practice
- e.g. association analysis, OLAP macros
What is an attribute?
provides information on one specific characteristic of each instance
e.g. the column stating the patient's name for each recorded visit
What is an instance?
One specific entry in the data set
e.g. the row describing one patient's specific visit
What is a data set?
All the data that you plan to analyse
e.g. all patients who visited this hospital during the last 5 years
Denormalised: one long table
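A minimal sketch of how attribute, instance and data set relate in one denormalised table. The field names and all values are invented for illustration only; they are not taken from any real hospital data.

```python
# Hypothetical visit records: every row repeats all attributes of one visit,
# so the patient's name appears redundantly (denormalised, one long table).
data_set = [
    {"patient_name": "A. Jones", "visit_date": "2021-03-01", "ward": "Cardiology"},
    {"patient_name": "A. Jones", "visit_date": "2021-06-12", "ward": "Radiology"},
    {"patient_name": "B. Smith", "visit_date": "2021-03-02", "ward": "Cardiology"},
]

# An instance is one row: one patient's specific visit.
instance = data_set[0]

# An attribute is one column: the same characteristic across all instances.
patient_names = [row["patient_name"] for row in data_set]
```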
Evaluating Data Mining
To evaluate the outcome of data mining efforts, we usually split the data set into three parts:
Training set:
- Use an algorithm to generate a model from the data
Validation set:
- Validate and improve the model
Test set:
- Test the applicability and performance of the model
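The three-way split can be sketched as follows. The 60/20/20 proportions, the fixed seed and the function name split_data_set are assumptions made for this example; the card does not prescribe any particular ratio.

```python
import random


def split_data_set(instances, train=0.6, validation=0.2, seed=0):
    """Shuffle the instances and cut them into training, validation and test sets.

    The proportions are illustrative defaults; whatever remains after the
    training and validation parts becomes the test set.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]
    return training_set, validation_set, test_set


training_set, validation_set, test_set = split_data_set(range(10))
```

Each instance ends up in exactly one of the three parts, so the test set stays unseen while the model is generated and improved.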
Supervised learning
For training, validation and test sets, the solution is known a priori
e.g. classification
Unsupervised learning
No solution (e.g. labels) is known a priori (yet); structure has to be discovered in the data
e.g. clustering, association rules
What is clustering and when is it applied?
Unsupervised segmentation of an instance set based on attributes
Applied when instance segments are not known a priori, but a segmentation is expected to exist
Clustering
Conditions
- instances do belong to different segments
- instances are described by multi-dimensional attributes
- attribute values can be quantified, normalized or weighted
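To illustrate the last condition, here is one possible way to normalise and weight attribute values before comparing instances. Min-max scaling and a weighted Euclidean distance are assumed choices for this sketch, and the attribute names and numbers are made up.

```python
def min_max_normalise(values):
    """Rescale a list of numbers to the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute carries no information
    return [(v - lo) / (hi - lo) for v in values]


def weighted_distance(a, b, weights):
    """Euclidean distance where each attribute contributes according to its weight."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)) ** 0.5


# Hypothetical attributes on very different scales.
ages = [20, 30, 40]
incomes = [1000, 3000, 5000]

norm_ages = min_max_normalise(ages)
norm_incomes = min_max_normalise(incomes)

# After normalisation both attributes lie in 0..1 and can be combined.
instances = list(zip(norm_ages, norm_incomes))
d = weighted_distance(instances[0], instances[2], weights=(1.0, 1.0))
```

Without normalisation the income attribute would dominate any distance simply because its numbers are larger.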
Clustering
Unique
Every instance belongs to exactly one cluster
e.g. bank note identification
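A small sketch of unique cluster membership, assuming clusters are represented by centre points and each instance is assigned to the nearest one. This nearest-centroid rule is only one common way to obtain unique clusters; the coordinates are invented.

```python
def nearest_cluster(point, centroids):
    """Return the index of the single closest centroid (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid in centroids
    ]
    return distances.index(min(distances))


# Hypothetical cluster centres and instances with two attributes each.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.5)]

# Every instance receives exactly one cluster label.
labels = [nearest_cluster(p, centroids) for p in points]
```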
Clustering
Overlapping
One instance can belong to several clusters
e.g. expert profiles
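A small sketch of overlapping membership, assuming an instance joins every cluster whose centre lies within a chosen radius. The radius rule, the function name overlapping_clusters and the coordinates are assumptions for illustration, not a specific published algorithm.

```python
def overlapping_clusters(point, centroids, radius):
    """Return the indices of all clusters whose centre is within radius of the point."""
    memberships = []
    for index, centroid in enumerate(centroids):
        distance = sum((p - c) ** 2 for p, c in zip(point, centroid)) ** 0.5
        if distance <= radius:
            memberships.append(index)
    return memberships


# Hypothetical centres of two topic clusters for expert profiles.
centroids = [(0.0, 0.0), (4.0, 0.0)]

generalist = overlapping_clusters((2.0, 0.0), centroids, radius=2.5)
specialist = overlapping_clusters((0.0, 0.0), centroids, radius=2.5)
```

The first profile sits between both centres and belongs to both clusters, while the second belongs to only one.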