4 - Unsupervised learning: clustering and association rules Flashcards

1
Q

What is data mining?

A
  • methods to extract information from large data sets using algorithms
  • considers data where each (denormalized) entry redundantly describes all attributes of an instance
2
Q

What is an algorithm?

A

A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer

3
Q

Some applications of data mining

A

CRM
- profiles, cross-selling, bundling

Web optimisation
- click paths, search

Traffic forecasts

Fraud detection

Product placement

4
Q

Data mining methods

A
  • data pre-processing, description and reduction
  • search for explanations through regression
  • segmentation through cluster analysis
  • reveal relationships through visualization, correlation, association rule learning
  • classification, discriminant analysis
  • anomaly recognition
  • prognosis
5
Q

Data Mining Methods

Components from …

A

traditional statistics and data analysis
- e.g. regression, clustering, factor analysis

artificial intelligence
- e.g. machine learning, artificial neural networks, evolutionary algorithms

pattern recognition
- e.g. artificial neurons as identifiers

database theory and practice
- e.g. association analysis, OLAP macros

6
Q

What is an attribute?

A

provides information on one specific characteristic of each instance

e.g. the column stating the patient’s name for each recorded visit

7
Q

What is an instance?

A

One specific entry in the data set

e.g. the row describing one patient’s specific visit

8
Q

What is a data set?

A

All the data that you plan to analyse

e.g. all patients who visited this hospital during the last 5 years

Denormalised: one long table

9
Q

Evaluating Data Mining

A

To evaluate the outcome of data mining efforts, we usually split the data set into three parts:

Training set:
- Use an algorithm to generate a model from the data

Validation set:
- Validate and improve the model

Test set:
- Test the applicability and performance of the model
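
A minimal sketch of one way to make such a three-way split, applying scikit-learn's train_test_split twice (the data and the 60/20/20 ratio are placeholder assumptions):

```python
# Hypothetical 60/20/20 split into training, validation and test sets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)  # placeholder data set: 100 instances, 4 attributes
X_train, X_rest = train_test_split(X, test_size=0.4, random_state=0)   # 60 % training
X_val, X_test = train_test_split(X_rest, test_size=0.5, random_state=0)  # 20 % / 20 %
```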

10
Q

Supervised learning

A

For training, validation and test sets, the solution is known a priori

e.g. classification

11
Q

Unsupervised learning

A

There is no structured solution (yet)

e.g. clustering, association rules

12
Q

What is clustering and when is it applied?

A

Unsupervised segmentation of an instance set based on attributes

It is applied when instance segments are not known a priori, but segmentation is expected to be possible

13
Q

Clustering

Conditions

A
  • instances do belong to different segments
  • instances are described by multi-dimensional attributes
  • attribute values can be quantified, normalized or weighted
14
Q

Clustering

Unique

A

Every instance belongs to exactly one cluster

e.g. bank note identification

15
Q

Clustering

Overlapping

A

One instance can belong to several clusters

e.g. expert profiles

16
Q

Clustering

Probabilistic

A

Each instance has a certain probability of belonging to each cluster

e.g. customer segments

17
Q

What is hierarchical clustering?

A

Clusters on lower aggregation levels are part of clusters on higher aggregation levels

e.g. biological classification by family, genus, species

18
Q

Hierarchical clustering

Agglomerative clustering

A

Initially, every instance represents an individual cluster

-> larger clusters emerge through agglomeration
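
A minimal sketch of agglomerative clustering with SciPy (placeholder data; the choice of Ward linkage and three clusters is an assumption):

```python
# Bottom-up hierarchical clustering: each instance starts as its own
# cluster, and clusters are merged until one remains.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # placeholder: 20 instances, 2 attributes
Z = linkage(X, method="ward")                    # full merge history (dendrogram data)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the hierarchy into 3 clusters
```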

19
Q

Hierarchical clustering

Divisive clustering

A

Initially, the whole instance set is a single cluster

-> smaller clusters emerge through division

20
Q

Offline vs. online clustering

A

Offline:
- a complete set of instances is provided before clustering starts

Online:
- instances are added iteratively, clustering can be started when at least three instances are available

21
Q

Offline vs. online clustering

Online additional info

A
  • the human brain continuously clusters “online”
  • growing data sets, e.g. describing customer choices in online retailing, have to be clustered online
  • repeatedly running offline algorithms on growing data sets is computationally costly
  • streaming clustering processes only the latest n instances
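
A minimal online-style sketch with scikit-learn's MiniBatchKMeans, which updates centroids incrementally as batches arrive (data and batch sizes are placeholder assumptions):

```python
# Online-style clustering: feed instances in small batches via partial_fit.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, random_state=0)
for _ in range(10):                # e.g. ten batches arriving over time
    batch = np.random.rand(50, 2)  # placeholder batch of 50 new instances
    model.partial_fit(batch)       # update the centroids with this batch only
```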
22
Q

k-means clustering algorithm

Attributes

A

unique, flat, offline

23
Q

k-means clustering algorithm

Steps

A
  1. Let k be the number of desired clusters
  2. Randomly select k cluster centroids
  3. Assign all instances to the nearest cluster centroid
  4. Calculate means for the attributes of instances in one cluster
  5. Set means as new cluster centroids
  6. Assign all instances to the nearest cluster centroid
  7. If 6. leads to the same result as 3.: successful exit
    Else: go to 4 and calculate new means
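
A from-scratch sketch of these steps (NumPy only; assumes numeric data and that no cluster becomes empty during the iterations):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: pick k random instances as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3/6: assign every instance to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4/5: new centroids are the per-cluster attribute means
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 7: exit once the centroids (and hence assignments) are stable
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = k_means(np.random.rand(100, 2), k=3)
```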
24
Q

k-means clustering algorithm

Improvements, variations, problem

A

Improvements:
Do not choose initial centroids purely at random, but according to distance variance: k-means++

Variations:
Diverse distance measures depending on attribute types

Problem:
What is the best k?

25
Q

Cluster evaluation criteria

Which different indicators are there?

A
Elbow
S_Dbw
Silhouette
Calinski-Harabasz
Dunn
26
Q

Cluster evaluation criteria

Elbow

How to:
Pro:
Contra:

A

How to:
- Compute gain in explained variance per increase of k, plot and look for “elbow”

Pro:

  • accessible input data
  • robust against noise in the data

Contra:
- does not consider special data set features
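
A minimal sketch of the elbow heuristic with scikit-learn, plotting the within-cluster sum of squares (inertia) against k (placeholder data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # placeholder data set
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
plt.plot(ks, inertias, marker="o")  # look for the kink ("elbow") in this curve
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```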

27
Q

Cluster evaluation criteria

S_Dbw

How to:
Pro:
Contra:

A

How to:
- measures dispersion within clusters and density between clusters

Pro:
- robust against most data defects

Contra:
- cannot be computed for some parametrizations

28
Q

Cluster evaluation criteria

Silhouette

How to:
Pro:
Contra:

A

How to:
- measures how similar each data point is to other data points in the same cluster and in other clusters

Pro:
- common and well-grounded in research

Contra:

  • not robust against sub-clusters in the data
  • complex input parameters
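
A minimal sketch computing the silhouette with scikit-learn (placeholder data; scores near 1 mean instances sit well inside their own cluster):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)  # placeholder data set
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # mean silhouette over all instances
```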
29
Q

Cluster evaluation criteria

Calinski-Harabasz

How to:
Pro:
Contra:

A

How to:
- weighted ratio of the between-cluster sum-of-squares and the within-cluster sum-of-squares

Pro:

  • accessible input data
  • robust against noise in the data
  • weights based on dimensions of the data

Contra:
- misleading for skewed data
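
The same pattern works for the Calinski-Harabasz index via scikit-learn (higher values indicate better-separated clusters; placeholder data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(200, 2)  # placeholder data set
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(calinski_harabasz_score(X, labels))  # between- vs. within-cluster variance ratio
```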

30
Q

Cluster evaluation criteria

Dunn

How to:
Pro:
Contra:

A

How to:
- measures the inter-cluster distance (should be large) and the intra-cluster distance (should be small)

Pro:
- measure for separation and scattering of data

Contra:

  • computationally inefficient for large and noisy data sets
  • not robust against skewed data distributions
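
scikit-learn ships no Dunn index; a small NumPy/SciPy sketch of the definition above (hypothetical helper; the pairwise distances make it costly on large data sets, as noted above):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    # smallest inter-cluster distance divided by largest intra-cluster
    # diameter; assumes at least two clusters
    clusters = [X[labels == c] for c in np.unique(labels)]
    min_between = min(cdist(a, b).min()
                      for i, a in enumerate(clusters) for b in clusters[i + 1:])
    max_within = max(cdist(c, c).max() for c in clusters)
    return min_between / max_within
```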
31
Q

Clustering: before and after

A
  1. Pre-Process data:
    - select relevant attributes
    - test for dependencies
    - normalize and weight attribute values
  2. Cluster
    - try out various numbers of clusters
    - consider different initial seeds
    - consider different distance measures and algorithms
  3. Process results
    - label data
    - visualize clusters
    - predict cluster membership
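
A minimal end-to-end sketch of this workflow, normalize, cluster, label, predict (all data and parameter choices are placeholder assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.random.rand(200, 3)        # placeholder data set
scaler = StandardScaler().fit(X)  # 1. normalize attribute values
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(scaler.transform(X))  # 2. cluster
labels = model.labels_            # 3. label the data
new_label = model.predict(scaler.transform([[0.5, 0.5, 0.5]]))  # predict membership of a new instance
```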
32
Q

Association rule learning

A
  • derives rules between instance attributes based on sufficient observations, e.g. for market basket analyses
  • no a priori assumptions needed - use rules that have the best evaluation: confidence, support, lift, expected confidence
33
Q

Association rule

Definition

A

A->B includes a set of attribute values A in the antecedent and a set of attribute values B in the consequent, where A and B are disjoint

  • Association rules are if-then conditions that are probability-based rather than logic-based
34
Q

Support

Definition

A

Share of instances in the data set that fulfill the rule

  • measures the relevance of a rule
35
Q

Confidence

Definition

A

Share of instances that include attribute values A and B, relative to the set of instances that fulfill the antecedent A

  • measures the accuracy of a rule
36
Q

Lift

Definition

A

Ratio of the observed support to that expected if A and B were independent
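
A worked toy example computing all three measures for a hypothetical rule {milk} -> {bread} over five market baskets:

```python
# Toy market-basket data (hypothetical) and the rule {milk} -> {bread}.
baskets = [{"milk", "bread"}, {"milk", "bread", "butter"},
           {"milk"}, {"bread"}, {"butter"}]
n = len(baskets)

support_A  = sum("milk" in b for b in baskets) / n             # 3/5
support_B  = sum("bread" in b for b in baskets) / n            # 3/5
support_AB = sum({"milk", "bread"} <= b for b in baskets) / n  # 2/5 = support of the rule
confidence = support_AB / support_A                            # (2/5)/(3/5) = 2/3
lift       = support_AB / (support_A * support_B)              # (2/5)/(9/25) = 10/9
print(support_AB, confidence, lift)
```

A lift above 1 means A and B co-occur more often than expected if they were independent.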

37
Q

Conclusion: Association Rule Learning

A
Rules can describe combinations of attribute values
-> browsed websites
-> bought items
-> typed keywords
Simple, easy-to-understand rules

Challenge: high number of potential rules
Solution: filtering by
- support and confidence
- templates (“rule has to include [milk]”)
- visualization: support diagrams, rule graphs
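
One practical way to mine and filter such rules is the third-party mlxtend library; a minimal sketch (items and thresholds are placeholder assumptions):

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded baskets: one row per basket, one boolean column per item.
df = pd.DataFrame({"milk":   [1, 1, 1, 0, 0],
                   "bread":  [1, 1, 0, 1, 0],
                   "butter": [0, 1, 0, 0, 1]}).astype(bool)

frequent = apriori(df, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
# Template-style filter, e.g. keep only rules whose antecedent includes milk:
rules = rules[rules["antecedents"].apply(lambda s: "milk" in s)]
```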