4 - Unsupervised learning: clustering and association rules
What is data mining?
- methods to extract information from large data sets using algorithms
- considers data where each (denormalized) entry redundantly describes all attributes of an instance
What is an algorithm?
A process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer
Some applications of data mining
CRM
- profiles, cross-selling, bundling
Web Optimisation
- click paths, search
Traffic forecasts
Fraud detection
Product Placement
Data mining methods
- data pre-processing, description and reduction
- search for explanations through regression
- segmentation through cluster analysis
- reveal relationships through visualization, correlation, association rule learning
- classification, discriminant analysis
- anomaly detection
- prognosis
Data Mining Methods
Components from …
traditional statistics and data analysis
- e.g. regression, clustering, factor analysis
artificial intelligence
- e.g. machine learning, artificial neural networks, evolutionary algorithms
pattern recognition
- e.g. artificial neurons as identifiers
database theory and practice
- e.g. association analysis, OLAP macros
What is an attribute?
provides information on one specific characteristic of each instance
e.g. the column stating the patient's name for each recorded visit
What is an instance?
One specific entry in the data set
e.g. the row describing one patient's specific visit
What is a data set?
All the data that you plan to analyse
e.g. all patients who visited this hospital during the last 5 years
Denormalized: one long table
Evaluating Data Mining
To evaluate the outcome of data mining efforts, we usually split the data set into three parts:
Training set:
- Use an algorithm to generate a model from the data
Validation set:
- Validate and improve the model
Test set:
- Test the applicability and performance of the model
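As a minimal sketch (the 60/20/20 ratios and the toy data are assumptions, not prescribed by the cards), such a three-way split can be produced by applying scikit-learn's train_test_split twice:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 instances with 5 attributes and a known label each.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Split off 40% first, then halve it into validation and test sets (60/20/20).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
```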
Supervised learning
For training, validation and test sets, the solution is known a priori
e.g. classification
Unsupervised learning
There is no structured solution (yet)
e.g. clustering, association rules
What is clustering and when is it applied?
Unsupervised segmentation of an instance set based on attributes
Applied when instance segments are not known a priori, but segmentation is expected to be possible
Clustering
Conditions
- instances do belong to different segments
- instances are described by multi-dimensional attributes
- attribute values can be quantified, normalized or weighted
Clustering
Unique
Every instance belongs to exactly one cluster
e.g. bank note identification
Clustering
Overlapping
One instance can belong to several clusters
e.g. expert profiles
Clustering
Probabilistic
Each instance has a certain probability of belonging to each cluster
e.g. customer segments
What is hierarchical clustering?
Clusters on lower aggregation levels are part of clusters on higher aggregation levels
e.g. biological classification by family, genus, species
Hierarchical clustering
Agglomerative clustering
Initially, every instance represents an individual cluster
-> larger clusters emerge through agglomeration
Hierarchical clustering
Divisive clustering
Initially, the whole instance set is a single cluster
-> smaller clusters emerge through division
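A short sketch of agglomerative hierarchical clustering with SciPy; the toy data, the ward linkage and the cut into three flat clusters are assumptions for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)  # toy data

# Agglomerative: every instance starts as its own cluster and the two
# closest clusters are merged repeatedly ("ward" merges so that the
# within-cluster variance grows as little as possible).
Z = linkage(X, method="ward")

# Cut the hierarchy to obtain, e.g., 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```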
Offline vs. online clustering
Offline:
- a complete set of instances is provided before clustering starts
Online:
- instances are added iteratively; clustering can start once at least three instances are available
Offline vs. online clustering
Online additional info
- the human brain continuously clusters “online”
- growing data sets, e.g. describing customer choices in online retailing, have to be clustered online
- repeating offline algorithms on growing data sets is computationally costly
- streaming clustering processes only the latest n instances
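A naive sketch of the "latest n instances" idea: keep a sliding window and re-cluster it on every arrival. The window size, k and the per-arrival re-fit are assumptions; a production system would use an incremental algorithm instead:

```python
from collections import deque

import numpy as np
from sklearn.cluster import KMeans

window = deque(maxlen=50)  # keep only the latest n = 50 instances

def add_instance(x, k=3):
    """Naive streaming clustering: re-cluster the sliding window on each arrival."""
    window.append(x)
    if len(window) >= k:  # need at least k instances before clustering can start
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(np.array(window))
    return None
```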
k-Means clustering Algorithm
Attributes
unique, flat, offline
k-Means clustering algorithm
Steps
1. Let k be the number of desired clusters
2. Randomly select k cluster centroids
3. Assign all instances to the nearest cluster centroid
4. Calculate means for the attributes of the instances in each cluster
5. Set the means as new cluster centroids
6. Assign all instances to the nearest cluster centroid
7. If step 6 leads to the same result as step 3: successful exit; else: go to step 4 and calculate new means
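A minimal NumPy sketch of these steps, assuming numeric attributes and Euclidean distance (the empty-cluster edge case is not handled):

```python
import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k instances as initial cluster centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Steps 3/6: assign every instance to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 4/5: per-cluster attribute means become the new centroids.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 7: stop once centroids no longer move (i.e. the assignments repeat).
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids

labels, centroids = k_means(np.random.rand(100, 2), k=3)
```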
k-Means Clustering algorithm
Improvements, variations, problem
Improvements:
Do not choose initial centroids purely at random, but spread them out proportionally to their squared distance from already chosen centroids: k-means++
Variations:
Diverse distance measures depending on attribute types
Problem:
What is the best k?
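A brief scikit-learn sketch of k-means++ seeding (the toy data and k=3 are assumptions; in scikit-learn, k-means++ is the default initialization):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)  # toy data

# k-means++ seeding spreads the initial centroids out instead of
# drawing them uniformly at random.
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)
print(model.cluster_centers_)
```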
Cluster evaluation criteria
Which different indicators are there?
Elbow, S_Dbw, Silhouette, Calinski-Harabasz, Dunn
Cluster evaluation criteria
Elbow
How to:
- Compute gain in explained variance per increase of k, plot and look for “elbow”
Pro:
- accessible input data
- robust against noise in the data
Contra:
- does not consider special data set features
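A minimal sketch of the elbow heuristic using scikit-learn's inertia_ (the within-cluster sum of squares) as the variance measure; the toy data and the k range are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)  # toy data; real attributes would be normalized first

# Inertia drops as k grows; plot it and look for the "elbow" where
# additional clusters stop explaining much extra variance.
for k in range(1, 11):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 2))
```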
Cluster evaluation criteria
S_Dbw
How to:
- measures dispersion within clusters and density between clusters
Pro:
- robust against most data defects
Contra:
- cannot be computed for some parametrizations
Cluster evaluation criteria
Silhouette
How to:
- measures how similar each data point is to other data points in the same cluster and in other clusters
Pro:
- common and well-grounded in research
Contra:
- not robust against sub clusters in data
- complex input parameters
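A short sketch with scikit-learn's silhouette_score (toy data and k=3 are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 4)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette over all instances: close to +1 = well separated,
# around 0 = on a cluster border, negative = probably misassigned.
print(silhouette_score(X, labels))
```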
Cluster evaluation criteria
Calinski-Harabasz
How to:
- weighted ratio of the between-cluster sum-of-squares and the within-cluster sum-of-squares
Pro:
- accessible input data
- robust against noise in the data
- weights based on dimensions of the data
Contra:
- misleading for skewed data
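A short sketch with scikit-learn's calinski_harabasz_score (toy data and k=3 are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.rand(200, 4)  # toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Higher scores indicate denser, better-separated clusters.
print(calinski_harabasz_score(X, labels))
```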
Cluster evaluation criteria
Dunn
How to:
- measures the inter-cluster distance (should be large) and the intra-cluster distance (should be small)
Pro:
- measure for separation and scattering of data
Contra:
- computationally inefficient for large and noisy data sets
- not robust against skewed data distributions
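The Dunn index has no scikit-learn implementation; a naive sketch assuming Euclidean distance (its O(n²) pairwise distances also illustrate why the index is costly on large data sets; the placeholder labels are an assumption):

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Smallest distance between instances of different clusters (separation, large is good).
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters) for b in clusters[i + 1:])
    # Largest distance within any single cluster (diameter, small is good).
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_sep / max_diam

X = np.random.rand(100, 2)
labels = np.random.randint(0, 3, size=100)  # placeholder labels for illustration
print(dunn_index(X, labels))
```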
Clustering: before and after
Pre-process data:
- select relevant attributes
- test for dependencies
- normalize and weight attribute values
Cluster:
- try out various numbers of clusters
- consider different initial seeds
- consider different distance measures and algorithms
Process results:
- label data
- visualize clusters
- predict cluster adherence
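A minimal end-to-end sketch of this pipeline; the toy data, the standard scaler and k=3 are assumptions standing in for the choices above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.random.rand(200, 4)  # toy data standing in for selected, relevant attributes

# Pre-process: normalize attribute values so no attribute dominates the distance.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Cluster: in practice, try several values of k, seeds and distance measures.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Process results: label the data and predict cluster adherence of new instances.
labels = model.labels_
new_instance = np.random.rand(1, 4)
print(model.predict(scaler.transform(new_instance)))
```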
Association rule learning
- derives rules between instance attributes based on sufficient observations, e.g. for market basket analyses
- no a priori assumptions needed
- use rules that have the best evaluation: confidence, support, lift, expected confidence
Association rule
Definition
A->B includes a set of attribute values A in the antecedent and a set of attribute values B in the consequent, where A and B are disjoint
- Association rules are if-then conditions that are probability- rather than logic-based
Support
Definition
Share of instances in the data set that fulfill the rule
- measures the relevance of a rule
Confidence
Definition
Share of instances that include attribute values A and B, relative to the set of instances that fulfill the antecedent A
- measures the accuracy of a rule
Lift
Definition
Ratio of the observed support to that expected if A and B were independent
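A plain-Python sketch of the three measures; the baskets are made-up data and {milk} -> {bread} is a hypothetical rule chosen for illustration:

```python
# Toy market baskets illustrating support, confidence and lift
# for the rule {milk} -> {bread}.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter"},
]

A, B = {"milk"}, {"bread"}
n = len(baskets)

support_A = sum(A <= b for b in baskets) / n          # share containing the antecedent
support_B = sum(B <= b for b in baskets) / n          # share containing the consequent
support_AB = sum((A | B) <= b for b in baskets) / n   # share fulfilling the whole rule

confidence = support_AB / support_A                   # accuracy of the rule
lift = support_AB / (support_A * support_B)           # observed vs. independence

print(f"support={support_AB:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```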
Conclusion: Association Rule Learning
Rules can describe combinations of attribute values:
- browsed websites
- bought items
- typed keywords
Simple, easy-to-understand rules
Challenge: high number of potential rules
Solution: filtering, by
- support and confidence
- templates (“rule has to include [milk]”)
- visualization: support diagrams, rule graphs