W04 Unsupervised Learning Flashcards

1
Q

Applications of Data Mining

A
Product Placement
Fraud Detection
Traffic Forecasts
Web Optimization
Customer Relationship Management
2
Q

Data Mining Methods

A

pre-processing, description and reduction

visualization, correlation, association rule learning to show relationships

explanation through regression

classification, discriminant analysis

anomaly recognition

prognosis

segmentation through cluster analysis

3
Q

Terminology of a Dataset table

A

Attribute (columns)

Instance (rows)

4
Q

Evaluating Data Mining

Split?

A
Training Set
-generate model from the data
Validation Set
-validate and improve the model
Test Set
-test applicability and performance of model
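
A minimal sketch of such a split; the 60/20/20 ratio and the use of scikit-learn are illustrative assumptions, not part of the card:

```python
# Sketch: split instances into training, validation and test sets.
# The 60/20/20 ratio and scikit-learn usage are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 4)                      # 100 instances, 4 attributes (toy data)

# split off the test set first, then carve the validation set out of the rest
X_rest, X_test = train_test_split(X, test_size=0.2, random_state=0)
X_train, X_val = train_test_split(X_rest, test_size=0.25, random_state=0)  # 0.25 * 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))    # 60 20 20
```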
5
Q

Supervised Learning

A

For all sets, the solution (target value) is known a priori

6
Q

Unsupervised Learning

A

No structural solution is known a priori (clustering, association rules)

7
Q

Clustering
what?
Conditions?
Applications?

A

unsupervised segmentation of an instance set based on attributes

instances belong to different segments
described by multi-dimensional attributes
attribute values can be quantified

Instance segments are not known a priori but possibility of segmentation is expected

8
Q

Kinds of clusters

A

unique (one instance, one cluster)

overlapping (one instance, multiple clusters)

probabilistic (one instance, a membership probability for each cluster)
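
A minimal sketch of probabilistic cluster assignment, assuming scikit-learn's GaussianMixture; the toy data are made up:

```python
# Sketch: probabilistic clusters - each instance gets one probability per cluster.
# Assumes scikit-learn's GaussianMixture; the toy data are made up.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.8]])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict_proba(X))     # rows sum to 1: cluster membership probabilities
```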

9
Q

Hierarchical Clustering

A

agglomerative: from n clusters to one cluster
divisive: from one cluster to n clusters
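
A minimal agglomerative sketch using SciPy; the toy data, the Ward linkage and the cut into 3 clusters are illustrative assumptions:

```python
# Sketch: agglomerative hierarchical clustering with SciPy.
# Toy data, Ward linkage and the cut into 3 clusters are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])

Z = linkage(X, method="ward")                    # merge from n singleton clusters upwards
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
print(labels)
```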

10
Q

Offline vs Online Clustering

A

Offline: complete set of instances known prior to clustering

Online: instances are added iteratively

  • growing datasets
  • more efficient
  • streaming clustering processes only latest instances
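
A minimal sketch of online clustering on a growing dataset, assuming scikit-learn's MiniBatchKMeans; the simulated stream and parameters are illustrative:

```python
# Sketch: online clustering of a growing dataset with MiniBatchKMeans.
# The simulated stream of batches and all parameters are illustrative assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

model = MiniBatchKMeans(n_clusters=3, random_state=0)

for _ in range(10):                  # instances arrive batch by batch
    batch = np.random.rand(50, 2)    # 50 new instances, 2 attributes
    model.partial_fit(batch)         # update the centroids incrementally

print(model.cluster_centers_)
```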
11
Q

k-Means Clustering Algorithm

A
  • unique; flat; offline
  • k-means++: non-random initialization
  • how to choose the best k?
  • variations: different distance measures

1 choose k - the number of desired clusters

2 randomly select k cluster centroids

3 assign all instances to nearest cluster centroid

4 calculate means for attributes of instances in one cluster

5 set means as new centroids

6 assign again

if steps 6 and 3 yield the same assignment: convergence! otherwise repeat from step 4
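
A NumPy sketch that follows steps 1-6 above; the toy data and k = 2 are illustrative assumptions:

```python
# Sketch: k-means following steps 1-6 above, implemented with NumPy.
# The toy data and k = 2 are illustrative assumptions.
import numpy as np

def k_means(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 2: random initial centroids
    labels = None
    while True:
        # steps 3 / 6: assign every instance to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return labels, centroids                             # same assignment twice: convergence
        labels = new_labels
        # steps 4 / 5: the attribute means of each cluster become the new centroids
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids = k_means(X, k=2)
print(labels, centroids)
```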

12
Q

Cluster Evaluation Criteria

A

Elbow (compute the gain in explained variance per increase of k; plot and look for the elbow)

Dunn index (inter-cluster distance vs. intra-cluster distance)
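
A sketch of the elbow criterion, assuming scikit-learn's KMeans and its inertia_ attribute (within-cluster sum of squares) as the variance measure; the data and the range of k are made up:

```python
# Sketch: elbow criterion - plot within-cluster SSE against k and look for the bend.
# Data and the range of k are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)          # toy data

ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.show()
```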

13
Q

Before and After Clustering

A
before:
pre-process the data
  • select relevant attributes
  • check for dependencies
  • normalize and weight values

cluster:

  • various number of clusters
  • different initial seeds
  • different distance measures and algorithms

process results:

  • label data
  • visualize clusters
  • predict cluster membership of new instances
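
A minimal end-to-end sketch of this workflow (normalize, cluster, label), assuming scikit-learn and pandas; the attributes and k = 2 are illustrative:

```python
# Sketch: pre-process (normalize), cluster, then label the data.
# The attributes, scaler and k = 2 are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.DataFrame({"income": [20, 22, 80, 85, 40],
                   "age":    [25, 27, 50, 52, 35]})

X = StandardScaler().fit_transform(df)                      # normalize attribute values
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

df["cluster"] = model.labels_                               # label each instance with its segment
print(df)
```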
14
Q

Association Rules

A

Derive rules between instance attributes based on sufficient observations, e.g. for market basket analyses

no a priori assumptions needed.
evaluate via 
confidence
support
lift
expected confidence
15
Q

Market Basket Analyses

A

transaction data recorded at checkout

record purchase instance, branch, article, quantity, date

lots of data

which articles are bought together?

e.g. shoes and socks

16
Q

Association Rule

A

Antecedent A leads to consequent B

-based on probability rather than logic

17
Q

Support

A

share of instances in dataset that fulfill the rule

18
Q

Confidence

A

Share of instances that include both A and B, relative to the set of instances that fulfill antecedent A

19
Q

Lift

A

ratio of the observed support to the support expected if A and B were independent
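
A small worked example for support, confidence and lift on made-up transactions, with the rule {shoes} → {socks}:

```python
# Sketch: support, confidence and lift for the rule A -> B on made-up transactions.
transactions = [
    {"shoes", "socks"},
    {"shoes", "socks", "shirt"},
    {"shoes"},
    {"socks"},
    {"shirt"},
]
A, B = {"shoes"}, {"socks"}

n = len(transactions)
support_A  = sum(A <= t for t in transactions) / n        # share containing A
support_B  = sum(B <= t for t in transactions) / n        # share containing B (= expected confidence)
support_AB = sum((A | B) <= t for t in transactions) / n  # support of the rule A -> B

confidence = support_AB / support_A                       # share of A-instances that also contain B
lift       = confidence / support_B                       # observed vs. expected confidence

print(support_AB, confidence, lift)                       # 0.4  0.666...  1.111...
```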

20
Q

Association Rule Learning
what?
challenge?
solution?

A

associate attribute values with each other through simple rules

many rules possible

filter by support and confidence, and visualize
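
A sketch of mining and filtering such rules by minimum support and confidence, assuming the mlxtend library; the transactions and thresholds are illustrative:

```python
# Sketch: mine association rules and filter them by support and confidence.
# Assumes the mlxtend library; transactions and thresholds are illustrative.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["shoes", "socks"], ["shoes", "socks", "shirt"],
                ["shoes"], ["socks"], ["shirt"]]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

frequent = apriori(onehot, min_support=0.3, use_colnames=True)                # filter by support
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)  # filter by confidence
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```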